Why IT goes wrong

This is the third in a short series of commentaries that look at when, where, and why IT fails in its support for business operations and objectives.

In the first commentary we looked at thirteen incidents that severely disrupted organisations, their customers, and the public. In the second commentary we identified (from afar) the processes that were most likely to have broken down to cause the incidents.

In this commentary we ask why those processes failed. We were not close to most of the incidents, so we need to speculate a little on the root causes. However, we have been around this IT industry for a while and have seen the same mistakes being made and the same processes breaking down year after year, decade after decade. In most cases the technologies are irrelevant – we have seen the same problems occur in the mainframe world, the minicomputer world, the client-server world, and the dot com world.

Before we discuss some of the root causes, let’s recap the examples:

• A pharmaceutical company wrote off nearly $17.2 million in missing funds due to IT "discrepancies". A short time later the CEO and CFO were replaced. The pharmaceutical company’s accounting and IT auditors should have been able to pick up IT discrepancies – whether they were caused by a project or “evolved” during day to day operations. The process failure? Risk management.
• A drug company was forced into bankruptcy by a series of operational and project blunders. The drug company’s business strategists and CIO were responsible for taking high risks in a fragile business environment. The process failure? Business/IT strategy, enterprise risk management.
• A late, over budget system was introduced to “streamline and simplify” importers’ dealings with a government agency. Within days there was a severe backlog of containers in seaports during a critical period for importers in the lead-up to a holiday season. The imports system evolved into an IT implementation for the government agency – the agency’s clients (the importers) appear to have been consulted insincerely and their very real concerns ignored. The process failure? Business/IT planning, project management, stakeholder engagement.
• A telco spent $500m+ on billing software. It is still not right, and the same telco has announced a replacement programme. The software made the front pages when it sent erroneous final notices to the relatives of long-dead customers. Perhaps poor requirements management but certainly poor vendor management. The process failure? Project management.
• A large bank was pushed by a software vendor into early adoption of an untested new version; the software took out the automated teller machines, then allowed cardholders to withdraw cash without debiting their accounts. Other banks chose to test the software more thoroughly and detected the bug. The process failure? IT architecture, project management, capability management, and operations.
• A government agency applied its annual regulatory changes to old and unstable core systems. The systems first overcharged members of the public, then made too many refunds, then overcharged those who received incorrect refunds, and finally got it right on the fourth attempt. The government agency’s regulatory changes were applied at short notice to applications that were known to be old, poorly maintained, and fragile. Senior business executives had blocked funding requests for major application upgrades over the previous ten years, but still insisted on very short lead-time changes. The process failure? Business strategy, business/IT planning, capability management.
• A major new stock exchange system was 11 years late and 13,200% over budget. The process failure? Probably in all areas, but the responsibility must lie with the Board of Directors for allowing this debacle to drag on for so long.
• A new emergency services system was introduced on time and on budget, but the system and its backup locked. The emergency service (in one of the world’s largest cities) reverted to a manual system that restricted the ability of the service to respond quickly and as a result placed lives at risk. (This debacle was repeated by an emergency service in another large city on the other side of the world only months later.) The process failure? IT architecture, project management, capability management, IT operations, risk management.
• An IT infrastructure upgrade increased in cost by 150% (from about $US 2Bn to about $US 5Bn) only 18 months into a ten year project. The process failure? This was an extended comedy of errors, with apparently little business leadership, little risk management, little process control, technology-led IT planning, unfettered, demand-driven requirements, unskilled negotiators, uncontrolled vendors, and no escape clauses.
• The opening of a new airport was delayed 16 months by late delivery of revolutionary software. As a result the airport planners’ bond rating was demoted to junk and the organisation lost $1.1 million a day in interest and operating costs. The airport software was revolutionary, but posed a high business risk in the circumstances. The process failure? The initial business direction does not appear to have been constrained by intelligent risk management.
• Most of the desktop computers in a government welfare agency were paralysed for four days when a failed operating system upgrade took them offline. The outage, covering 75 percent to 80 percent of the agency’s 80,000 PCs, was one of the largest in the country’s history. The outage disconnected staff e-mail, benefits processing, and connectivity to critical information and systems. The welfare agency’s desktop computer failure was caused by an unintended release of unready infrastructure software. The process failure? This occurred in the “escrow” capability management zone that exists between the project and IT operations. It was a defect in the quality process (in an organisation that had purportedly achieved some level of quality certification).
• In the same country a “computer crash” in another organisation prevented pensioners from collecting benefits payments. The computer crash that prevented benefit payments was caused by distribution of software onto a platform that had not been updated to the minimum platform requirements. The process failure? Project management, capability (configuration) management, and IT operations.
• In yet another organisation in the same country a call centre and its systems for processing applications for welfare payments ran so slowly that “up to two thirds” of callers (in at least one region) were unable to get through, and there is evidence that once through, payments took up to six weeks after applications were lodged. (Presumably, if people are needy they actually do need the payments as soon as possible!) The slow call centres were caused by the same problem – updated software that had not been fully tested on all the configurations that were in use across an agency with hundreds of branches and call centres. The process failure? As above, project management, capability (configuration) management, and IT operations.

If you look carefully at these incidents you will notice that the technology components themselves (the applications, networks and infrastructure) were relatively reliable. The problems in most of these very high profile cases occurred in the governance, risk management, and IT management processes. Even where the problem surfaced in technology failure, the management processes can be seen clearly as the root causes.

Why is it so?

Surely after 40 or so years of theories around general management and IT management, countless management fads, several generations of automated management tools (especially those classic hyperboles – so common in our industry – management information and business intelligence) we shouldn’t be making the same mistakes that we did 30 and 40 years ago!

For the most part, the technologies – and technologists – themselves are good quality and can produce quality results when applied to a clearly defined problem.

What’s the problem, then?

We believe the answer lies in three common behaviours that in themselves have nothing to do with management, or technology, or management of technology:

• Laziness,
• Arrogance, and
• Greed

Before you say “yah, just another nutcase having a rant and wasting my time”, have a look at what happened in those 13 disasters:

• A pharmaceutical company wrote off nearly $17.2 million in missing funds due to IT "discrepancies". A short time later the CEO and CFO were replaced. The pharmaceutical company’s accounting and IT auditors should have been able to pick up IT discrepancies – whether they were caused by a project or “evolved” during day to day operations. The process failure? Risk management. Were the accountants and auditors too lazy (or too interested in fees) to inspect the systems properly?
• A drug company was forced into bankruptcy by a series of operational and project blunders. The drug company’s business strategists and CIO were responsible for taking high risks in a fragile business environment. The process failure? Business/IT strategy, enterprise risk management. What greedy, arrogant strategist embarked on this high risk adventure? Was the CIO too lazy or too greedy for options to stop the projects?
• A late, over budget system was introduced to “streamline and simplify” importers’ dealings with a government agency. Within days there was a severe backlog of containers in seaports during a critical period for importers in the lead-up to a holiday season. The imports system evolved into an IT implementation for the government agency – the agency’s clients (the importers) appear to have been consulted insincerely and their very real concerns ignored. The process failure? Business/IT planning, project management, stakeholder engagement. Wow – what were these people on? Too lazy or too arrogant to consult their stakeholders? Too arrogant to consider the impacts of a half-baked system? Too interested in protecting their public sector pension schemes?
• A telco spent $500m+ on billing software. It is still not right, and the same telco has announced a replacement programme. The software made the front pages when it sent erroneous final notices to the relatives of long-dead customers. Perhaps poor requirements management but certainly poor vendor management. The process failure? Project management. In this case there were vendors involved. Were they more interested in taking orders for more billable time than looking critically at what was being asked? Where were the senior Telco managers when the project went many times over budget? Too lazy to get involved? More interested in protecting their pensions?
• A large bank was pushed by a software vendor into early adoption of an untested new version; the software took out the automated teller machines, then allowed cardholders to withdraw cash without debiting their accounts. Other banks chose to test the software more thoroughly and detected the bug. The process failure? IT architecture, project management, capability management, and operations. We know this was laziness on the part of the systems programmers, and greed on the part of the vendor’s systems engineers (who were paid bonuses for selling the upgrade – but only after it was installed, regardless of whether it was successful).
• A government agency applied its annual regulatory changes to old and unstable core systems. The systems first overcharged members of the public, then made too many refunds, then overcharged those who received incorrect refunds, and finally got it right on the fourth attempt. The government agency’s regulatory changes were applied at short notice to applications that were known to be old, poorly maintained, and fragile. Senior business executives had blocked funding requests for major application upgrades over the previous ten years, but still insisted on very short lead-time changes. The process failure? Business strategy, business/IT planning, capability management. This was arrogance on the part of the IT practitioners who had done it all before, and who ignored warning signs that were apparent the previous year. They were also too lazy (and perhaps too interested in their government pensions) to inform the head of the agency – in plain language – what would happen if the core systems were not refreshed. Instead it was left to external advisors to be the bearers of bad news, and a couple of years later the culprits retired with huge pensions after gaining promotions.
• A major new stock exchange system was 11 years late and 13,200% over budget. The process failure? Probably in all areas, but the responsibility must lie with the Board of Directors for allowing this debacle to drag on for so long. If we focus on the Board (the rot probably extended throughout the organisation) they were probably too lazy to get involved in a project, and too interested in their emoluments to ask for information (in most western jurisdictions, once a director – whether executive director or non-executive director – is informed, he/she is legally required to act on the information).
• A new emergency services system was introduced on time and on budget, but the system and its backup locked. The emergency service (in one of the world’s largest cities) reverted to a manual system that restricted the ability of the service to respond quickly and as a result placed lives at risk. (This debacle was repeated by an emergency service in another large city on the other side of the world only months later.) The process failure? IT architecture, project management, capability management, IT operations, risk management. We know about only one of these. The emergency services staff were not prepared to make decisions – it was (and could still be) part of the culture to avoid decisions and thereby avoid accountability if something bad happened. Laziness? Greed?
• An IT infrastructure upgrade increased in cost by 150% (from about $US 2Bn to about $US 5Bn) only 18 months into a ten year project. The process failure? This was an extended comedy of errors, with apparently little business leadership, little risk management, little process control, technology-led IT planning, unfettered, demand-driven requirements, unskilled negotiators, uncontrolled vendors, and no escape clauses. We’re not sure about this one, but certainly there were signs that decisions were avoided (thereby avoiding any threat to those pensions), and there were indications of vendor greed – jacking up prices on commodity components.
• The opening of a new airport was delayed 16 months by late delivery of revolutionary software. As a result the airport planners’ bond rating was demoted to junk and the organisation lost $1.1 million a day in interest and operating costs. The airport software was revolutionary, but posed a high business risk in the circumstances. The process failure? The initial business direction does not appear to have been constrained by intelligent risk management. Again, we’re not sure about this one. Perhaps arrogance (we’re world class civil engineers and we could write that bit of software blindfolded so what’s the problem?) and greed (look, we’re really jolly good at writing experimental software so give us that contract and we’re sure we’ll find a way to get it written by the time you’re ready with the rest of the airport. When was that? October? This year?).
• Most of the desktop computers in a government welfare agency were paralysed for four days when a failed operating system upgrade took them offline. The outage, covering 75 percent to 80 percent of the agency’s 80,000 PCs, was one of the largest in the country’s history. The outage disconnected staff e-mail, benefits processing, and connectivity to critical information and systems. The welfare agency’s desktop computer failure was caused by an unintended release of unready infrastructure software. The process failure? This occurred in the “escrow” capability management zone that exists between the project and IT operations. It was a defect in the quality process (in an organisation that had purportedly achieved some level of quality certification). Laziness.
• In the same country a “computer crash” in another organisation prevented pensioners from collecting benefits payments. The computer crash that prevented benefit payments was caused by distribution of software onto a platform that had not been updated to the minimum platform requirements. The process failure? Project management, capability (configuration) management, and IT operations. Laziness. Where was the configuration management? Where were the stress tests for the new software on the old configuration? Arrogance. “Those provincial hicks will wear what we give them. What’s the worst that can happen?”
• In yet another organisation in the same country a call centre and its systems for processing applications for welfare payments ran so slowly that “up to two thirds” of callers (in at least one region) were unable to get through, and there is evidence that once through, payments took up to six weeks after applications were lodged. (Presumably, if people are needy they actually do need the payments as soon as possible!) The slow call centres were caused by the same problem – updated software that had not been fully tested on all the configurations that were in use across an agency with hundreds of branches and call centres. The process failure? As above, project management, capability (configuration) management, and IT operations. Laziness and greed – as above.

If you think your organisation is an exception, think again – this time looking through a “behavioural” lens. We would be very surprised if there is any organisation anywhere in the world that is not led by at least some people who are lazy, arrogant or greedy enough to cause serious problems.