Smart Technology Dumb Mistakes

Technology has gotten very complicated, and as complexity increases, so do high-profile mistakes. Being in technology, I always feel a trace of “there but for the grace of God go I” when a major blunder hits the news media. Not that I’ve ‘gone there’, but I understand how it happens. What would you, as an IT manager, say to your boss and coworkers if you had worked at Knight Capital and, thanks to a one-line coding error, your programming department was directly responsible for losing $440,000,000 in 45 minutes?

The company is left in a shambles, fortunes are lost, and there is simply no excuse. I guess that’s life in the ‘high frequency trading’ lane.

Further proving the adage that “to err is human, but to really screw up you need a computer” was BATS, an electronic stock exchange that botched its own IPO, and of course NASDAQ blowing the Facebook IPO. I notice a trend: these programming and hardware glitches are increasing in frequency.

Most of us don’t have to worry about making mistakes that will land us on the front page of the Wall Street Journal; we simply don’t work at places with the potential to lose such colossal sums of money. It’s like BP spilling all that oil into the Gulf of Mexico. You can’t make a mistake of that magnitude without first having the ability to drill for oil miles under the ocean. I’m not going to create an environmental disaster like that working for a small consulting firm or a midsized publisher, because there’s no oil in sight. However, big mistakes can happen even at medium-sized companies. They just aren’t as high profile.

So how do you prevent mistakes in IT? There isn’t one answer, and unfortunately there isn’t always a foolproof one either. Sometimes things happen beyond your control, but the goal is to minimize the potential for that. Think of driving a car. Even the most careful person makes a mistake every five years or so, and it’s just lucky that nothing happens…that stop sign you didn’t see, or the crosswalk you missed while fiddling with the radio. 99.9% of the time we get away with it, but not everyone gets away with it every time.

First, always make sure you have a working backup. It’s amazing how often backups are run faithfully, only to turn out not to have been working properly when they are finally needed. Log files don’t tell the whole story. You must test your backups regularly by doing actual restores. Will this solve every problem? No. Knight Capital had a working backup, but a backup isn’t a time machine. They lost all that money in real time, so a backup was irrelevant. But if your company’s email crashes, a backup will save your job…as long as the backup is working.
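To make “test your backups by doing restores” concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that the backup is a tar archive of a single directory; the function names make_backup and verify_restore are made up for this example and do not come from any backup product. The idea is simply to restore into a scratch directory and compare checksums, rather than trusting the backup job’s log.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def file_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            name = path.relative_to(root).as_posix()
            digests[name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def make_backup(source_dir: Path, archive: Path) -> None:
    """Hypothetical backup step: write every file under source_dir to a tar.gz."""
    with tarfile.open(archive, "w:gz") as tar:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                tar.add(path, arcname=path.relative_to(source_dir).as_posix())


def verify_restore(source_dir: Path, archive: Path) -> list[str]:
    """Restore the archive into a scratch directory and return the names of
    files that are missing from the restore, extra in it, or different."""
    expected = file_digests(source_dir)
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive) as tar:
            tar.extractall(scratch)  # the actual restore
        restored = file_digests(Path(scratch))
    all_names = expected.keys() | restored.keys()
    return sorted(n for n in all_names if expected.get(n) != restored.get(n))
```

An empty list means the restore matched. Comparing against the live directory only makes sense right after the backup runs; for older backups you would more likely compare against a list of checksums saved at backup time.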

Going to the cloud is a possible solution; up there the backups work. In fact, they work a little too well, which is the problem with keeping your data “in the cloud”. Once it’s up there, you should assume it’s up there forever. Whatever is saved to the cloud may remain available to any sort of forensic study. That email you sent three years ago, which reads differently than what you meant? It’s still up there, and will show up in court if it’s relevant. As a practical matter, you cannot count on getting rid of data stored in the cloud any more than you can permanently remove something from the Internet once it’s gone viral.

In midsize firms a lot of the problems occur in the software and aren’t noticed until much of the damage is done. A typical example: orders come in off the Internet but aren’t booked properly into the system. By the time the problem bubbles up from customer service, there are hundreds of unfulfilled orders, resulting in angry customers and stock outages. You prevent this sort of situation in two ways:

First, there should always be balancing in place, e.g. orders entered = orders processed + orders backordered (or something like that, accounting for holds, deletes, etc.), and then that has to be balanced against other systems. Exception reports are needed to ensure your systems tie together.

Second, and perhaps more importantly, you need to do what companies do all too infrequently: make your data available to your managers, who can spot issues. Great reporting will catch a lot of issues before they bloom into major problems. All too often companies use their ERP and order entry systems just to process orders, monitor stock levels, and handle similar daily tasks. That’s not going to catch problems. Better to put the effort into generating on-demand reports for managers, so they can follow what’s going on. Who is most likely to catch an issue like orders for a marketing campaign not being processed? The marketing manager who dreamed up the program, that’s who. The order entry person won’t notice, and in fact might have the orders piled up on a desk while waiting to find out how to process them. Prevent those sorts of issues by making information available to responsible managers.
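The balancing idea from the first point (orders entered = orders processed + orders backordered, plus holds and deletes) can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the status names, the balance_report function, and the idea that you can pull a list of order IDs from the web store and a status for each order from the order system. A real ERP will have its own statuses and its own way of getting at the data.

```python
from collections import Counter

# Hypothetical statuses that count as "accounted for"; adjust to your system.
ACCOUNTED_FOR = {"processed", "backordered", "on_hold", "deleted"}


def balance_report(entered_ids, status_by_id):
    """Tie orders entered upstream (e.g. the web store) to the order system.

    entered_ids  -- order IDs the upstream system says it took
    status_by_id -- {order ID: status} as known to the order system
    """
    entered = set(entered_ids)
    known = entered & set(status_by_id)

    # Orders the web store took that never arrived in the order system.
    missing = sorted(entered - set(status_by_id))
    # Orders that arrived but are sitting in a status nobody is tracking.
    unaccounted = sorted(
        oid for oid in known if status_by_id[oid] not in ACCOUNTED_FOR
    )
    counts = Counter(status_by_id[oid] for oid in known)

    return {
        "balanced": not missing and not unaccounted,
        "missing": missing,
        "unaccounted": unaccounted,
        "counts": dict(counts),
    }


if __name__ == "__main__":
    web_orders = ["A1", "A2", "A3", "A4"]
    order_system = {"A1": "processed", "A2": "backordered", "A4": "received"}
    print(balance_report(web_orders, order_system))
```

The “missing” and “unaccounted” lists are the exception report. Run daily and put in front of the manager who owns the campaign or product line, a short list like this surfaces the problem at a handful of orders instead of hundreds.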

So what about Knight Capital? How could that have been prevented? I don’t know exactly what happened, but my educated guess is that they did in fact test the new software extensively but installed it improperly, leaving the much-lamented “one line of code” from the old system in the live environment. The only hope of catching this type of error, other than not making it in the first place, is to run parallel, meaning the old and new systems process the same inputs side by side and their results are compared before the new one takes over. That is both expensive and time consuming, so it is generally skipped. Of course, with hindsight it should have been done, but it’s the old problem pointed out in N. N. Taleb’s book, The Black Swan: the “heroes” are the people who clean up after the mess has been made. The other heroes, those who prevent such occurrences, are often vilified rather than honored, because the problem they prevented never happens, so all the time and money spent on prevention looks wasted. This is why IT is sometimes a thankless job…but if you’re in a situation where nearly half a billion dollars can be lost in less than an hour, don’t allow yourself to be rushed, and do run parallel.
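For readers who haven’t run parallel before, here is a toy sketch of the idea in Python. It is not a description of anything Knight Capital did or could have done; the two pricing functions and the run_parallel helper are invented for this example. The point is only the shape of the exercise: feed identical inputs to the old and new versions and report every case where they disagree.

```python
def run_parallel(inputs, old_system, new_system):
    """Feed the same inputs to both systems and collect every disagreement
    as an (input, old output, new output) tuple."""
    mismatches = []
    for item in inputs:
        old_out = old_system(item)
        new_out = new_system(item)
        if old_out != new_out:
            mismatches.append((item, old_out, new_out))
    return mismatches


# Hypothetical "old" rule: 10% discount on orders of 10 units or more.
def old_price_cents(quantity):
    total = quantity * 100
    if quantity >= 10:
        total = total * 90 // 100
    return total


# Hypothetical "new" version with a one-character slip: > instead of >=.
def new_price_cents(quantity):
    total = quantity * 100
    if quantity > 10:
        total = total * 90 // 100
    return total


if __name__ == "__main__":
    for qty, old, new in run_parallel(range(1, 13), old_price_cents, new_price_cents):
        print(f"quantity {qty}: old system says {old}, new system says {new}")
```

In this toy case the two versions agree on every quantity except 10, which is exactly the kind of edge a test plan can miss and a side-by-side run will catch. In a real cutover the inputs would be live or replayed transactions, and the comparison would run for days or weeks, which is where the expense comes from.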

IT is becoming more and more complex, but keep it as simple as you can—if no one knows what the system is doing, it’s a lot harder to catch problems. Avoid the trap of simply installing software and figuring it knows best. You, as managers and employees, need to know what software is doing and how it is working. Otherwise, you won’t know if it’s working properly or not until it’s too late.