In my last article, I described how, in the middle of our annual shareholders meeting, our corporate email was hit by the Sobig virus, leaving me in an embarrassing and dangerous situation.
In dealing with the virus that infested our email and other applications, I learned that it's good to be prepared, but you can't be prepared for every possible thing. Stuff happens. And when stuff happens, the process you use to respond is as important as the actions you take. It's critical to have a simple process to guide those actions. When you need to act quickly, complex procedures inevitably break down, and a broken procedure in a high-stress situation is like losing control of a race car as it rounds a corner at 200 miles an hour.
The simple process I used has three steps, because in any situation there are only three things you ever need to do. First, you define what is happening and what you want to do about it (your goal). Then you design a way to get from where you are to where you need to be (your action plan). Finally, you execute the plan and build or do whatever is needed to accomplish the goal.
Afterward, when things settle down, there is a fourth step: review what happened and see what worked and what didn't. Then you can capture what you learned and apply it the next time something similar happens.
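If it helps to see the process spelled out, here is a minimal sketch of that loop in Python. Every function and name in it is a hypothetical placeholder; it illustrates the shape of the process, not anything we actually ran.

```python
# A minimal sketch of the define/design/build loop, plus the review step.
# Everything here is an illustrative placeholder, not actual incident tooling.

def define_goal(situation):
    """Step 1: define what is happening and what 'done' looks like."""
    return lambda s: s.get("back_to_normal", False)

def design_plan(situation):
    """Step 2: design a way to get from where you are to the goal."""
    return ["contain the virus", "clean the servers", "restore service"]

def execute_plan(plan, situation):
    """Step 3: build or do whatever the plan calls for."""
    print("Executing:", "; ".join(plan))
    situation["back_to_normal"] = True  # placeholder outcome
    return situation

def review(situation):
    """Step 4, afterward: record what worked and what didn't."""
    print("Lessons learned captured for next time.")

situation = {"back_to_normal": False}
goal_met = define_goal(situation)
while not goal_met(situation):  # keep redesigning and executing until the goal is met
    plan = design_plan(situation)
    situation = execute_plan(plan, situation)
review(situation)
```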
The Sobig.F virus proved to be a tricky devil. Several times that first day we swept and scanned all our servers and it seemed we had detected and erased every copy of the virus, only to have an undetected copy lurking in some obscure directory come to life and start replicating wildly again. By the end of the first day we realized there was no quick fix. We had to shut down our email system entirely and disable access to certain server communication ports and URLs.
We scanned and cleaned each server, and brought the next server back up only after we knew the previous one was entirely clean. We had to do this for all the servers running internal business applications as well, and that took much of the second day. All the while we communicated openly with our business units, customers, and suppliers to let them know what was happening and to help them when they found that some of their servers had been infected by email we sent out.
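That sequencing rule was the heart of the recovery: never bring a machine back online until the one before it is verified clean. Here is a hedged sketch of the rule in Python; the server names and helper functions are hypothetical stand-ins for the antivirus and operations tools we actually used.

```python
# Illustrative sketch of the one-server-at-a-time recovery rule.
# Server names and helpers are hypothetical stand-ins, not our real tooling.

def scan_and_clean(server: str) -> bool:
    """Run a full antivirus sweep; return True only when no infection remains."""
    print(f"Scanning and cleaning {server}...")
    return True  # placeholder result from the hypothetical scanner

def bring_online(server: str) -> None:
    print(f"{server} verified clean; bringing it back online.")

servers = ["mail-01", "app-01", "app-02", "db-01"]  # hypothetical names

for server in servers:
    # Rescan until the server comes up clean: one missed copy of the worm
    # lurking in an obscure directory could reinfect everything restored so far.
    while not scan_and_clean(server):
        pass
    bring_online(server)  # only now does the next server get its turn
```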
As we tackled this problem, we were clear about our goal: Get things back to normal. But we went through the design and build steps several times. Instead of getting stuck on analysis, we tried things and we learned. We communicated with people outside the company and got useful advice. We kept updating our solution design as we progressed. We stayed focused and worked in shifts around the clock. By the middle of the second day we had the situation under control, and by the morning of the third day all systems and email were back up and performing as they should.
After we restored the systems, we also had to restore our reputation with the company and the shareholders. We were able to do that more quickly than I had feared, because I owned up to what was happening right away and didn't waste time on excuses that nobody wanted to hear anyway. We were clear about what we were going to do to fix the problem, and we fixed it over the next two days, just as we said we would. People know the IT world is a wild ride these days. They cut us some slack because it was our first breach of that magnitude, and because we learned from it and put new procedures in place. Nothing like it happened again; a second breach on that scale would have been grounds for my dismissal, and I took the experience seriously.
In our review after things settled down, we identified lessons and put them to use going forward. Most of them seemed obvious in hindsight, but we hadn't seen them, or heeded them, before the breach. We learned to isolate our key internal application systems and their databases from easy access by email or other internal systems. We started using temporary data files or central data warehouses to move data between systems, and we scanned the heck out of that data before letting it through. We also tightened our email protocols and started blocking suspicious attachments.
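To make that last point concrete, here is a hedged sketch of the kind of attachment filter we mean. The extension list and function are illustrative rather than our production rule set, though Sobig.F really did spread through .pif and .scr attachments.

```python
# Illustrative attachment filter of the kind we added to our email protocols.
# The block list is an example, not our production rule set. Sobig.F spread
# via .pif and .scr attachments with names like "your_details.pif".

BLOCKED_EXTENSIONS = {".pif", ".scr", ".exe", ".bat", ".vbs", ".cmd"}

def is_suspicious(filename: str) -> bool:
    """Return True if the attachment's extension is on the block list."""
    return any(filename.lower().endswith(ext) for ext in BLOCKED_EXTENSIONS)

# Example usage against typical Sobig.F payload names:
for attachment in ["report.doc", "your_details.pif", "wicked_scr.scr"]:
    action = "BLOCK" if is_suspicious(attachment) else "allow"
    print(f"{attachment}: {action}")
```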
Yet after all was said and done, we knew we could never put enough technology and procedures in place to cover all potential threats. It was the power of a simple, effective, problem-solving process that everyone understood and everyone used that got us through the crisis. Knowing the drill, keeping it current, and doing it well when it counts is the ultimate weapon against the unexpected stuff that can happen at any moment in IT operations.