Sidney Hui, Assistant General Manager, Group IT at HKR International
Worse Could go Worst
The cornerstone of IT services is availability. In many industries today, IT systems and applications must run non-stop around the clock. A serious breakdown does not just cost revenue and damage the brand's image; on a nationwide scale it can wreak havoc on the general public. The Delta Air Lines data center power outage of August 8, 2016 is a recent example: over 650 flights were canceled and thousands more were delayed. That was not the end of the chaos. The outage had a ripple effect on flight schedules in the following days. Even after the systems were brought back online, it took a long time to return to normal because the backlog of stranded passengers had to be re-booked onto other flights. Other airlines carrying connecting passengers from Delta were also affected. In a world more connected than ever, the consequences of such catastrophes can be staggering and far-reaching.
Plan for the Unplanned
In view of these risks, organizations have invested considerably in redundant hardware, software and infrastructure to protect against such failures. CIOs certainly understand the importance of ready disaster recovery facilities, and vendors have a large arsenal of products to sell. ‘Recovery as a Service’ (RaaS) is now available as a cloud service. The market abounds with options that cater to various disaster scenarios and different levels of risk mitigation. Business Continuity Planning (BCP) goes hand in hand with the Disaster Recovery Plan (DRP). Both are equally important when a disaster occurs, with the latter relating more to IT in general. In some organizations, the two terms are used interchangeably.
Enough is not always enough
With all these plans developed and IT redundancy and backup equipment installed (or Recovery as a Service adopted), are organizations immune to such disasters? The answer is clearly “No”. I am sure many of us have heard about, or even experienced, a system crash after which the system could not be brought back online for several hours or even days. Retail banking systems, ATMs, internet banking, stock exchange trading systems, airline reservation systems, cloud email systems and others around the world have all suffered such incidents, affecting large numbers of users. All of these organizations had spent millions on disaster recovery facilities and had a disaster recovery plan in place. Nevertheless, they still fell victim to glitches when executing the plan and failed to bring their systems back to normal.
Drill is not the Real Deal
A well-planned and well-penned disaster recovery plan is no more than a piece of paper if it is not put into practice. IT organizations should perform disaster recovery drills regularly to ensure that the relevant staff are familiar with the procedures and that failover to the redundant hardware and infrastructure is fully tested. Such drills are usually performed outside office and peak hours to avoid unwanted interruptions to the production system. They cost time and money for IT staff and for staff from user departments. Unfortunately, owing to technical and non-technical considerations, in many cases these drills cannot simulate the ‘real’ migration from the production environment to the ‘disaster recovery’ environment. One of the major reasons is that people do not want the production system to be affected by stoppages, scheduled or unscheduled.
Things that could go Wrong will go Wrong
There is no such thing as a bulletproof disaster recovery plan. Many things can happen in a real disaster that were not anticipated when the DR plan was developed. Take Hong Kong as an example. Under fire safety regulations, storing more than 500 liters of diesel or fuel oil in a commercial or industrial building without a license is prohibited, and approved storage facilities must be built according to Fire Services Department requirements. Data centers equipped with a backup generator set therefore have limited fuel storage onsite and have to rely on suppliers to refuel by truck. During a prolonged power outage, especially when a typhoon hits, the supply of diesel fuel could be cut off if the roads leading to the data center are flooded and become inaccessible. Under such circumstances, the limited fuel on site could run dry in less than a day.
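To make the fuel constraint concrete, here is a minimal back-of-the-envelope sketch in Python. The 500-liter figure is the unlicensed storage limit mentioned above; the burn rate of 30 liters per hour is purely a hypothetical assumption for illustration, since actual consumption depends on the generator model and its load.

```python
def generator_runtime_hours(fuel_liters: float, burn_rate_lph: float) -> float:
    """Return how many hours a generator can run on the fuel stored on site."""
    if burn_rate_lph <= 0:
        raise ValueError("burn rate must be positive")
    return fuel_liters / burn_rate_lph


# Unlicensed storage limit cited in the text.
ONSITE_FUEL_LITERS = 500.0
# Hypothetical consumption in liters per hour (an assumption, not a measured value).
ASSUMED_BURN_RATE_LPH = 30.0

hours = generator_runtime_hours(ONSITE_FUEL_LITERS, ASSUMED_BURN_RATE_LPH)
print(f"Estimated runtime without refueling: {hours:.1f} hours")
```

Under these assumed numbers the tank lasts well under 24 hours, which is why a refueling route blocked by flooding deserves a place in the plan.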
Take it to the Limit... One More Time
In conclusion, it is important to bring key representatives from every department together to discuss and develop the disaster recovery plan, instead of leaving the IT department to formulate the processes and procedures alone. The plan has to be reviewed and updated frequently to cope with external as well as internal changes, such as newly installed systems or IT equipment that may cripple the DR plan.
We can’t stop natural disasters from happening. The best we can do is to have a comprehensive DR plan, practice the procedures, and test the plan as often and as thoroughly as we can. That way, when the plan is put into use, the people involved are familiar with all the required steps and can respond swiftly to the emergency.