Planning the impossible
No matter how much planning and training you do, when it comes down to human interaction someone will manage to screw it up. In Singapore, an IBM techie successfully caused an outage at the DBS Group Holdings Bank which lasted from 3am to 10am on Monday July 5.
I’ve implemented several Disaster Recovery Plans over the years, none as significant or as complicated as a bank’s I should add, and one thing I have discovered is that nobody wants to test them. When you eventually talk the board into a test, the IT systems are generally the only things working properly while the staff are in chaos. Their passwords are taped to their monitors, their diaries are in their top drawers with their clients’ contact details, and they have no idea what their appointments are.
To be honest I can understand an error that brings a subsystem down; these things are almost impossible to avoid 100%. But one which starts in a small subsystem, cascades down the line and brings down all the systems sounds like a major design flaw to me. The story as explained here doesn’t say how it happened or why it brought everything crashing down, so it is impossible to speculate much. I’m certain, though, that several key IBM techies are looking at the exact chain of events with instructions to ensure it cannot happen again, and that every other bank on the planet will have instructions from the CEO to check their systems would not end up in the same situation.
Sounds to me like this bank has a few holes in its plans, probably because they concentrated on a key systems failure and the swap to backup systems, and did not perform enough tests around simple subsystem failures because they thought those would be simple to fix without a failover. Events proved otherwise.
‘The more complicated you make the plumbing the easier it is to jam it up’ – Scotty – ST III.