Mainframe resilience
Monday, 23 August 2021
Have you ever been part of a Business Continuity Plan test? If you have, then you know that you tend to end up in a hotel, or some other building, with a variety of other people from an organization and ‘war game’ what would happen in the event of various scenarios. Often, an external company will be invited in to host the sessions and be on the other end of the phone when someone is trying to deal with the press. The day can often be quite fun, sometimes illuminating, and lunch is usually very good!
The big problem is that what can be resolved in the meeting room in a couple of minutes often takes much, much longer in real life. In many scenarios, a building has been burgled, or is full of terrorists, but the mainframe and the other servers are still working. If the communications lines from one site have been cut, most people (as we’ve seen over the past year or so) can work from home. The organization is generally able to continue in business so long as the mainframe is still working. But what happens if it isn’t?
I can remember many years ago, at one site I worked at, putting a tick in the box for backups for a particular application. However, as all the operators knew, the backup tapes were 7-track tapes, and the last 7-track tape drive had been removed some months beforehand. There was no way that anything could be restored. I can also remember driving backup tapes to an offsite backup site at a company on the other side of town. If there had been a disaster out of hours, can you imagine how long it would have taken to get those tapes and restore the data?
Clearly, backup strategies have improved hugely since those days back in the early 1980s. Even so, a lot of emphasis is still being put on the backing up of data, and, all too often, not enough emphasis is put on restoring the data. I’m talking about mainframe resiliency.
Mainframe resiliency is the ability of the mainframe to provide and maintain an acceptable level of service even when things go wrong! Now we know that mainframes don’t crash like they used to in the 1980s, but even so, things can go wrong.
In an ideal world, organizations would take a copy of their data at regular and frequent intervals and restore from the most recent copy in the event of a problem. That would result in only a few minutes of recent changes being lost. It would also create a massive overhead and require a huge amount of storage space. Some companies can afford a hot standby site, which is updated almost as soon as the main site’s data is changed. Should the main site go down, the standby site can take over very quickly and, hopefully, no data is lost.
Other sites take full backups once a week, and incremental backups every evening. That way, it’s possible to restore a file to the state it was in the previous evening. If journaling takes place, there will be a journal file that can be used to bring the restored data forward to just before the failure.
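To make that concrete, here is a minimal Python sketch of how software might work out which backups are needed for a weekly-full-plus-nightly-incremental scheme. Everything in it is an assumption for illustration: the Backup class, the volume labels, and the dates are made up, and it doesn't reflect any particular backup product. The idea is simply to take the most recent full backup before the failure, then every incremental after it, in order. Any journal would then be applied on top of the last backup in the chain.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Backup:
    taken_at: datetime
    kind: str    # "full" or "incremental"
    volume: str  # hypothetical tape or volume label


def restore_chain(backups: List[Backup], failure_time: datetime) -> List[Backup]:
    """Return, in restore order, the backups needed to get close to failure_time."""
    # Only backups taken at or before the failure are of any use.
    usable = sorted(
        (b for b in backups if b.taken_at <= failure_time),
        key=lambda b: b.taken_at,
    )
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup exists before the failure")
    base = fulls[-1]  # most recent full backup
    # Every incremental taken after that full backup, oldest first.
    incrementals = [
        b for b in usable
        if b.kind == "incremental" and b.taken_at > base.taken_at
    ]
    return [base] + incrementals


# Illustrative data only: weekly fulls on Sunday, incrementals each evening.
backups = [
    Backup(datetime(2021, 8, 15, 1, 0), "full", "FULL01"),
    Backup(datetime(2021, 8, 16, 22, 0), "incremental", "INCR01"),
    Backup(datetime(2021, 8, 17, 22, 0), "incremental", "INCR02"),
    Backup(datetime(2021, 8, 22, 1, 0), "full", "FULL02"),
    Backup(datetime(2021, 8, 22, 22, 0), "incremental", "INCR03"),
    Backup(datetime(2021, 8, 23, 22, 0), "incremental", "INCR04"),
]
failure = datetime(2021, 8, 23, 10, 30)

chain = restore_chain(backups, failure)
print([b.volume for b in chain])
# Journal entries, if kept, would be replayed from this time up to the failure.
journal_replay_from = chain[-1].taken_at
```

With this made-up data, the failure happens on Monday morning, so the chain is the Sunday full backup followed by Sunday evening's incremental. Monday evening's incremental doesn't exist yet at the time of the failure and is ignored.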
What I’m illustrating is that a lot of work has gone into getting backups right. What I would also suggest is that not enough attention has been paid to getting the restore part of the operation working as quickly and effectively as possible.
Let’s suppose that one application has somehow had a catastrophic failure. Let’s suppose that the DASD housing the files has died. Where can the recovery files be restored to? Exactly which backup tapes do I need to restore just those files? How quickly can I get hold of them? It’s the orchestration of the recovery operation that needs to take place in software, not in the head of someone who is out of the office that day, or printed on a piece of paper that could be missing from the backup and restore manual.
I wrote recently about Safeguarded Copy on FlashSystem arrays. It creates a security-isolated copy of data that can be used in the event of the original data becoming corrupted. In fact, multiple recovery points can be created, which is great. The question is, how can you quickly decide which recovery point you want to restore? What software is available that would speedily work out which recovery point is the one required and make sure that it is restored? Because, in order to speed up the restoration stage, it needs to be done by software orchestration and not by the trial and error of someone sitting down in front of a screen and seeing which backup is exactly the one they want. I’m not criticizing FlashSystem arrays; I’m just suggesting that the problem with speedy restores is endemic. Everyone worries about backups and they happen all the time. Not enough people are concerned about the restore process because it doesn’t happen (I’m pleased to say!) very often.
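As a rough idea of what that kind of orchestration might look like, here is a small Python sketch that picks a recovery point automatically rather than by trial and error. It is purely illustrative and is not the Safeguarded Copy interface: the RecoveryPoint class, the choose_recovery_point function, and the is_clean validation check are all hypothetical names, and in real life the validation step would be whatever integrity check the site trusts. The logic is to look only at recovery points taken before the corruption is believed to have started, and return the newest one that passes validation.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class RecoveryPoint:
    name: str           # hypothetical label, not a real product identifier
    taken_at: datetime


def choose_recovery_point(
    points: Iterable[RecoveryPoint],
    corrupted_from: datetime,
    is_clean: Callable[[RecoveryPoint], bool],
) -> Optional[RecoveryPoint]:
    """Return the newest recovery point taken before corruption that validates."""
    # Newest first, so the first clean one found loses the least data.
    candidates = sorted(
        (p for p in points if p.taken_at < corrupted_from),
        key=lambda p: p.taken_at,
        reverse=True,
    )
    for point in candidates:
        if is_clean(point):
            return point
    return None  # nothing usable: escalate to a person


# Illustrative data only: four recovery points taken during one day.
points = [
    RecoveryPoint("RP1", datetime(2021, 8, 23, 8, 0)),
    RecoveryPoint("RP2", datetime(2021, 8, 23, 12, 0)),
    RecoveryPoint("RP3", datetime(2021, 8, 23, 16, 0)),
    RecoveryPoint("RP4", datetime(2021, 8, 23, 20, 0)),
]

# Stand-in for a real validation step: pretend RP3 fails its integrity check.
known_bad = {"RP3"}
chosen = choose_recovery_point(
    points,
    corrupted_from=datetime(2021, 8, 23, 17, 0),
    is_clean=lambda p: p.name not in known_bad,
)
print(chosen.name if chosen else "no clean recovery point found")
```

In this made-up example, RP4 is ruled out because it was taken after the corruption began, RP3 fails validation, and so RP2 is selected without anyone having to sit at a screen trying each copy in turn.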
To ensure mainframe resiliency, as much effort must be put into simplifying and organizing the restore process as is put into the backup process, so that any mainframe outages last for as little time as possible and no-one, in particular paying customers, notices.
But that’s not all. What happens if bad actors, whether nation-state or criminal-gang, get into your mainframe? Typically, there is a period of time during which hackers raise their security level, exfiltrate useful data, overwrite backups, encrypt data, and then display a ransom demand. Mainframe resiliency also demands that there be some way to identify the early stages of a ransomware attack and stop it spreading further. It also requires that the corrupted files are restored. For this to happen, some kind of software orchestration is required to ensure that the correct (and uncorrupted) backup files are identified, and the data is restored as quickly as possible.
There’s a lot more to mainframe resilience than people might think when they are sitting comfortably after a good lunch discussing the business continuity plan!
If you need anything written, contact Trevor Eddolls at iTech-Ed.