A system reliability checklist
When you're setting up a new service, system or site, here is a checklist that you can give to your stakeholders, site owners or service champions. It may help you determine what kind of redundancy and backup system you need, what your options for maintenance windows are, and so on.
This is primarily used to get a discussion going: they fill it out to tell you what they want, and you come back with how much it will cost them to get that. Then you negotiate.
Question 1 is about offsite backups and disaster recovery.
Question 2 is about near site backups.
Question 3 is about standard full recovery.
Question 4 is about full backups, which could be stored anywhere.
Question 5 is about point recovery.
Questions 6 and 7 are about responsiveness to unplanned outages.
Questions 8 and 9 are about timing for planned outages.
Questions 10 and 11 are about frequency of planned outages.
Backup and Recovery Questions
- Suppose that there was a catastrophic earthquake that destroyed all IT assets in your organization. When you rebuild those assets and the system that runs the service, and restore the data that the service depends on, that data should be:
  - no older than minutes before the disaster.
  - no older than hours before the disaster.
  - no older than the day before the disaster.
  - no older than a week before the disaster.
  - unrecoverable. We don't expect to be able to recover from this kind of disaster.
- Suppose that there was an earthquake or fire that destroyed the room containing the machine the service runs on. The system, its data storage, the tape backup system and the tapes in it were all destroyed, along with the rest of the machines in that room. When we rebuild the room, the system and the data storage, and restore the data, that data should be:
  - no older than minutes before the disaster.
  - no older than hours before the disaster.
  - no older than the day before the disaster.
  - no older than a week before the disaster.
  - unrecoverable. We don't expect to be able to recover from this kind of disaster.
- Suppose now that the earthquake or fire did not destroy the whole room, but only the system and its data storage. The tape backup system and the tapes in it are ok. When we rebuild the system and data storage and restore the data, that data should be:
  - no older than minutes before the disaster.
  - no older than hours before the disaster.
  - no older than the day before the disaster.
  - no older than a week before the disaster.
  - unrecoverable. We don't expect to be able to recover from this kind of disaster.
- Suppose now that there was no fire or earthquake. Somehow, however, the data the system uses was lost. The system, its data storage and the tape backup system and the tapes in it are all ok. When we restore the data, that data should be:
  - no older than minutes before the disaster.
  - no older than hours before the disaster.
  - no older than the day before the disaster.
  - no older than a week before the disaster.
  - unrecoverable. We don't expect to be able to recover from this kind of disaster.
- Suppose now that the system, its data storage and most of the data in it are ok, but somehow some subset of the data was deleted when it should not have been. The system, its data storage, etc. are all up, running and ok. When we go to restore that deleted data, we should be able to restore data which is:
  - no older than minutes before the disaster.
  - no older than hours before the disaster.
  - no older than the day before the disaster.
  - no older than a week before the disaster.
  - unrecoverable. We don't expect to be able to recover from this kind of disaster.
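Once the answers to questions 1 through 5 are in, one way to compare them against a proposed backup plan is sketched below. This is a minimal illustration, not part of the checklist: the numeric limits attached to "minutes", "hours" and so on, the `backup_interval_ok` helper, and the sample answers and intervals are all assumptions chosen for the example. It also assumes the worst-case age of restored data equals the time between backups.

```python
from datetime import timedelta

# Illustrative upper bounds for each answer. The checklist does not define
# what "minutes" or "hours" means, so these numbers are assumptions.
MAX_DATA_AGE = {
    "minutes": timedelta(minutes=15),
    "hours": timedelta(hours=4),
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
    "unrecoverable": None,  # no recovery expected, so no requirement
}


def backup_interval_ok(answer: str, backup_interval: timedelta) -> bool:
    """Return True if backups taken every `backup_interval` satisfy `answer`."""
    limit = MAX_DATA_AGE[answer]
    if limit is None:
        return True
    # Simplification: restored data is at most one backup interval old.
    return backup_interval <= limit


# Hypothetical answers to questions 1 through 5.
answers = {1: "week", 2: "day", 3: "day", 4: "hours", 5: "minutes"}

# Hypothetical backup interval for the copy each question depends on.
proposed = {
    1: timedelta(weeks=1),   # offsite copy
    2: timedelta(days=1),    # near-site copy
    3: timedelta(days=1),    # local tape backup
    4: timedelta(days=1),    # full backup
    5: timedelta(hours=1),   # snapshots for point recovery
}

# Questions where the proposed plan does not meet the stated expectation.
gaps = [q for q in sorted(answers) if not backup_interval_ok(answers[q], proposed[q])]
print("Questions where the proposed backups fall short:", gaps)
```

The questions that come back in `gaps` are the ones to bring to the negotiation: either the backups get more frequent, at a cost, or the expectation is relaxed.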
Uptime and Responsiveness Questions
- If you have an unexpected outage during business hours that does not destroy data, it is ok for the system to be down for:
  - it's not ok for the system to be down during business hours
  - a few minutes
  - a few hours
  - a day
  - whenever we get around to fixing it
- What about outside of business hours?
- If you need to do a planned outage for maintenance (patching, database vacuuming, reboot to put in a new kernel, etc.) during business hours, it is ok for the system to be down for:
  - it's not ok for the system to be down during business hours
  - a few minutes
  - a few hours
  - a day
  - however long we need to have it down
- What about outside of business hours?
- If you need to do regularly scheduled maintenance on the system which would cause an outage during business hours, how often is too often to do it?
  - more than once per day is too often
  - once per day is too often
  - once every few days is too often
  - once per week is too often
  - once every few weeks is too often
  - once per month is too often
  - once every few months is too often
  - don't do regularly scheduled outages on the system during business hours
- What about outside of business hours?
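
The "how often is too often" answers can be checked against a proposed maintenance schedule in the same spirit. The sketch below is only an illustration under stated assumptions: "few" is read as three, a month as 30 days, the answer labels are shortened forms of the options above, and `schedule_acceptable` is a hypothetical helper.

```python
from datetime import timedelta

# Interval at which each answer says outages become "too often".
# Reading "few" as three and a month as 30 days is an assumption.
TOO_OFTEN_INTERVAL = {
    "once per day": timedelta(days=1),
    "once every few days": timedelta(days=3),
    "once per week": timedelta(weeks=1),
    "once every few weeks": timedelta(weeks=3),
    "once per month": timedelta(days=30),
    "once every few months": timedelta(days=90),
}


def schedule_acceptable(answer: str, outage_interval: timedelta) -> bool:
    """Return True if one outage every `outage_interval` respects `answer`."""
    if answer == "never":
        # "don't do regularly scheduled outages": no schedule is acceptable.
        return False
    if answer == "more than once per day":
        # Once per day is still fine; anything more frequent is not.
        return outage_interval >= timedelta(days=1)
    # Outages must be strictly less frequent than the "too often" rate.
    return outage_interval > TOO_OFTEN_INTERVAL[answer]


# Hypothetical example: the owner said weekly is too often,
# and we propose a maintenance window every two weeks.
print(schedule_acceptable("once per week", timedelta(weeks=2)))
```

The same function can be run twice, once with the business-hours answer and once with the outside-of-business-hours answer, to see which window a proposed schedule fits.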

