Top three things to improve reliability
Quick: what are the three things you should do to make the greatest improvement in the reliability and availability of the systems you’re responsible for?
Marketing for IT products and the general media tend to emphasize opportunities to purchase reliability. This makes sense because they’re in the business of selling things. A classic example is the emphasis on extraordinarily redundant server hardware. A modern server can be purchased with redundant disks, redundant power supplies, redundant memory, and even, in some extraordinary cases, redundant processors. This lets vendors show that their server hardware has a staggeringly high mean time between failures, and who wants to be the IT manager who takes an outage because they didn’t purchase a reliability option they could have?
Before charging down the road of buying ever more elaborate hardware redundancy, let’s sit back and look at the big picture of where failures are coming from.
- A well trained person will make a mistake on the order of one time for every one hundred opportunities. Not all of those mistakes will result in an outage, but many will.
- If your solution employs any custom software, it is far more likely to have a problem that would cause an outage than widely-used off-the-shelf software. As a general rule of thumb, the longer a piece of software has been used, the more reliable it has become because the logic errors in it have been found & resolved.
- Hardware fails along a well-established bowl-shaped curve (often called the bathtub curve): most failures occur while the hardware is very young (typically in the first 60 days of operation), and then the failure rate starts picking up again after approximately five years for enterprise hardware, or three or so for consumer-grade hardware. Even then, the failure slope is typically very gentle.
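To put the first bullet in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes every opportunity is independent and carries the same 1-in-100 chance of a mistake, which real operations work will not match exactly, and the function name is made up for illustration.

```python
def chance_of_at_least_one_mistake(opportunities: int, error_rate: float = 0.01) -> float:
    """Probability of one or more mistakes across independent opportunities.

    error_rate defaults to the 1-in-100 rule of thumb quoted above.
    """
    # The chance of getting every single opportunity right is (1 - rate)^n;
    # the chance of at least one mistake is whatever is left over.
    return 1.0 - (1.0 - error_rate) ** opportunities


# Illustrative step counts for a change, a migration, a quarter of routine work...
for n in (10, 50, 100, 250):
    chance = chance_of_at_least_one_mistake(n)
    print(f"{n:>4} manual steps: {chance:.1%} chance of at least one mistake")
```

Under these assumptions, 100 manual steps carry roughly a 63% chance of at least one mistake, which is why reducing the number of hand-performed steps tends to matter more than one more redundant power supply.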
From basic reliability monitoring (link to detail) we get the following points about improving the availability of a given system:
- To improve the reliability of the whole system, focus on the worst item. Nothing else will have a useful impact.
- Reliability only gets worse when you add new components to the system that have to function for the system to function.
- When you employ load balanced clustering, controlling how long it takes to fix a down system is a significant driver in the effective availability of the system. This is often referred to as the Mean Time To Recovery (MTTR). This means you must employ monitoring to detect when a redundant item isn’t working so you can restore redundancy as soon as possible.
- Failover clustering is primarily for having predictable, controlled downtime which ideally is during maintenance periods that do not count against your availability. Its primary benefit is consistency and scheduling.
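The first three points above can be checked with standard availability arithmetic. The sketch below assumes independent failures and the usual steady-state approximation, availability = MTBF / (MTBF + MTTR). The component names and numbers are invented purely for illustration.

```python
from math import prod


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one component."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series(*parts: float) -> float:
    """Every part must work for the system to work."""
    return prod(parts)


def redundant(*parts: float) -> float:
    """The system works as long as at least one part works."""
    return 1.0 - prod(1.0 - a for a in parts)


# Hypothetical three-tier system; the database is the worst item.
web, app, db = 0.999, 0.999, 0.99

baseline = series(web, app, db)
better_web = series(0.9999, app, db)    # polish an already-good item
better_db = series(web, app, 0.999)     # fix the worst item
one_more = series(web, app, db, 0.999)  # add another required component

print(f"baseline:            {baseline:.4%}")
print(f"improve best item:   {better_web:.4%}")
print(f"improve worst item:  {better_db:.4%}")
print(f"add a required part: {one_more:.4%}")

# Load-balanced pair: same hardware, different time to notice and repair.
slow_fix = redundant(availability(1000, 48), availability(1000, 48))
fast_fix = redundant(availability(1000, 4), availability(1000, 4))
print(f"pair, 48h repair:    {slow_fix:.4%}")
print(f"pair,  4h repair:    {fast_fix:.4%}")
```

With these made-up figures, improving the web tier barely moves the total, improving the database moves it far more, adding a fourth required component lowers it, and shortening the repair time raises the availability of the redundant pair without buying any new hardware.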
Now that we’ve gone through that groundwork, let’s go back to the original question: What can we do that will have the most effect on the reliability and availability of our system?
- It’s the people & processes: Human error is the single greatest cause of downtime. In nearly all cases, you can get your best overall improvements by reviewing the people and process factors behind your past outages.
- Make new systems prove themselves: Whether it’s hardware or software, give it some time running where it ultimately will live before you trust it. About 60 days for most server-grade hardware will identify the hard drives that are going to fail (by far the most likely failure) and even less (10 days) will typically illuminate electronic demons such as memory, network cards, etc.
- Install Monitoring: However you do it, make sure you have monitoring so you know positively that things are healthy, and that you’ll get alarms when they are not. Having a RAID array doesn’t help you if no one notices the first disk die.
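As a minimal sketch of the monitoring point, the snippet below runs a set of health checks and reports either a positive "OK" or an alarm naming what failed. Everything in it is hypothetical: the check names, the lambda stand-ins for real probes, and the idea of printing the result instead of paging someone. A real deployment would use whatever monitoring tool you already have.

```python
from typing import Callable, Dict, List


def find_unhealthy(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of those that failed."""
    unhealthy = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            # A check that cannot run is treated as a failure, not as silence.
            ok = False
        if not ok:
            unhealthy.append(name)
    return unhealthy


def report(checks: Dict[str, Callable[[], bool]]) -> str:
    """Return an explicit OK message or an alarm listing failed checks."""
    bad = find_unhealthy(checks)
    if bad:
        return "ALARM: " + ", ".join(bad)
    return f"OK: all {len(checks)} checks passed"


# Hypothetical stand-ins for real probes (RAID status, PSU sensors, and so on).
example_checks = {
    "raid-disk-0": lambda: True,
    "raid-disk-1": lambda: False,  # simulated dead disk; the array still serves data
    "power-supply-b": lambda: True,
}

print(report(example_checks))  # prints "ALARM: raid-disk-1"
```

The design choice worth noting is that each redundant member gets its own check. Checking only "is the service up?" would pass here, and the lost redundancy would go unnoticed until the second disk failed.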
What’s your experience?
Have a great story to share? Disagree with this approach? Post your comments or drop me a line to continue the conversation.