
Mar 10

Don’t Taunt the Bear


When I first started at John Deere, I was working in a division that deployed systems to dealerships. Up until that point, they hadn’t done anything with hardware RAID. Dealerships are extremely cost-conscious, and while I was a huge believer in the value of hardware RAID arrays, the arrays still needed to prove their merit. At that time, HP was the preferred vendor for dealership equipment, so I had gotten them to provide us a demonstration server with a hardware RAID card that I could show off. The high point of my demo to the service staff was pulling a drive out of the running server while it was in the middle of a very visible, high-load process – and, to everyone’s surprise, it just kept running! The first time I did the demo, it worked great – I pulled out the first drive and the server didn’t miss a beat.

A day later I was doing the same demo for a group of managers. The previous day’s work had been fruitful – it had gotten the attention I wanted, and now a higher group wanted to discuss it. This time around, someone raised the question, “So, any drive can fail and the system keeps running?” With much bravado I replied, “Sure! Watch!” and pulled out the second drive. Two seconds later, to my shock, the system froze and then went to a blue screen.

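This was when I discovered that, unlike the Compaq systems I was used to, the HP system didn’t automatically rebuild by default when you reinserted the drive. The drive I had pulled and reinserted the day before had never been rebuilt, so the array had been running without redundancy ever since, and pulling the second drive took it down.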

I took a number of lessons away from this:

  1. Don’t assume each vendor’s equipment works the same way, even if that way seems to make a lot of sense.
  2. There is almost no amount of checking and rechecking that is too much when removing redundant components.

When you work with systems designed for high reliability, it’s often tempting to take advantage of the innate redundancy of the system and be somewhat more cavalier in your operational procedures. For example, if you have two web servers that are part of a load-balanced cluster, conceptually you can take one offline, reboot it and do whatever you like – right in the middle of the day, when it’s convenient for your IT staff. On the surface, there’s nothing wrong with this – if everything operates as designed, you should be able to rip out the second server and do whatever you want without causing a problem. It’s very tempting to forget the cluster while working on the server.

However, it pays to be vigilant in this circumstance. Don’t taunt the bear – just because it shouldn’t cause a problem doesn’t mean it won’t cause a problem. For example, what if the server comes back online partway through your work, such as after a reboot? Depending on exactly how your load balancing system works, the server may start receiving new requests because it appears to be operational. It’s very hard to explain to your peers and the rest of the business that you went offline because you took a shortcut.
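The reboot scenario can be made concrete with a toy model. This is a minimal sketch, not any real load balancer's API: `Node`, `Balancer`, the `drained` flag and the `honor_drain` switch are all hypothetical names invented for illustration. It assumes a balancer that routes purely on whether a node answers its health probe, and shows the node rejoining the pool as soon as the operating system is back up, even though the maintenance is unfinished.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    responding: bool = True   # what a health probe would observe
    drained: bool = False     # operator-set flag (hypothetical, for illustration)


@dataclass
class Balancer:
    nodes: list
    honor_drain: bool = False  # False models a balancer that trusts health probes only

    def eligible(self):
        """Names of nodes that would receive new requests right now."""
        return [
            n.name for n in self.nodes
            if n.responding and not (self.honor_drain and n.drained)
        ]


def reboot_for_maintenance(balancer, node):
    """Return the eligible pool at each stage of a mid-day reboot."""
    stages = {}
    node.drained = True              # operator marks the node as under maintenance
    node.responding = False          # node goes down for the reboot
    stages["down"] = balancer.eligible()
    node.responding = True           # OS is back up, but the work is not finished
    stages["back_up_unfinished"] = balancer.eligible()
    return stages


web1, web2 = Node("web1"), Node("web2")
print(reboot_for_maintenance(Balancer([web1, web2]), web2))
# {'down': ['web1'], 'back_up_unfinished': ['web1', 'web2']}
```

With `honor_drain=True` the second stage would list only `web1`. The only difference between the two runs is whether the balancer consults an operator-controlled flag in addition to the probe.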

There is a fine line between taking advantage of redundancy and causing problems.

Don’t Count on Redundancy

At a SaaS company I worked for, we had a highly redundant SAN. Each server had two cards, which connected to two independent switches, each of which in turn had a connection to both of the storage processors that ran the array. The whole system was designed and certified by the vendor to operate without interruption in the face of a failure of a card, switch, storage processor, etc. It was also designed to stay continuously operational while every component was upgraded – the firmware of the switches, the storage processors, etc.

This highly redundant design opens the possibility of performing configuration changes, firmware upgrades, even component replacement during the day while business is going on – after all, it should work just fine. This is a good example of being tempted to taunt the bear – just because a system should be redundant and not have a problem with what you’re doing, don’t bank on that capability if you don’t have a compelling reason to do so. If you have to do it, don’t rely on automatic redundancy behavior – manually take the component offline.
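One way to see why daytime work on such a SAN deserves caution is to count the paths that remain while a component is offline. The sketch below is only an illustration under assumptions: the component names are made up, and the wiring (each card cabled to exactly one switch, each switch reaching both storage processors) is a guess at the topology described, not the vendor's actual design.

```python
# Hypothetical wiring for one server, assumed from the description:
# each card goes to one switch, and each switch reaches both storage processors.
LINKS = {
    "card_a": ["switch_1"],
    "card_b": ["switch_2"],
    "switch_1": ["sp_a", "sp_b"],
    "switch_2": ["sp_a", "sp_b"],
}
CARDS = ["card_a", "card_b"]
COMPONENTS = CARDS + ["switch_1", "switch_2", "sp_a", "sp_b"]


def live_paths(down):
    """List card -> switch -> storage processor paths that avoid everything in `down`."""
    paths = []
    for card in CARDS:
        if card in down:
            continue
        for switch in LINKS[card]:
            if switch in down:
                continue
            for sp in LINKS[switch]:
                if sp not in down:
                    paths.append((card, switch, sp))
    return paths


def fatal_single_failures(already_down=frozenset()):
    """Components whose failure would cut the last path, given what is already offline."""
    return [
        c for c in COMPONENTS
        if c not in already_down and not live_paths(set(already_down) | {c})
    ]


print(fatal_single_failures())              # []
print(fatal_single_failures({"switch_1"}))  # ['card_b', 'switch_2']
```

In the healthy state no single failure is fatal. While one switch is down for a firmware upgrade, two components become single points of failure under this assumed wiring, which is the exposure you accept by doing the work during business hours.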

Treat the bear with respect. If you can, schedule work for maintenance time periods so that if there is a service interruption it will have the smallest impact. If you have a good deal of experience showing that a particular action won’t cause a problem, then you might perform it just outside of business hours instead of during maintenance time periods (which are often in the dark of night).

Restoring Redundancy

The rules change a little when dealing with a failure. For example, if a drive fails in a redundant array and you get a new drive in, you have to balance the goal of restoring redundancy against the risk of replacing the drive. There are a number of risk elements in replacing a failed drive:

  1. You could pull the wrong drive, causing the whole array to fail.
  2. The physical disconnection of the drive could cause a SCSI bus reset or some other momentary interruption of data on the array.
  3. The new drive could be electrically defective and short the bus.
  4. Mechanically inserting the drive could disrupt the bus or jar another drive or other physical part, causing the array to fail.

So, how do you balance the desire to replace the failed drive with the risks of causing the array to fail?

  1. If the system is stable and still redundant, wait until the next scheduled maintenance period to perform corrective action. There’s no rush.
  2. If it is not redundant, but operable, you need to balance risk with benefit. It is very unlikely that an independent part will fail within 24 hours of another failure, so you can almost always wait until a low-activity time outside of business hours, or even the middle of the night, to replace the component.
  3. If the system is not stable, you have the most difficult decision. First, don’t make this decision on your own. Get together at least the available IT engineers and, if at all possible, a representative of the business process(es) affected by the problem. You need to balance the current instability against the probability that you will make it worse by changing the system. If it’s just a dead drive, this is pretty easy: low risk, high benefit (although a dead drive alone is unlikely to put you in an unstable situation).
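The three rules above can be written down as a small decision helper, which is handy if you want the policy in a runbook or ticket template. This is a sketch only: the `State` enum, `classify`, `repair_plan` and the wording of the returned strings are hypothetical, and treating "operable" as the same thing as "stable" is a simplifying assumption.

```python
from enum import Enum


class State(Enum):
    REDUNDANT = "stable and still redundant"
    DEGRADED = "operable but not redundant"
    UNSTABLE = "not stable"


def classify(stable: bool, redundant: bool) -> State:
    """Place the system in one of the three cases from the list above."""
    if not stable:
        return State.UNSTABLE
    return State.REDUNDANT if redundant else State.DEGRADED


def repair_plan(state: State) -> dict:
    """Say when to replace the failed part and whether a group must decide."""
    if state is State.REDUNDANT:
        return {"when": "next scheduled maintenance period",
                "needs_group_decision": False}
    if state is State.DEGRADED:
        return {"when": "next low-activity time outside business hours",
                "needs_group_decision": False}
    # Unstable: engineers and a business representative weigh risk against benefit.
    return {"when": "once the group agrees the benefit outweighs the risk",
            "needs_group_decision": True}


# A mirrored pair that has lost one drive: still running, no longer redundant.
print(repair_plan(classify(stable=True, redundant=False)))
```

The helper deliberately returns advice rather than taking any action. The judgment about risk versus benefit stays with people, as the third rule requires.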

Lockout / Tagout

Clustering systems can both recognize automatically when a node is down (automatic failover) and be told manually to ignore a node (manual failover). Before performing invasive work on a node in the cluster that has been taken offline automatically, go back to the clustering system and place the node offline manually. Think of this as the equivalent of the procedure used when working with dangerous machinery – Lockout/Tagout. Straight from our friends at OSHA:

“Lockout/Tagout (LOTO)” refers to specific practices and procedures to safeguard employees from the unexpected energization or startup of machinery and equipment, or the release of hazardous energy during service or maintenance activities.

This is exactly what we want to do – make sure that, while we’re performing actions that impair the availability of part of a reliable system, the cluster is configured so that the part can’t be accidentally used. There are two parts to this: lock out the item so it can’t be unintentionally accessed, and tag the device so that everyone knows it’s locked out. You want to be clear on how to accomplish both for each cluster you have. The latter may take the form of a simple notification – an email to your support team or a post on a central site. The point is that you need a big, visible way of clearly communicating the status of the device.

If your clustering mechanism doesn’t have a way of doing this, or it relies on the node itself (such as Windows NLB), you should consider the node always live and dangerous.
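Where the clustering system offers no such facility, the lock and the tag can at least be tracked in one place. The following is a minimal sketch of the idea, not a real cluster manager's interface: `LockoutRegistry`, `Tag` and every method name are hypothetical, the `notify` callback stands in for the email or central-site post, and the rule that only the person who placed a lock may remove it is a design choice borrowed from how physical lockout is usually practiced.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Tag:
    node: str
    owner: str
    reason: str
    placed_at: datetime


class LockoutRegistry:
    """Hypothetical stand-in for a cluster manager's manual-offline list."""

    def __init__(self, notify=print):
        self._tags = {}
        self._notify = notify  # e.g. send an email or post to a central site

    def lock_out(self, node, owner, reason):
        """Lock: exclude the node. Tag: record and announce who did it and why."""
        if node in self._tags:
            raise RuntimeError(
                f"{node} is already locked out by {self._tags[node].owner}")
        tag = Tag(node, owner, reason, datetime.now(timezone.utc))
        self._tags[node] = tag
        self._notify(f"LOCKED OUT: {node} by {owner}: {reason}")
        return tag

    def release(self, node, owner):
        """Remove the lock; only the person who placed it may do so."""
        tag = self._tags.get(node)
        if tag is None:
            raise KeyError(f"{node} is not locked out")
        if tag.owner != owner:
            raise PermissionError(
                f"{node} is tagged by {tag.owner}; only they may release it")
        del self._tags[node]
        self._notify(f"RELEASED: {node} by {owner}")

    def may_route_to(self, node):
        """The cluster would check this before sending work to a node."""
        return node not in self._tags


registry = LockoutRegistry()
registry.lock_out("web2", "alice", "replacing failed drive")
print(registry.may_route_to("web2"))  # False
```

The key property is that a node coming back to life on its own changes nothing here. It stays out of service until a person explicitly releases it.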

Nice Bear. Friendly Bear.

If the bear is working well, let him continue doing what he’s doing. Your running system should be treated with respect at all times, because a great deal of complexity goes into each of the elements and how they work together, even if it appears simple on the outside. As a person responsible for a reliable system, you need to always be thinking in the long term. You don’t want to cause an outage just to deploy an upgraded component or firmware. Almost without question, the theoretical issues fixed by the firmware update aren’t going to be as important to your customers as the real issues caused by a service interruption.

Categories : Infrastructure