Posts Tagged ‘SAN’
Don’t Taunt the Bear
Written by Kendall Miller on March 10, 2008 – 12:45 am
When I first started at John Deere, I was working in a division that deployed systems to dealerships. Up until that point, they hadn’t done anything with hardware RAID. Dealerships are extremely cost-conscious, and while I was a huge believer in the value of hardware RAID arrays, the arrays still needed to prove their merit. At that time, HP was the preferred vendor for dealership equipment, so I had gotten them to provide us with a demonstration server with a hardware RAID card so I could show it off. The high point of my demo to the service staff was when I pulled a drive out of the running server while it was in the middle of running a very visible, high-load process - and to everyone’s surprise it just kept running! The first time I did the demo, it worked great - I pulled out the first drive and the server didn’t miss a beat.
A day later I was doing the same demo for a group of managers. The previous day’s work had been fruitful - it had gotten the attention I wanted and now a higher group wanted to discuss it. This time around, someone raised the question “so, any drive can fail and the system keeps running?” With much bravado I replied “sure! Watch!” and pulled out the second drive. Two seconds later to my shock the system froze and then went to a blue screen.
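This was when I discovered that, unlike the Compaq systems I was used to, the HP system didn’t automatically rebuild by default when you reinserted the drive. The array had been running degraded ever since the first demo, so pulling a second drive took it down.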
I took a number of lessons away from this:
- Don’t assume each vendor’s equipment works the same way, even if that way seems to make a lot of sense.
- There is almost no amount of check & recheck that is too much when removing redundant components.
When you work with systems designed for high reliability, it’s often tempting to take advantage of the innate redundancy of the system to allow you to be somewhat more cavalier in your operational procedures. For example, if you have two web servers that are part of a load-balanced cluster, conceptually you can take one offline, reboot it and do whatever - right in the middle of the day when it’s convenient to your IT staff. On the surface, there’s nothing wrong with this - if everything operates as designed, you should be able to rip out the second server and do whatever you want without causing a problem. It’s very tempting to forget the cluster while working on the server.
However, it often pays to be vigilant in this circumstance. Don’t taunt the bear - just because it shouldn’t cause a problem doesn’t mean it won’t cause a problem. For example - what if the server comes back online during the reboot, before you’ve finished working on it? Depending on how exactly your load balancing system works, it may start getting new requests because it appears to be operational. It’s very hard to explain to your peers and the rest of the business why you went offline because you took a shortcut.
There is a fine line between taking advantage of redundancy and causing problems.
Don’t count on Redundancy
At a SaaS company I worked for we had a highly redundant SAN. Each server had two cards connected to two independent switches, and each switch in turn had a connection to both of the storage processors that ran the array. The whole system was designed and certified by the vendor to operate without interruption in the face of a failure of a card, switch, storage processor, etc. It was also designed to be continuously operational while having every component upgraded - the firmware of the switch, the storage processors, etc.
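One way to sanity-check a design like this is to enumerate the I/O paths and confirm that no single component failure removes all of them. The sketch below is illustrative only: the component names (`hba_a`, `switch_a`, `sp_a`, and so on) are hypothetical, and it assumes each card is cabled to exactly one switch while each switch reaches both storage processors.

```python
# Hypothetical topology: each card (HBA) is cabled to one switch,
# and each switch is cabled to both storage processors.
HBA_TO_SWITCH = {"hba_a": "switch_a", "hba_b": "switch_b"}
SWITCH_TO_SPS = {"switch_a": ["sp_a", "sp_b"], "switch_b": ["sp_a", "sp_b"]}


def all_paths():
    """Return every (card, switch, storage processor) path from server to array."""
    return [
        (hba, switch, sp)
        for hba, switch in HBA_TO_SWITCH.items()
        for sp in SWITCH_TO_SPS[switch]
    ]


def surviving_paths(failed):
    """Return the paths that avoid every component in the `failed` set."""
    return [path for path in all_paths() if not failed.intersection(path)]


def single_points_of_failure():
    """List components whose failure alone would cut the server off from storage."""
    components = {part for path in all_paths() for part in path}
    return sorted(c for c in components if not surviving_paths({c}))


if __name__ == "__main__":
    print("paths in total:", len(all_paths()))
    print("single points of failure:", single_points_of_failure())
    # One card is already down; now a switch on the other side is taken offline.
    print("paths left:", surviving_paths({"hba_a", "switch_b"}))
```

The last call is the interesting one: the design survives any single failure, but working on one side while the other side is already impaired leaves no path at all.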
This highly redundant design opens the possibility of performing configuration changes, firmware upgrades, even component replacement during the day while business is going on - after all, it should work just fine. This is a good example of being tempted to taunt the bear - just because a system should be redundant and not have a problem with what you’re doing, don’t bank on that capability if you don’t have a compelling reason to do so. If you have to do it, don’t rely on automatic redundancy behavior - manually take the component offline.
Treat the bear with respect. If you can, schedule work for maintenance time periods so that if there is a service interruption it will have the smallest impact. If you have a good deal of experience that a particular action won’t cause a problem then you might perform it just outside of business hours instead of during maintenance time periods (which are often in the dark of night).
Restoring Redundancy
The rules change a little when dealing with a failure. For example, if a drive fails in a redundant array and the replacement drive arrives, you have to balance the goal of restoring redundancy against the risk of replacing the drive. There are a number of risk elements in replacing a failed drive:
- You could pull the wrong drive, causing the whole array to fail.
- The physical disconnection of the drive could cause a SCSI bus reset or some other momentary interruption of data on the array.
- The new drive could be electrically defective and short the bus.
- Mechanically inserting the drive could disrupt the bus or jar another drive or other physical part, causing the array to fail.
So, how do you balance the desire to replace the failed drive with the risks of causing the array to fail?
- If the system is stable and still redundant, wait until the next scheduled maintenance period to perform corrective action. There’s no rush.
- If it is not redundant, but operable, you need to balance risk with benefit. It is very unlikely that an independent part will fail within 24 hours of another failure, so you can almost always wait until a low activity time outside of business hours or even in the middle of the night to replace the component.
- If the system is not stable, you have the most difficult decision. First, don’t make this on your own. Get together at least the available IT engineers and, if at all possible, a representative of the business process(es) affected by the problem. You need to balance the current instability with the probability that you will make it worse by changing the system. If it’s just a dead drive, this is pretty easy: Low risk, high benefit (however it’s unlikely you’d be in an unstable situation if this happened).
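The triage above can be written down as a small decision function, with a rough estimate of the "second failure within 24 hours" risk alongside it. This is a sketch under stated assumptions: the state names and return strings are mine, and the probability model assumes independent failures at a constant rate, which is a simplification rather than anything the array vendor provides.

```python
import math
from enum import Enum

HOURS_PER_YEAR = 8760


class ArrayState(Enum):
    REDUNDANT = "stable and still redundant"
    DEGRADED = "operable but not redundant"
    UNSTABLE = "not stable"


def replacement_plan(state):
    """Map the array's state to when the failed part should be replaced."""
    if state is ArrayState.REDUNDANT:
        return "wait for the next scheduled maintenance period"
    if state is ArrayState.DEGRADED:
        return "replace at the next low-activity time outside business hours"
    # Unstable: there is no automatic answer, this is a group decision.
    return "convene IT engineers and a business representative before acting"


def second_failure_probability(annual_failure_rate, hours, remaining_parts=1):
    """Chance that at least one remaining part fails within `hours`.

    Assumes independent failures at a constant rate (exponential model).
    `annual_failure_rate` is a fraction below 1.0, e.g. 0.03 for 3% per year.
    """
    hourly_rate = -math.log(1.0 - annual_failure_rate) / HOURS_PER_YEAR
    return 1.0 - math.exp(-hourly_rate * hours * remaining_parts)


if __name__ == "__main__":
    print(replacement_plan(ArrayState.DEGRADED))
    # 0.03 is a made-up failure rate used purely for illustration.
    print(second_failure_probability(0.03, hours=24))
```

With the assumed 3% annual rate, the result for a 24-hour wait is well under one in a thousand, which is the intuition behind waiting for a quiet period. The model says nothing about correlated failures, so treat it as a lower bound on caution, not a guarantee.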
Lockout / Tagout
Clustering systems combine the ability to recognize automatically when a node is down (automatic failover) with the ability to be told manually to ignore a node (manual failover). Before performing invasive work on a node in the cluster that has been taken offline automatically, go back to the clustering system and place the node offline manually. Think of this as being the equivalent of procedures used when working with dangerous machinery - Lockout/Tagout. Straight from our friends at OSHA:
“Lockout/Tagout (LOTO)” refers to specific practices and procedures to safeguard employees from the unexpected energization or startup of machinery and equipment, or the release of hazardous energy during service or maintenance activities.
This is exactly what we want to do - make sure while we’re performing actions that impair the availability of part of a reliable system we have the cluster configured so that the part can’t be accidentally used. There are two parts of this: Lock out the item so it can’t be unintentionally accessed and tag the device so that everyone knows that it’s locked out. You want to be clear on how to accomplish both for each cluster you have. The latter may take the form of just notification - an email to your support team - or a post on a central site. The point is you need a big, visible way of clearly communicating the status of the device.
If your clustering mechanism doesn’t have a way of doing this, or it relies on the node itself (such as Windows NLB), you should consider it always live and dangerous.
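To make the distinction between automatic and manual removal concrete, here is a toy model of a cluster pool. It is not the API of any real clustering product; the class, method names, and node names are all hypothetical. The point it illustrates is that a manual lockout must override a passing health check, and that the lockout carries a visible tag.

```python
class Cluster:
    """Toy cluster pool with automatic health checks and manual lockout."""

    def __init__(self, nodes):
        self.healthy = {name: True for name in nodes}
        self.tags = {}  # node name -> tag text explaining the lockout

    def report_health(self, name, is_up):
        """Record the result of an automatic health check."""
        self.healthy[name] = is_up

    def lock_out(self, name, who, reason):
        """Manually remove a node from service and tag it so others can see why."""
        self.tags[name] = f"LOCKED OUT by {who}: {reason}"

    def release(self, name):
        """Remove the lockout; the node serves again once it is also healthy."""
        self.tags.pop(name, None)

    def eligible_nodes(self):
        """Nodes that may receive work: healthy AND not locked out."""
        return sorted(
            name for name, up in self.healthy.items()
            if up and name not in self.tags
        )

    def status(self, name):
        """Human-readable status, showing the tag when a node is locked out."""
        if name in self.tags:
            return self.tags[name]
        return "in service" if self.healthy[name] else "down (automatic)"


def demo():
    cluster = Cluster(["web1", "web2"])
    cluster.report_health("web2", False)             # detected as down automatically
    cluster.lock_out("web2", "ops", "hardware work")  # lock out before touching it
    cluster.report_health("web2", True)              # node reboots mid-repair, looks healthy
    during = cluster.eligible_nodes()
    cluster.release("web2")                          # work finished, remove the lock
    after = cluster.eligible_nodes()
    return during, after


if __name__ == "__main__":
    print(demo())
```

Without the `lock_out` call, the second health report would have put `web2` straight back into rotation while someone was still working on it.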
Nice Bear. Friendly Bear.
If the bear is working well, let him continue doing what he’s doing. Your running system should be treated with respect at all times, because there is a great deal of complexity that goes into each of the elements and how they work together, even if it appears simple on the outside. As a person responsible for a reliable system, you need to always be thinking in the long term. You don’t want to cause an outage just to deploy an upgraded component or firmware. Almost without question, the theoretical issues fixed by the firmware update aren’t going to be as important to your customers as the real issues caused by a service interruption.
Tags: cluster, lockout, redundancy, SAN
Posted in Infrastructure
Why you should use Microsoft Cluster Service (MSCS)
Written by Kendall Miller on February 18, 2008 – 2:15 am
If you go through the web and do as much research as you can, you’ll find very polarized opinions about MSCS. I’ve been using it since 2002 and have found it to be outstanding, but I can see some pitfalls that could create a bad rap for it.
Why are you clustering?
First, I think Microsoft does it a disservice in how they market it. Instinctively, most people focus on using MSCS in case a given computer’s hardware or operating system spontaneously fails. I’d say that in operating a number of clusters over six years, this was a very rare event for us. In fact, it only happened when we had some brand new hardware fail within its burn-in period. Instead, we’ve found that its great value is in reducing downtime due to maintenance activities.
Example Server Update
Consider the scenario of needing to install the latest patches from Windows Update on your database server. Below are the steps you could go through without clustering:
1. Wait until your maintenance window (let’s assume it’s 1:00 AM on Sunday morning, the low time of your load profile).
2. Take the applications that use your database server offline (to be nice to your users and ensure everything closes).
3. Install the patches on your database server.
4. Reboot your database server.
5. Verify that the server works (that the patches haven’t introduced a problem).
6. Bring all applications back online.
What’s noteworthy in the list above are the items that have a variable duration (it may take a different amount of time each time you do maintenance and may not be particularly predictable) vs. a fixed amount of time. In particular, #3 and #5 are variable (and #4 may be).
Now let’s play that again if you have MSCS installed:
1. Install patches on the offline database server node.
2. Reboot the offline server.
3. Wait until your maintenance window.
4. Take the applications that use your database server offline (to be nice to your users and ensure everything closes).
5. Fail over to the offline server.
6. Verify that the server works (that the patches haven’t introduced a problem).
7. Bring all applications back online.
8. Wait a reasonable period of time (like a few days) and install patches on the server that’s now offline.
9. Reboot the offline server.
It is more steps (because there are two servers involved), but what we’ve done is move the things that take variable time outside of the critical window when the system is in maintenance mode. Everything that happens during maintenance mode (steps 4-7) is predictable. Additionally, I consider any server reboot to be risky. Problems tend to show up during a reboot that show up at no other time: hardware problems, for instance, and even in a reasonably tight environment it’s possible that a configuration change was made that hasn’t taken effect yet, will take effect on reboot, and will cause a problem. With an MSCS cluster, this risky event happens while the server is offline and won’t affect the production use of your application. You’ve also verified the basic integrity of the patches (after all, the server booted and you can monitor its event log to know it’s basically healthy) before even scheduling your maintenance period.
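The difference between the two procedures can be made explicit by tagging each step with whether users are affected while it runs and whether its duration is hard to predict. The sketch below is only a model of the two lists above: the `Step` type and the flag values are my reading of the text (including its claim that steps 4-7 of the clustered procedure are predictable), not measurements.

```python
from typing import NamedTuple


class Step(NamedTuple):
    name: str
    in_window: bool  # True if users are affected while this step runs
    variable: bool   # True if the step's duration is hard to predict


UNCLUSTERED = [
    Step("wait for maintenance window", False, False),
    Step("take applications offline", True, False),
    Step("install patches", True, True),
    Step("reboot server", True, True),  # the text says this one "may be" variable
    Step("verify server", True, True),
    Step("bring applications online", True, False),
]

CLUSTERED = [
    Step("install patches on offline node", False, True),
    Step("reboot offline node", False, True),
    Step("wait for maintenance window", False, False),
    Step("take applications offline", True, False),
    Step("fail over to patched node", True, False),
    Step("verify server", True, False),  # node was already booted and checked
    Step("bring applications online", True, False),
    Step("install patches on the now-offline node", False, True),
    Step("reboot the now-offline node", False, True),
]


def risky_steps(plan):
    """Steps that are both unpredictable and performed while users are affected."""
    return [step.name for step in plan if step.in_window and step.variable]


if __name__ == "__main__":
    print("unclustered:", risky_steps(UNCLUSTERED))
    print("clustered:  ", risky_steps(CLUSTERED))
```

The clustered plan has more steps in total, but none of its unpredictable ones fall inside the window.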
The comparison gets even better when you consider what happens in the first scenario above if you need to roll back a patch. With a cluster, you just fail back to the original node and you’re good to go. Without a cluster, you have to uninstall the patch, reboot, and re-certify.
Benefits Summary
- Clustering makes system maintenance predictable and short.
- Clustering lets you do risky things during main business hours instead of the middle of the night
- Clustering lets you roll back a change very quickly and easily
If you’re clustering for these reasons, you’ll get great value out of it.
How are you clustering?
Shared Storage - The Traditional Approach
To their credit, Microsoft has worked to make MSCS work with a pretty broad range of hardware. Traditionally, MSCS depends on being able to expose disks to more than one server at the same time. This can be done with traditional server direct attach storage (DAS) technology - SCSI (and now SAS) - but it relies on a set of very intricate hardware: RAID controllers in each server, special cutover terminators in the storage enclosure, etc. There is a lot that can go wrong, and when it does you may lose all of your data. For example, the configuration in the RAID controllers has to agree on what the virtual disks are. The shared storage is used at least for a special drive (called the Quorum drive) that stores central cluster configuration data and defines which node is currently the active node of the cluster. Additionally, any clustered service (like Microsoft SQL Server or Exchange) typically has its disks shared between the nodes in the cluster as well. If you don’t need to split your clustered nodes into different data centers (to create a geodiverse or “stretch” cluster) then this is a solid and straightforward way to go.
What I recommend is that you use a storage technology that encapsulates all of the RAID technology separate from the servers and is based on a technology that is fundamentally oriented towards sharing disks with multiple servers. This way you minimize the configuration on each server and the probability that a difference between servers will lose data. The traditional way of doing that is with a Storage Area Network (SAN). If you consider the two primary SAN technologies (Fibre Channel and iSCSI) both are fundamentally about sharing storage with multiple servers.
If you are only installing a shared storage array for one cluster, you can technically do without the hardware that makes a SAN a SAN - you can have a shared array directly attached to two servers. Most storage arrays support this, and it’s a very cost effective way to get started with separate storage arrays and be able to build later on this foundation to make a full size SAN down the road to optimize your operating costs. You’ll realize another benefit which is that these arrays are almost universally much faster and more scalable than direct attach storage is, for a range of reasons. You’ll be amazed at how much scalability it adds to your database server.
Shared Nothing Approach
Possible in Windows Server 2003 R2 Enterprise, and significantly improved in Windows Server 2008, is the ability to set up a cluster that doesn’t rely on the quorum drive being a single physical resource. Instead, the cluster either employs a third server (called the Witness server, which can’t actually host the clustered processes) that each node can talk to across the network, or, when the cluster itself has three or more nodes, relies on voting between the nodes. Because the quorum no longer has to be physically accessible to every node in the cluster, services that don’t rely on shared storage (such as a simple Windows service) can be clustered easily. This can even extend to Microsoft SQL Server and Microsoft Exchange in their latest versions because they are capable of replicating their own content through log shipping. The sheer number of options here can be a lot to sift through the first time, but the results are worth it.
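The voting idea can be sketched in a few lines. This is a generic majority rule, not MSCS’s actual implementation; the function and the voter names are hypothetical. It assumes every node and the witness each hold one vote and that the cluster may run only while strictly more than half of the votes are reachable.

```python
def has_quorum(voters, reachable):
    """True when strictly more than half of all voters can be reached."""
    votes_present = len(set(reachable) & set(voters))
    return votes_present > len(voters) // 2


if __name__ == "__main__":
    # Two nodes plus a witness that votes but cannot host clustered services.
    two_node = ["node1", "node2", "witness"]
    print(has_quorum(two_node, ["node1", "witness"]))  # one node lost
    print(has_quorum(two_node, ["node1"]))             # node and witness lost

    # Four voters split evenly: neither half holds a majority.
    four_node = ["node1", "node2", "node3", "node4"]
    print(has_quorum(four_node, ["node1", "node2"]))
```

Requiring a strict majority is what prevents two isolated halves of a cluster from both deciding they are the active side.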
My Personal Experience
I’ve always used a SAN from a major vendor that certified the SAN for use with MSCS, and never experienced problems with MSCS. Use a certified SAN, or don’t use MSCS based on shared storage.
The most important factor to being successful with failover clustering is to use high quality hardware for the server and storage system. Look for vendors that have certified their systems for use as part of an MSCS cluster to ensure they got all of the little details right.
Where should you use MSCS?
MSCS is a failover cluster system. Use it when you can’t use a load-balanced clustering option. In general, this is when there’s a natural requirement to have just one of something at a time, most commonly databases (because to be performant they need exclusive access to their files). If you have a load-balanced clustering option, it’s probably going to be less expensive to set up and maintain than MSCS.
If your organization is a solid user of Microsoft SQL Server, I highly recommend investing in at least one MSCS cluster to host your SQL database servers. You can use a single physical cluster to host multiple SQL database servers, an option that makes it particularly cost effective. You can set server affinity so that two instances of SQL Server prefer to run on different physical servers within the cluster, giving you the best utilization of hardware while preserving redundancy. It is somewhat more complicated to set up because you have to use logical servers with SQL Server from the start, which you don’t have to do if there is just one; however, the cost savings can help justify clustering. You might, for example, have both a certification and production SQL Server on one pair of physical servers in an MSCS cluster. This makes it somewhat easier to ensure that your certification and production environments are absolutely identical and lets you generally keep certification and production from interfering with each other without having to purchase two separate clusters.
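The affinity arrangement described above amounts to a simple placement rule: each instance runs on its preferred node when that node is up, and moves to a surviving node otherwise. The sketch below models only that rule; the function, instance names, and node names are hypothetical and do not correspond to any MSCS setting or command.

```python
def place_instances(preferred, nodes_up):
    """Assign each instance to its preferred node if up, else to any node that is up.

    `preferred` maps instance name -> preferred node name.
    Returns a mapping of instance name -> node name (or None if no node is up).
    """
    placement = {}
    for instance, node in preferred.items():
        if node in nodes_up:
            placement[instance] = node
        elif nodes_up:
            # Fall back to a surviving node; sorted() keeps the choice deterministic.
            placement[instance] = sorted(nodes_up)[0]
        else:
            placement[instance] = None
    return placement


if __name__ == "__main__":
    preferred = {"sql_production": "node1", "sql_certification": "node2"}
    print(place_instances(preferred, {"node1", "node2"}))  # normal operation
    print(place_instances(preferred, {"node2"}))           # node1 is down
```

In normal operation both physical servers do useful work, and after a failure both instances share the survivor, so each server has to be sized to carry the combined load.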
Advanced clustering scenarios
Remember that while most articles and documentation talk about the basic clustering case of two servers & a SAN or other shared storage, as of Windows Server 2003 you can have more than two nodes and can have them use separate shared storage, provided that you have a means to synchronize it. This can be used in a few great scenarios:
- Geodiversity: You can have two separate facilities, each with one or more servers and fail over between the facilities.
- Upgrades and Maintenance: You can use the ability to have additional nodes and separate storage to allow you to take the shared storage system entirely offline in the event of disruptive maintenance or upgrades. I’ve actually used this method to incrementally upgrade and replace cluster systems before where taking the risk of a complete switchover was considered too high.
Moving from basic clustering with a single shared storage array to separate storage arrays is a significant jump in complexity and typically cost because you have to have a highly reliable means to keep the arrays in sync. High end storage vendors typically have this capability for their arrays, and there are third party options that can work with anyone’s SAN. Remember that you will need significant network capacity between your sites. Suffice it to say that if you’re going to go down this road, you’ll want help from someone that’s done it before. I recommend engaging storage professionals because this tends to be the most difficult part of the process.
What’s your experience?
Have you used MSCS? How has it worked out for you? Post your comments or drop me a line to continue the conversation.
Tags: Clustering, High Availability, Infrastructure, MSCS, SAN
Posted in Clustering