
Why you should use Microsoft Cluster Service (MSCS)

Written by Kendall Miller on February 18, 2008 – 2:15 am

If you go through the web and do as much research as you can, you’ll find very polarized opinions about MSCS. I’ve been using it since 2002 and have found it to be outstanding, but I can see some pitfalls that could create a bad rap for it.

Why are you clustering?

First, I think Microsoft does it a disservice in how they market it. Instinctively, most people focus on using MSCS in case a given computer’s hardware or operating system spontaneously fails. In operating a number of clusters over six years, I’d say this was a very rare event for us. In fact, it only happened when we had some brand new hardware fail within its burn-in period. Instead, we’ve found that its great value is in reducing downtime due to maintenance activities.

Example Server Update

Consider the scenario of needing to install the latest patches from Windows Update on your database server. Below are the steps you could go through without clustering:

  1. Wait until your maintenance window (let’s assume it’s 1:00 AM on Sunday morning, the low time of your load profile).
  2. Take the applications that use your database server offline (to be nice to your users and ensure everything closes).
  3. Install the patches on your database server
  4. Reboot your database server
  5. Verify that the server works (that the patches haven’t introduced a problem)
  6. Bring all applications back online

What’s noteworthy in the list above is which items have a variable duration (one that may differ each time you do maintenance and may not be particularly predictable) and which take a fixed amount of time. In particular, #3 and #5 are variable (and #4 may be).

Now let’s play that again if you have MSCS installed:

  1. Install patches on the offline database server node.
  2. Reboot the offline server.
  3. Wait until your maintenance window
  4. Take the applications that use your database server offline (to be nice to your users and ensure everything closes)
  5. Failover to the offline server
  6. Verify that the server works (that the patches haven’t introduced a problem)
  7. Bring all applications back online.
  8. Wait a reasonable period of time (like a few days) and install patches on the server that’s now offline
  9. Reboot the offline server.

It is more steps (because there are two servers involved), but what we’ve done is move the things that take variable time outside of the critical window when the system is in maintenance mode. Everything that happens during maintenance mode (steps 4-7) is predictable. Additionally, I consider any server reboot to be risky. Problems tend to show up during a reboot that show up at no other time: hardware problems, and even in a reasonably tight environment it’s possible that a configuration change was made that hasn’t taken effect yet and will cause a problem on reboot. With an MSCS cluster, this risky event happens while the server is offline and won’t affect the production use of your application. You’ve also verified the basic integrity of the patches (after all, the server booted and you can monitor its event log to know it’s basically healthy) before even scheduling your maintenance period.
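The two procedures above can be compared with a small sketch. This is only an illustration of the argument, not a tool: the `Step` class, the step names, and the `in_window` / `variable` flags are my own labels for the lists above, and the reboot in the standalone case is marked variable even though the text only says it "may be".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    in_window: bool  # True if applications are offline while this step runs
    variable: bool   # True if the duration is hard to predict


# Without clustering: everything after step 1 happens inside the window.
STANDALONE = [
    Step("Take applications offline", True, False),
    Step("Install patches", True, True),
    Step("Reboot server", True, True),   # "may be" variable; assumed so here
    Step("Verify server", True, True),
    Step("Bring applications online", True, False),
]

# With MSCS: patching and rebooting happen on the offline node.
CLUSTERED = [
    Step("Install patches on offline node", False, True),
    Step("Reboot offline node", False, True),
    Step("Take applications offline", True, False),
    Step("Fail over to patched node", True, False),
    # Treated as predictable, following the text: the node already booted cleanly.
    Step("Verify server", True, False),
    Step("Bring applications online", True, False),
    Step("Patch the node that is now offline", False, True),
    Step("Reboot the node that is now offline", False, True),
]


def unpredictable_in_window(steps):
    """Names of steps that are both inside the window and of variable duration."""
    return [s.name for s in steps if s.in_window and s.variable]


print(unpredictable_in_window(STANDALONE))
print(unpredictable_in_window(CLUSTERED))
```

The standalone list leaves three unpredictable steps inside the maintenance window; the clustered list leaves none, which is the whole point of the exercise.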

The comparison gets even better when you consider what happens in the first scenario above if you need to roll back a patch. With a cluster, you just fail back to the original node and you’re good to go. Without a cluster, you have to uninstall the patch, reboot, and re-certify.

Benefits Summary

  • Clustering makes system maintenance predictable and short.
  • Clustering lets you do risky things during main business hours instead of the middle of the night.
  • Clustering lets you roll back a change very quickly and easily.

If you’re clustering for these reasons, you’ll get great value out of it.

How are you clustering?

Shared Storage - The Traditional Approach

To their credit, Microsoft has worked to make MSCS work with a pretty broad range of hardware. Traditionally, MSCS depends on being able to expose disks to more than one server at the same time. This can be done with traditional server direct attach storage (DAS) technology such as SCSI (and now SAS); however, it relies on a set of very intricate hardware: RAID controllers in each server, special cutover terminators in the storage enclosure, etc. There is a lot that can go wrong, and when it does you may lose all of your data. For example, the configuration in the RAID controllers has to agree on what the virtual disks are. The shared storage is used at least for a special drive (called the Quorum drive) that stores central cluster configuration data and defines which node is currently the active node of the cluster. Additionally, any clustered service (like Microsoft SQL Server or Exchange) typically has its disks shared between the nodes in the cluster as well. If you don’t need to split your clustered nodes into different data centers (to create a geodiverse or “stretch” cluster) then shared storage is a solid and straightforward way to go.

What I recommend is that you use a storage technology that encapsulates all of the RAID technology separate from the servers and is based on a technology that is fundamentally oriented towards sharing disks with multiple servers. This way you minimize the configuration on each server and the probability that a difference between servers will lose data. The traditional way of doing that is with a Storage Area Network (SAN). If you consider the two primary SAN technologies (Fibre Channel and iSCSI) both are fundamentally about sharing storage with multiple servers.

If you are only installing a shared storage array for one cluster, you can technically do without the hardware that makes a SAN a SAN: you can have a shared array directly attached to two servers. Most storage arrays support this, and it’s a very cost effective way to get started with separate storage arrays while leaving a foundation you can build on later to make a full size SAN and optimize your operating costs. You’ll realize another benefit, which is that these arrays are almost universally much faster and more scalable than direct attach storage, for a range of reasons. You’ll be amazed at how much scalability it adds to your database server.

Shared Nothing Approach

Possible in Windows Server 2003 R2 Enterprise, and significantly improved in Windows Server 2008, is the ability to set up a cluster that doesn’t rely on the quorum drive being a single physical resource. Instead, the cluster either employs a third server (called the Witness server, which can’t actually host the clustered processes) that each node in the cluster can talk to across the network, or, when the cluster itself has three or more nodes, uses voting between the servers. Because the quorum no longer has to be physically accessible to every node in the cluster, services that don’t rely on shared storage (such as a simple Windows service) can be easily clustered. This can even extend to Microsoft SQL Server and Microsoft Exchange in their latest versions because they are capable of replicating their own content through log shipping. The sheer number of options here can be a lot to sift through the first time, but the results are worth it.
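The idea behind a witness or node voting is a plain majority rule, which can be sketched in a few lines. This is a simplified illustration of majority voting in general, not how MSCS implements its quorum; the function name `has_quorum` and the vote counts are assumptions made for the example.

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """Return True when the reachable voters form a strict majority."""
    if not 0 <= reachable_votes <= total_votes:
        raise ValueError("reachable_votes must be between 0 and total_votes")
    return reachable_votes > total_votes // 2


# Two cluster nodes plus a witness: three votes in total (assumed layout).
TOTAL_VOTES = 3

# A node that can still reach the witness holds 2 of 3 votes.
node_with_witness = has_quorum(2, TOTAL_VOTES)

# A node cut off from both its partner and the witness holds only its own vote.
isolated_node = has_quorum(1, TOTAL_VOTES)

print(node_with_witness, isolated_node)
```

Requiring a strict majority means that when the network splits, at most one side can keep running the clustered service, which is the job the single quorum drive did in the shared storage design.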

My Personal Experience

I’ve always used a SAN from a major vendor that certified the SAN for use with MSCS, and never experienced problems with MSCS. Use a certified SAN, or don’t use MSCS based on shared storage.

The most important factor to being successful with failover clustering is to use high quality hardware for the server and storage system. Look for vendors that have certified their systems for use as part of an MSCS cluster to ensure they got all of the little details right.

Where should you use MSCS?

MSCS is a failover cluster system. Use it when you can’t use a load-balanced clustering option. In general, this is when there’s a natural requirement to have just one of something at a time, most commonly databases (because to be performant they need exclusive access to their files). If you have a load-balanced clustering option, it’s probably going to be less expensive to set up and maintain than MSCS.

If your organization is a solid user of Microsoft SQL Server, I highly recommend investing in at least one MSCS cluster to host your SQL database servers. You can use a single physical cluster to host multiple SQL database servers, an option that makes it particularly cost effective. You can set server affinity so that two instances of SQL Server prefer to run on different physical servers within the cluster, giving you the best utilization of hardware while preserving redundancy. It is somewhat more complicated to set up because with multiple instances you have to use logical servers from the start with SQL Server, which you don’t have to do if there is just one; however, the cost savings can help justify clustering. You might, for example, have both a certification and production SQL Server on one pair of physical servers in an MSCS cluster. This makes it somewhat easier to ensure that your certification and production environments are absolutely identical and generally keeps certification and production from interfering with each other without having to purchase two separate clusters.

Advanced clustering scenarios

Remember that while most articles and documentation talk about the basic clustering case of two servers & a SAN or other shared storage, as of Windows Server 2003 you can have more than two nodes and can have them use separate shared storage, provided that you have a means to synchronize it. This can be used in a few great scenarios:

  1. Geodiversity: You can have two separate facilities, each with one or more servers and fail over between the facilities.
  2. Upgrades and Maintenance: You can use the ability to have additional nodes and separate storage to allow you to take the shared storage system entirely offline in the event of disruptive maintenance or upgrades. I’ve actually used this method to incrementally upgrade and replace cluster systems before where taking the risk of a complete switchover was considered too high.

Moving from basic clustering with a single shared storage array to separate storage arrays is a significant jump in complexity and typically cost because you have to have a highly reliable means to keep the arrays in sync. High end storage vendors typically have this capability for their arrays, and there are third party options that can work with anyone’s SAN. Remember that you will need significant network capacity between your sites. Suffice it to say that if you’re going to go down this road, you’ll want help from someone that’s done it before. I recommend engaging storage professionals because this tends to be the most difficult part of the process.

What’s your experience?

Have you used MSCS? How has it worked out for you? Post your comments or drop me a line to continue the conversation.


Posted in Clustering | No Comments »

Top three things to improve reliability

Written by Kendall Miller on February 9, 2008 – 2:03 am

Quick - what are the three things you should do to make the greatest improvement in the reliability and availability of the systems you’re responsible for?

Marketing for IT products and the general media tend to emphasize opportunities to purchase reliability. This makes sense because they’re in the business of selling things. A classic example is the emphasis on extraordinarily redundant server hardware. A modern server can be purchased with redundant disks, redundant power supplies, redundant memory, and even in some extraordinary cases redundant processors. This is designed to let vendors prove that their server hardware has a staggeringly high mean time between failures, and who wants to be the IT manager that takes an outage because they didn’t purchase a reliability option they could have?

Before charging down the road of buying ever more elaborate hardware redundancy, let’s sit back and look at the big picture of where failures are coming from.


Posted in Clustering, Infrastructure, Software Development | No Comments »

Effort doesn’t equal Value

Written by Kendall Miller on February 2, 2008 – 1:20 pm

Consider this simple point:

Effort ≠ Value

Think about it for a few minutes and it seems patently obvious: Just because something’s difficult doesn’t mean it has great value. For example, if I want to mail 50 letters to clients and I put an individual stamp on each one instead of using an automatic postage machine I’ve achieved the same value: I can now send these letters to each of my customers. They’ll get there just as fast, the postage is just as valid. Therefore, if it takes me 15 minutes to put the stamps on one by one vs. about 1 minute to run it through a machine I’ve spent 15 times as much effort to achieve the same value.

It works in reverse as well: Just because something has great value doesn’t mean it’s intrinsically difficult. It may be exceedingly valuable to me to get a message to a client that lives on the other side of the country, and yet it’s really easy to do: In just a few seconds with my cell phone I can reach out any time of the day, from virtually anywhere. Low effort, high value.

Obvious, and yet we ignore the implications of this every day. We naturally assume that anything worthwhile takes effort, and that anything that takes a lot of effort was worthwhile.

Good examples of low effort, high value

Cosmetic defects are a classic example of this. It isn’t unusual at all to go through a new software application and find a substantial number of cosmetic defects: Alignment issues, inconsistencies in language (is it login, logon, or user id? Do you click or press that button?), spelling or language errors and a range of items that aren’t application behavioral issues (like tab order). Developers tend to instinctively minimize these issues: They’re trivial to resolve and they don’t prevent the application from working. They aren’t anywhere near whatever hideously complicated part of the system the developer is really worried about, and they’ll take no time to get right later. They can’t be that important, so development teams tend to not talk about them or work on them. Even the term “cosmetic defect” is often used as a label for trivial or low value: “that’s no big deal, it’s just a cosmetic issue. Now let’s talk about that rare crash on every other leap year if you attempt to delete a customer with no records!” This perspective isn’t even particularly unreasonable if you’re looking at the development process from a risk management perspective: You know the issues can be cleaned up quickly and without a lot of technical risk, and if you clean them all up now you’ll still have to do a recheck of the system before release because new ones will show up.

Now look at it from the standpoint of an end user of the system. The system is a black box: They don’t see the really artful code that figures out automatically when they enter a name as Last, First or First Last or how you managed to make a really fast look-ahead search system despite the large number of records you have to work with. Instead, they’ll see what’s right in front of them: The user experience of the application itself. If they start it up and notice immediately that things aren’t lined up vertically & horizontally or there are spelling errors, it will give rise to the classic line of reasoning that if you didn’t get this right, what hope is there that the black box is right? The more you protest that this is easy to fix the worse it gets: If you couldn’t get the easy to fix simple stuff right then there’s now no way the detailed 12 step process for determining how much to bill a client is going to be right, and the user is going to have to check it all before you regain their trust.

The good news is that this direction is the easiest to avoid as a manager or team member of a software development project. Once you’ve had the above experience once or twice, you will start to get wise and do a cosmetic issue pass at strategic points in the time line - usually just a few builds before it’s going to be seen by people outside of the team. You’ll be surprised at both how many items show up each time, and how easily they clean up. Then, while you’re sweating during the big demo about whether you’re going to get a runtime error you’ll at least have the comfort that what they are seeing while they’re waiting for the next page represents the good work your team did in a way that communicates to the average user that can’t see behind the curtain.

Good examples of high effort, low value

This trap is more dangerous and harder to avoid. At many points through the development process you’ll have opportunities to choose architectures, designs, algorithms and other items that will either increase or decrease the effort it’ll take to complete the project. You might choose to not use that built-in dialog to open files and instead make your own dialog because of one annoying behavior you really want to avoid. Or decide that you want to make a better column sizing routine for the grids you display so that you can avoid either trying to cram too much on a small screen or having acres of empty space on a large one. None of these are on their own bad ideas necessarily, and that’s part of the trap: Most development processes by design tend to focus team attention on the things that are hard, high risk, or just time consuming because these have the biggest ROI for project management activities. This reinforces our built in instincts to presume that the harder the work, the greater the value.

What this ignores is that the value is essentially constant regardless of effort: Any particular feature or capability has a set value in the eyes of the user. Our goal is to realize that value with the minimum effort or, failing that, with what appears (prior to construction) to be the minimum effort that carries an acceptable risk. In its most direct form, this means that the user places the same value on a five thousand line algorithm to determine optimal column width as on a method built into a control that gets it right, as long as the outcome achieves their expectation.

<tip>Corollary: Be sure what’s important to you is also important to your users before investing a lot of time. Perhaps they don’t care if there’s a bunch of empty space on their 24″ widescreen monitor as much as you do. Get evidence commensurate to the effort you think it’ll take to resolve the issue.</tip>

Understand the trap

This issue tends to manifest itself in some classic ways. One is when a developer argues passionately in favor of a complicated algorithm even in the face of peer review that casts substantial doubt on its necessity. Typically the developer caught one small aspect of the problem and has ruthlessly optimized for it, and uses that one point as the proof of why simpler approaches don’t work (“If users are constantly switching back and forth between these two displays it’s 30% faster to do it this way than what’s built in”). These items also tend to be defect prone and difficult to explain to others.

Complicating this trap are a few factors:

  • There are hard problems to solve: You can’t assume every hard problem is really an overcomplicated solution. Most applications will have at least two places where there is some real trickery and engineering to get the right result each and every time. If there weren’t, your users probably wouldn’t want the application in the first place.
  • There are low value problems to solve: There are hard problems that have to be solved, and some of these are even relatively low value to the customer but are still a requirement. Consider this example: The customer places relatively low value on your application not crashing when they run it. Don’t get me wrong - if it starts crashing they will be very upset, but they simply assume that it won’t crash. All joking about Microsoft aside, any application you write is virtually guaranteed to be more crash happy than Microsoft Office is. So you’re going to end up investing a lot in something users don’t really place a lot of incremental value on.
  • Developers are Optimists: Developers like hard problems (after all - hard problems are valuable problems according to our instincts) and want to solve them. They will underestimate the effort going in and overstate the value of the journey. If it’s a new problem, it’s unlikely that their estimate is particularly great even if they aren’t focused on why a particular complicated approach is necessary.

Striking the Balance

How to avoid this trap? First, for these problems to get out of hand, one or two developers usually have to be able to go off away from the herd for long enough to cook up a complicated idea, justify the effort to themselves, and then get far enough into the swamp to be in real trouble. Depending on your specific software development approach, find ways to catch the telltale signs before developers sink enough effort into the solution to get permanently attached to it.

Second, be fully prepared to throw out an already developed solution regardless of how much code or effort it is. In other words, the decision on whether or not to back up and take another approach should be largely blind to how many lines are being thrown out as long as they are all part of the same solution. Even at this point the instinctive desire to equate effort and value will creep into the entire team’s thinking: people will look at the large block of code and assume that it must be necessary, and that we’re just missing the subtlety of why it is the light & the way. This is what source code control is for (you do use source code control, don’t you?). It enables you to fearlessly reject a bird in hand for a simpler bird that may converge faster, have fewer issues, and ultimately be a more cost effective way of providing the value your customers are expecting. Remember that even if you’ve taken a complicated implementation through initial unit testing, there is still a substantial investment that will be made in that code over time.

A production implementation is worth many theories

This article has been talking about high effort ways of achieving value in software, and the approach shouldn’t be generalized into applying to applications simply because they are large or complicated, or even any particular solution that is large and complicated - provided it got there incrementally over time. While it’s often tempting to look at a few hundred line block of code that does just one thing and think that in this age of objects, partial template classes, interfaces, and reflection there just has to be a cute, simple implementation that’s less than half the size and complexity of the current solution, there are three key issues with this thinking:

  1. Large, stable code already achieved value: If the block of code is substantially stable and accepted by the customers then it has achieved its value and any effort spent on refactoring it that doesn’t also deliver more value to customers isn’t improving the value of the application.
  2. Refactoring introduces defects: It’s virtually guaranteed that in the process of refactoring the existing routine sufficiently to make a good dent in its complexity you’re going to introduce some new defects just due to conceptual or implementation oversight during the process. It’s generally not considered a great justification to management that you introduced defects that then require expense to clean up in an effort to avoid possible future expense maintaining code.
  3. Code gets larger because it handles very subtle points: If the code organically grew over time to be a complicated routine, it probably did so because it was progressively asked to handle a number of interesting boundary cases that experience with the application proved necessary. In the minds of the users, these subtle behaviors can be some of the greatest value they place on your application - and yet they’ll never mention the feature when talking with you about it. Therefore, when you refactor it you have to preserve every little behavior, and that is often infeasible (with the exception of true defects in design of the original code - well isolated routines that can be replaced with provably equivalent code)

If it’s already in production and doesn’t have a critical flaw that the business needs addressed, leave it alone.

Clearly communicate the end-user value within your team

The best technique to avoid this problem is to make sure your development team has a tradition of discussing the end-user value of the work being done. This tradition would mean that anyone gets to ask what the end user value of any work being done is - that’s a free question never met with ridicule. It may take some practice to make sure the question is asked correctly, i.e. it’s asked in a way that gets the entire team to back up and be clear about why the work is important. Within that questioning, it’s important to make sure the discussion is based on what’s important to the users of the system, not the developers or other non-users. There are times to focus on maintenance or other non-user issues but with rare exception it isn’t the reason you’re writing the system.

With some practice, this will become a strong self-regulation mechanism for the team, ensuring that your discussions about design and approach are grounded in the needs of your customer. It creates a good mental yardstick for how much time to invest in a solution before going back to the customer to re-verify the requirement.

What’s your experience?

Have a good story to share? Have a critique? Post your comments or drop me a line to continue the conversation.


Posted in Process, Software Development | No Comments »