Introduction to Clustering

Written by Kendall Miller on February 27, 2008 – 12:58 am

Clustering brings a group of like devices (often servers, but it applies equally to appliances) together so they act, at least in some respects, like one device. Generally clusters are created to provide greater scalability at a lower price point, better availability, or both. To simplify matters, we’re going to restrict our discussion to clustering for network appliances (like firewalls) and common IT uses such as web servers and database servers. In particular, we’re going to exclude grid computing (also known as compute clusters) and some other boundary cases. If you’re working in one of those areas, you’re probably not reading this introduction to clustering.

First a little lingo…

To make the discussion below easier, let’s introduce a few terms and define how they’ll be used in the rest of this article.

The general term for each computer or appliance that is a member of a cluster is a node. In general, each node is identical with respect to the service being clustered (e.g. if a web site is being clustered, all nodes have the same opinion of what that web site is).

The two main types of clustering are high-availability (HA) or failover clusters and load-balancing clusters. In both cases more than one system can handle a given service, but they differ in whether multiple systems can be active at the same time: they can in a load-balancing cluster, they can’t in a failover cluster. Because this is the primary distinction, and because both types provide high availability, I prefer the terms failover and load-balancing over “high-availability”. In broad strokes, load-balancing clusters are generally preferable to failover clusters because your investment in high availability pays off all of the time (as additional throughput) and there is generally little or no delay in moving work away from a system that fails.

Failover Clusters

Failover clusters…

  • Provide high availability only. At best they do not improve performance, and there may even be a slight drop in performance depending on how the clustering is done.
  • Often have a short delay in transitioning resources from one active node to another. Requests that come during that time can fail.
  • Often require each node in the cluster to be absolutely identical for reliable operation.
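The behaviour in the list above can be sketched in a few lines of Python. This is only a toy model under simple assumptions, not how any particular product works: the Node and FailoverCluster names are made up for illustration, and a real cluster would detect failure with heartbeats rather than a flag.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True


@dataclass
class FailoverCluster:
    """Active/passive cluster: exactly one node serves requests at a time."""
    nodes: list
    active_index: int = 0

    def handle(self, request: str) -> str:
        active = self.nodes[self.active_index]
        if not active.healthy:
            # Requests that arrive after the failure but before the
            # transition completes simply fail.
            raise ConnectionError(f"{active.name} is down")
        return f"{active.name} handled {request}"

    def fail_over(self) -> None:
        """Move the service to the next healthy node, if there is one."""
        for offset in range(1, len(self.nodes) + 1):
            candidate = (self.active_index + offset) % len(self.nodes)
            if self.nodes[candidate].healthy:
                self.active_index = candidate
                return
        raise RuntimeError("no healthy node available")
```

Note that adding a second node here adds no throughput: the standby does nothing until fail_over() is called, which is the first bullet in practice.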

Common Examples

Failover clustering is your best bet for resources that, due to technology constraints, can’t be run in a load-balanced cluster. This is usually anything that rapidly writes data (like databases) or anything with tight network-level performance constraints (because of how TCP/IP works, it’s very hard to make very low-level load balancing work). In most companies, the key things clustered this way are the firewall and the database server.

  • Microsoft Cluster Service (MSCS): This is the built-in Windows method of creating failover clusters. It supports Microsoft SQL Server, Exchange Server, file shares, and a range of other systems out of the box. It generally uses shared storage to keep each node’s data synchronized; a SAN is highly recommended, but it can be done with direct-attached storage or anything else where you can replicate the storage exactly. For more information, see Why You Should Use MSCS.
  • Firewalls and Hardware Load Balancers: Most network-layer devices use this for high availability, such as firewalls from companies like Watchguard and Cisco and hardware load balancers from companies like Foundry and F5. Note that in this case we’re talking about the appliances themselves, even though they may be what performs load balancing for a cluster (see below).

Application Compatibility

Generally it is easier to ensure application compatibility with failover clustering than with load balancing because failover preserves the general characteristics of running without clustering: the application is only running in one place at a time, it has exclusive access to its storage, etc. For example, Microsoft Cluster Service (MSCS) can generally be used to cluster anything that’s a Windows service without the service being specifically designed for it. Validation is also generally simpler for custom applications because the outcome tends to be binary - either the application works and fails back and forth correctly, or it fails pretty early in testing. Load-balanced clusters conceptually have a much larger number of scenarios to test to exhaustively prove they work.

Load-balancing Clusters (aka server farms)

Load-balancing clusters:

  • Provide high availability and improve scalability. Each node is processing requests so you can process more requests at the same time.
  • Can be transparent or nearly so when a node fails.
  • Usually accommodate diverse nodes with different performance capabilities, software load, etc.
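To make the list above concrete, here is a minimal sketch of weighted round-robin in Python. It is an illustration under my own assumptions (the Node and LoadBalancer names and the weight field are invented for the example), not the algorithm used by NLB or any specific appliance.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    weight: int = 1       # relative capacity; nodes need not be identical
    healthy: bool = True


class LoadBalancer:
    """Weighted round-robin over healthy nodes (illustrative only)."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        # Repeat each node according to its weight, then cycle forever.
        schedule = [n for n in self.nodes for _ in range(n.weight)]
        self._cycle = itertools.cycle(schedule)
        self._schedule_len = len(schedule)

    def pick(self) -> Node:
        # Skip unhealthy nodes so a failure is invisible to the caller.
        for _ in range(self._schedule_len):
            node = next(self._cycle)
            if node.healthy:
                return node
        raise RuntimeError("no healthy node available")

    def handle(self, request: str) -> str:
        return f"{self.pick().name} handled {request}"
```

Every healthy node takes requests all of the time, and when one is marked unhealthy the remaining nodes absorb its share without the caller seeing an error.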

Common Examples

The most common load balanced cluster is a front-end web server. This is because of the natural tendency to separate state management (storage) from the web application (often into a database) removing the first, largest hurdle to load balancing. Additionally, web applications are often developed very quickly using technologies that are not optimized for performance. This tends to make them processor & memory intensive under load which can be very cost-effectively addressed with hardware instead of custom development.

  • Microsoft Windows Network Load Balancing (NLB): This performs basic load-balancing, typically for web servers but it can be used for other systems in certain cases. There are significant limitations in network scalability and management tools. The network scalability limitations depend highly on how sophisticated your network switching hardware is.
  • Load Balancing Appliance: F5 Networks’ BIG-IP appliances have long been considered the gold standard in hardware load balancing, but they are difficult to spec and administer unless you’re used to old-school UNIX administration. They are also very expensive when all you need is web site load balancing. The available options generally fall into two price classes based on whether the vendor believes its product can accomplish anything for anyone (like Cisco, F5 Networks, etc.) or is focused just on web server requirements; the latter generally cost substantially less and are easier to configure. If you don’t have experience with the particular hardware appliance you’ve selected, you should get some expert assistance to select and set up your solution. Be sure to get sufficient knowledge transfer to perform routine support on your own.

Application Compatibility

Ideally, the documentation for each application you want to cluster will have a section describing its compatibility with load-balanced clustering. It is typical to need slight configuration changes for clustering. For example, a clustered web application may need to be configured to store state within a database instead of the normal in-memory storage. If no such information is available, some basic validation can be done to see if it’s worth even attempting. If the application looks like it can plausibly be clustered, then a plan for carefully validating the clustering should be drawn up and carried out before it is put into production.
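The session state example is easy to demonstrate. The sketch below is hypothetical (the class names, the sessions table, and the visit counter are all invented for illustration) and uses an in-memory SQLite connection as a stand-in for the shared database that the nodes of a real cluster would reach over the network.

```python
import sqlite3


class InMemorySessionStore:
    """State lives inside one node's process; other nodes cannot see it."""

    def __init__(self):
        self._data = {}

    def get(self, session_id, default=0):
        return self._data.get(session_id, default)

    def set(self, session_id, value):
        self._data[session_id] = value


class DatabaseSessionStore:
    """State lives in a database that every node shares."""

    def __init__(self, conn):
        self._conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS sessions "
            "(id TEXT PRIMARY KEY, visits INTEGER NOT NULL)"
        )

    def get(self, session_id, default=0):
        row = self._conn.execute(
            "SELECT visits FROM sessions WHERE id = ?", (session_id,)
        ).fetchone()
        return row[0] if row else default

    def set(self, session_id, value):
        self._conn.execute(
            "INSERT OR REPLACE INTO sessions (id, visits) VALUES (?, ?)",
            (session_id, value),
        )
        self._conn.commit()


class WebNode:
    """One web server in the farm; counts visits per session."""

    def __init__(self, store):
        self.store = store

    def visit(self, session_id):
        # Read-then-write is not atomic; a real application would need to
        # handle two nodes updating the same session at once.
        count = self.store.get(session_id) + 1
        self.store.set(session_id, count)
        return count
```

With in-memory stores, two nodes each believe a session is on its first visit; with the shared store, the second node picks up where the first left off. That difference is the kind of thing basic validation should look for.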

Testing Clusters

The Wire Never Lies

First, if you are not using an absolutely off-the-rack clustering scenario, you will need to get ready to inspect network traffic. While Microsoft has included a free tool to do so with Windows, I highly recommend Wireshark (formerly Ethereal) as the gold standard. It’s been said that “the wire never lies”, meaning that the physical network represents the real truth of what’s going on. Any senior server administrator should be able to do a network trace and understand what is communicating and why from the perspective of each server. The reason this is particularly important with clustering is that it will give you absolute proof of where traffic is going between each layer of your infrastructure, and can reveal surprises such as redirects you didn’t believe were happening. Web browsers, particularly IE, are designed for end users, so they tend to hide the true underlying network details or simplify what’s going on. Don’t trust what they present when validating a cluster or diagnosing an issue. Trust the actual packets on the wire. For more on how to do this, see The Wire Never Lies.

Failover Clusters

The big test whenever you change the configuration of your cluster is whether it can successfully fail over, work, and fail back. You want to be sure this works on command so that it’s ready to take over when called upon due to a real problem. It’s not good to discover, when the active node has failed, that your redundant node won’t run the software correctly and automatically.
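The fail over, work, fail back sequence is simple enough to write down as a repeatable drill. The sketch below assumes two hooks that you would have to supply yourself: move_service_to(node), which would wrap whatever your cluster’s admin tooling does to move the service, and service_works(), which would make a real client request. Both names are hypothetical, and FakeCluster exists only so the example runs.

```python
def run_failover_drill(move_service_to, service_works, primary, standby):
    """Fail over, check the service, fail back, check again.

    move_service_to and service_works are hypothetical hooks, not part of
    any particular product's API.
    """
    results = {"baseline": service_works()}
    move_service_to(standby)                 # fail over on command
    results["after_failover"] = service_works()
    move_service_to(primary)                 # fail back
    results["after_failback"] = service_works()
    return results


class FakeCluster:
    """Stand-in used only to exercise the drill in this example."""

    def __init__(self, broken_nodes=()):
        self.active = "node-a"
        self.broken_nodes = set(broken_nodes)

    def move_to(self, node):
        self.active = node

    def works(self):
        return self.active not in self.broken_nodes
```

Running the drill after every configuration change, and keeping the three results, gives you a record that the standby could actually run the service the last time anyone checked.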

Network Test Points

Because clustering will tend to play some interesting tricks at the physical network layer, you should test your clustering installation from at least two places: on the same routed network segment as the clustered IP address and on another segment. It’s also useful to test on the same physical switch and on a different switch. The reason for this is you want to know how quickly the transition will be considered effective by clients on the network, and this will vary depending on exactly how the clustering is done. For example, if the IP address is transferred but the MAC address isn’t, it can take a while before clients on the same network segment (which may have the old MAC address cached) drop their cache entry and ARP again for the address. Windows NLB, when run in its IGMP multicast mode, depends on a switch that handles IGMP correctly. If the switch doesn’t, what will tend to happen is that you will get alternating failures and successes as the switch incorrectly sends traffic to just one NLB node. This is just an example, but it highlights that you want to think about how your traffic travels from the client to the server and which devices along the way have to understand something about the clustered address. Typically this is limited to routers and switches on the same routed segment.
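One way to compare test points is to run the same probe from each of them during a failover and see how long each one saw the service as down. The sketch below is a rough illustration: tcp_probe and longest_outage are names I made up, the probe only checks that a TCP connection can be opened (it says nothing about whether the application behind it works), and the sample format is an assumption.

```python
import socket


def tcp_probe(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def longest_outage(samples):
    """Longest stretch of failed probes, in seconds.

    samples is a time-ordered list of (timestamp_seconds, succeeded) pairs
    collected from one test point. An outage runs from the first failed
    probe to the next successful one, or to the last sample if the service
    never came back.
    """
    longest = 0.0
    outage_start = None
    last_timestamp = None
    for timestamp, succeeded in samples:
        last_timestamp = timestamp
        if not succeeded and outage_start is None:
            outage_start = timestamp
        elif succeeded and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    if outage_start is not None:
        longest = max(longest, last_timestamp - outage_start)
    return longest
```

If the test point on the same segment reports a noticeably longer outage than the one on another segment, a stale ARP cache is a reasonable first suspect. A run of short, alternating failures looks more like the switch problem described above.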

How has clustering benefited you?

What types of clustering do you use? Has it made a material difference in your reliability? Post your comments or drop me a line to continue the conversation.

