First, Fly the Plane

Written by Kendall Miller on March 16, 2008 – 8:45 pm

I used to work with a former Navy A-6 pilot and instructor.  One of his standard techniques for helping pilots deal with emergencies was to train them to take an immediate action as soon as they noticed a problem - an action that had no consequence but would satisfy the need to do something.  What he trained them to do was reset the built-in timer clock.  Ostensibly, this was so they would know later how long the problem had been going on, but its true purpose was to give them a single, standard action to satisfy the human need to do something; then they could take time to reflect on the problem.  Step two on the checklist was fly the plane. There have been several CFIT (controlled flight into terrain) accidents where pilots were too busy troubleshooting a problem to avoid the ground.  The pilots forgot their first responsibility: put flying the plane ahead of any other activity.

When doing IT Operations, there’s a lot you can learn from aviation.  I’ve seen several situations where technicians have caused much larger problems while troubleshooting small ones.  This comes from the same mindset that caused air crashes:  you become so focused on the immediate problem that you are no longer aware of your environment. The longer you work at a problem, the more likely this will happen.

A few team techniques you can use to help avoid this:

  • The Two Person Rule: Have two technicians involved in the problem with one taking the immediate actions and the other taking a longer view.
  • Separate Diagnostics from Remediation: Complete your non-invasive diagnostic activities before you attempt any remediation. This gives you a discrete point, before you start putting things at risk, to recheck your assumptions about dependencies and risks to other systems.
  • Peer Review: Before approaching a problem, discuss your approach with two other people on your team (at the same time). If that approach isn’t successful or you need to deviate from it, reconvene the group to discuss again.

In many ways this is an extension of Don’t Taunt the Bear.  When working on a problem during business hours (or, if you like, non-maintenance hours) before taking anything off line, even for a moment, ask yourself:  Do I need to take this action right now?  How sure am I that it won’t have any unexpected consequences?  Is the risk I’m wrong worth the benefit of doing this right now?

All of this may sound like it’s going to add time to problem resolution, and it might. However, remember that your first responsibility is to keep services flowing to your users. Most users will be unsympathetic if they lose access to their home directories because you were troubleshooting a problem with the printer in accounting and took down the same services that share their files.


Posted in Infrastructure | No Comments »

Don’t Taunt the Bear

Written by Kendall Miller on March 10, 2008 – 12:45 am

When I first started at John Deere, I was working in a division that deployed systems to dealerships. Up until that point, they hadn’t done anything with hardware RAID. Dealerships are extremely cost-conscious, and while I was a huge believer in the value of hardware RAID arrays, they needed to prove their merit. At that time, HP was the preferred vendor for dealership equipment so I had gotten them to provide us a demonstration server with a hardware RAID card so I could show it off. The high point of my demo to the service staff was when I pulled a drive out of the running server while it was in the middle of running a very visible, high load process - and to everyone’s surprise it would just keep running! The first time I did the demo, it worked great - I pulled out the first drive and the server didn’t miss a beat.

A day later I was doing the same demo for a group of managers. The previous day’s work had been fruitful - it had gotten the attention I wanted and now a higher group wanted to discuss it. This time around, someone raised the question “so, any drive can fail and the system keeps running?” With much bravado I replied “sure! Watch!” and pulled out the second drive. Two seconds later to my shock the system froze and then went to a blue screen.

This was when I discovered that, unlike the Compaq systems I was used to, the HP system didn’t automatically rebuild by default when you reinserted the drive.

I took a number of lessons away from this:

  1. Don’t assume each vendor’s equipment works the same way, even if that way seems to make a lot of sense.
  2. There is almost no amount of check & recheck that is too much when removing redundant components.

When you work with systems designed for high reliability, it’s often tempting to take advantage of the innate redundancy of the system to allow you to be somewhat more cavalier in your operational procedures. For example, if you have two web servers that are part of a load-balanced cluster, conceptually you can take one offline, reboot it and do whatever - right in the middle of the day when it’s convenient to your IT staff. On the surface, there’s nothing wrong with this - if everything operates as designed, you should be able to rip out the second server and do whatever you want without causing a problem. It’s very tempting to forget the cluster while working on the server.

However, it often pays to be vigilant in this circumstance. Don’t taunt the bear - just because it shouldn’t cause a problem doesn’t mean it won’t cause a problem. For example - what if the server comes back online during the reboot? Depending on how exactly your load balancing system works, it may start getting new requests because it appears to be operational. It’s very hard to explain to your peers and the rest of the business why the service went offline because you took a shortcut.

There is a fine line between taking advantage of redundancy and causing problems.

Don’t count on Redundancy

At a SaaS company I worked for we had a highly redundant SAN. Each server had two cards connected to two independent switches, which in turn each had a connection to the two storage processors that ran the array. The whole system was designed and certified by the vendor to operate without interruption in the face of a failure of a card, switch, storage processor, etc. It also was designed to be continuously operational while having every component upgraded - the firmware of the switch, the storage processors, etc.

This highly redundant design opens the possibility of performing configuration changes, firmware upgrades, even component replacement during the day while business is going on - after all, it should work just fine. This is a good example of being tempted to taunt the bear - just because a system should be redundant and not have a problem with what you’re doing, don’t bank on that capability if you don’t have a compelling reason to do so. If you have to do it, don’t rely on automatic redundancy behavior - manually take the component offline.

Treat the bear with respect. If you can, schedule work for maintenance time periods so that if there is a service interruption it will have the smallest impact. If you have a good deal of experience that a particular action won’t cause a problem then you might perform it just outside of business hours instead of during maintenance time periods (which are often in the dark of night).

Restoring Redundancy

The rules change a little when dealing with a failure. For example, if you have a drive fail in a redundant array and get in a new drive, you have to balance the competing goals of restoring redundancy and the risk of replacing the drive. There are a number of risk elements in replacing a failed drive:

  1. You could pull the wrong drive, causing the whole array to fail.
  2. The physical disconnection of the drive could cause a SCSI bus reset or some other momentary interruption of data on the array.
  3. The new drive could be electrically defective and short the bus.
  4. Mechanically inserting the drive could disrupt the bus or jar another drive or other physical part, causing the array to fail.

So, how do you balance the desire to replace the failed drive with the risks of causing the array to fail?

  1. If the system is stable and still redundant, wait until the next scheduled maintenance period to perform corrective action. There’s no rush.
  2. If it is not redundant, but operable, you need to balance risk with benefit. It is very unlikely that an independent part will fail within 24 hours of another failure, so you can almost always wait until a low activity time outside of business hours or even in the middle of the night to replace the component.
  3. If the system is not stable, you have the most difficult decision. First, don’t make this on your own. Get together at least the available IT engineers and, if at all possible, a representative of the business process(es) affected by the problem. You need to balance the current instability with the probability that you will make it worse by changing the system. If it’s just a dead drive, this is pretty easy: Low risk, high benefit (however it’s unlikely you’d be in an unstable situation if this happened).
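
The three cases above can be sketched as a small lookup. This is only an illustration of the reasoning, not a tool from the original post; the names ArrayState and when_to_repair are made up for the example.

```python
from enum import Enum


class ArrayState(Enum):
    """Hypothetical labels for the three situations described above."""
    REDUNDANT = "stable and still redundant"
    DEGRADED = "operable but not redundant"
    UNSTABLE = "not stable"


def when_to_repair(state: ArrayState) -> str:
    """Return the repair timing the list above suggests for each state."""
    if state is ArrayState.REDUNDANT:
        return "wait for the next scheduled maintenance period"
    if state is ArrayState.DEGRADED:
        return "repair at a low-activity time outside business hours"
    # Unstable: the decision should not be made by one person.
    return "convene engineers and a business representative before acting"


print(when_to_repair(ArrayState.DEGRADED))
```

The point of writing it down this way is that the answer depends only on the state of the system, not on how eager you are to swap the part.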

Lockout / Tagout

Clustering systems combine the ability to automatically recognize when a node is down (automatic failover) and be manually told to ignore a node (manual failover). Before performing invasive work on a node in the cluster that has been taken offline automatically, go back to the clustering system and place the node offline manually. Think of this as being the equivalent of procedures used when working with dangerous machinery - Lockout/Tagout. Straight from our friends at OSHA:

“Lockout/Tagout (LOTO)” refers to specific practices and procedures to safeguard employees from the unexpected energization or startup of machinery and equipment, or the release of hazardous energy during service or maintenance activities.

This is exactly what we want to do - make sure while we’re performing actions that impair the availability of part of a reliable system we have the cluster configured so that the part can’t be accidentally used. There are two parts of this: Lock out the item so it can’t be unintentionally accessed and tag the device so that everyone knows that it’s locked out. You want to be clear on how to accomplish both for each cluster you have. The latter may take the form of just notification - an email to your support team - or a post on a central site. The point is you need a big, visible way of clearly communicating the status of the device.

If your clustering mechanism doesn’t have a way of doing this, or it relies on the node itself (such as Windows NLB), you should consider it always live and dangerous.
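
As a rough sketch of the lock-and-tag idea, the snippet below keeps a simple in-memory record of which nodes are locked out, by whom, and why. Everything here (LockoutBoard and its methods) is hypothetical; a real cluster would use its own management interface for the lock, and the tag would be whatever notification your team actually reads.

```python
from dataclasses import dataclass, field


@dataclass
class LockoutBoard:
    """Hypothetical record of locked-out nodes: node -> (owner, reason)."""
    tags: dict = field(default_factory=dict)

    def lock_out(self, node: str, owner: str, reason: str) -> None:
        """Lock: mark the node unusable. Tag: record who did it and why."""
        if node in self.tags:
            raise RuntimeError(f"{node} is already locked out by {self.tags[node][0]}")
        self.tags[node] = (owner, reason)

    def release(self, node: str, owner: str) -> None:
        """Only the person who placed the tag may remove it."""
        tag = self.tags.get(node)
        if tag is None:
            raise KeyError(f"{node} is not locked out")
        if tag[0] != owner:
            raise PermissionError(f"{node} was locked out by {tag[0]}, not {owner}")
        del self.tags[node]

    def may_receive_traffic(self, node: str) -> bool:
        """A locked-out node must never be treated as available."""
        return node not in self.tags


board = LockoutBoard()
board.lock_out("web2", "alice", "replacing failed drive")
print(board.may_receive_traffic("web2"))  # False while the tag is in place
```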

Nice Bear. Friendly Bear.

If the bear is working well, let him continue doing what he’s doing. Your running system should be treated with respect at all times, because there is a great deal of complexity that goes into each of the elements and how they work together, even if it appears simple on the outside. As a person responsible for a reliable system, you need to always be thinking in the long term. You don’t want to cause an outage just to deploy an upgraded component or firmware. Almost without question, the theoretical issues fixed by the firmware update aren’t going to be as important to your customers as the real issues caused by a service interruption.


Posted in Infrastructure | 1 Comment »

Reliability is a Mindset

Written by Kendall Miller on March 6, 2008 – 12:44 am

Last week I was attending a training course on sales from a company I really respect - EntreQuest. One of the things I love about their courses and consulting is they aren’t shy about getting right to the fundamental (and often fundamentally hard) human basis for problems. One of the things they emphasize is that results are driven by process (including technology) which is in turn driven by mindset. If you don’t have the right mindset, you won’t achieve the results regardless of how much technology you throw at it. This is the basic justification for why the success rates of telemarketing (and other sales efforts that are all process, no mindset) are so low.

What was interesting to me in particular about this was how well it relates to conversations I typically have about reliability. Depending on where someone is in their experience curve they may talk about a particular technology, software development practice, or problem they’ve had. If they are really experienced, they go directly to either processes or culture. The very best tend to just talk about culture and mindset. This is bad news

In engineering the terms vary slightly, but I believe the principles are still completely valid: Results are driven by technology (which includes the processes, software, and hardware), and technology is driven by mindset. When a mindset is held by a company, it’s called the culture. Your culture will exert a constant pressure on your technology like the current in a river: Either it will reinforce your goals or work against them.

You can make short term or localized improvements by focusing on just the results or technology, but to make a lasting change you need to be moving with the current.

Establish a Reliability Culture

Within your department or company (whatever scope you can influence), make reliability a fundamental aspect of who you are and how you solve problems. If you instill a mindset behind every discussion that your solutions will scale to a certain size, be continuously available, or other aspects of reliability, your technology choices will be imbued with this stance:

  • Your development process will be designed to reduce or eliminate reliability risks. When your business partners ask for a change at the last minute, you won’t have to explain that all changes are risky.
  • You won’t talk yourself into short-cutting testing. Instead, you’ll structure your development process to drive testing automation to reduce the cost of testing (allow you to run full tests more often) and ensure consistency.
  • Your developers will naturally avoid low-reliability personal practices like being possessive about code, not commenting, incomplete or inconsistent error handling, and poor configuration management strategies.
  • Your deployment environment will have appropriate hardware and software. You will be able to get proper monitoring tools and use hardware with sufficient redundancy and performance.
  • Your business partners will be more receptive to conversations about schedules, knowing that under pressure they have to give on functionality instead of reliability.

As reliability becomes a core element of your culture, each individual will start to see the thousands of little decisions they make each day differently and unconsciously approach them from a perspective of reliability, as if they asked “what is the most reliable way to accomplish (whatever I’m doing now)”. At its best, it will shift things that happen as conflicts between people into corporate discussions - instead of your business partners feeling they have to convince you personally to add a new feature (viewing you as the roadblock) it will become how do we accommodate a business need within the context of our corporate goals for reliability. It is significantly easier to create a partnership in this scenario that has you understand their goals and them understand yours because you have a shared value and commitment to work within.

Reliability won’t always win out

Even in a reliability culture, there are very sound reasons to do certain things that entail risk. No one element of your culture is absolute, but it must always be respected and considered. For example, it could be that the system in question is an internal system that has a limited ROI. In this case it just isn’t appropriate to invest a great deal in reliability at the expense of ROI unless the system is unusable without it. Alternately, it is often appropriate during startup phases when the downside cost of a reliability problem is low (e.g. there are no or few existing users or no performance guarantee) or the mitigation cost is excessive (e.g. geodiverse hot sites).

Having reliability as a fundamental part of your mindset is still helpful in these situations because it ensures that a decision that impacts reliability is deliberately made and openly understood. As a company, you have to choose your battles and what risks you are going to mitigate. In some cases, it’s best to just run the risk and wait to see if it manifests before pouring energy into fixing it. Alternately, the risk may be scalability - if you are wildly successful, you’ll have to change your software to handle it. This is often called Technology Debt.

Taking on technology debt is often necessary when starting a product or venturing in new territory. The key is that the business and technology parties know that it’s a deliberate decision to take on that debt, instead of it being a quiet decision made just within the development team. That way if the risks turn out to become reality, the business doesn’t burn time arguing about how you got where you are and instead recognizes that it was a deliberate and well considered decision that now has a consequence that must be handled.

Reliability isn’t always Suitable

Not every company should have reliability as the defining element of its culture. It isn’t necessarily that these companies don’t want reliable results, it’s more that reliability isn’t their differentiator or important enough to be a core element of the culture. For example: Compare the Linksys and Cisco brands. Both can sell you a Wireless-G access point that on the mainline specifications are comparable: They support the same primary standards, offer comparable throughput and security features (for most people), and to many customers they would be indistinguishable.  However, Linksys tends to produce a model, make a few essential firmware updates and move on. If the unit needs to be reset periodically or a new device shows up on the scene that causes a problem with it that’s potentially OK. Customers that pay $70 instead of $700 for a wireless access point aren’t expecting the same degree of reliability. If Linksys attempted to do all of the reliability testing that Cisco does, they wouldn’t be able to hit the price points or time to market that drives their brand. Their product must be reliable enough that customers will find it suitable for the target market, but it isn’t necessary to pursue ultimate reliability.

Take a hard look around you. What level of reliability is appropriate for each area of your business? What are the reliability goals of the company? What is the prevailing culture? If you find yourself out of sync with your company’s goals on reliability, it could be that it isn’t the company that needs to shift but rather you may need to explore other options.

Change Begins at Home

The next time you’re frustrated by the results your team is achieving, don’t leap on the technology bandwagon first. Back up and look at how you might incorporate a reliability mindset into your own work as a starting point for catalyzing broader change. Have a series of conversations in your team to ensure you establish a common understanding of what your principles are - not just with reliability but other guiding principles as well. From that it will become easier to know what technologies (software, hardware, processes) will support the results you want to achieve. Start with your team and move out through your company; the results can speak for themselves.


Posted in Infrastructure, Software Development | No Comments »

Two Person Rule

Written by Kendall Miller on March 3, 2008 – 10:46 pm

Whenever working on the components of a high reliability system, remember that the biggest single cause of availability problems is people - generally through clicking the wrong thing, typing the wrong instruction, or not seeing the consequences of an action. A good procedure to minimize the risk of unintended harm while working on an important system (whether it’s clustered or not) is to have two people involved in the physical work. It’s the IT Operations equivalent of pair programming. For example, if you are taking a cluster node offline you want to be sure you take the right one offline. Even in aviation, where there are good procedures to avoid mistakes like this, it still happens and can cost lives. Your situation isn’t as dire, but the principle remains the same: When performing operations that can directly impair your availability, use an obvious two person structure to make sure you do the right thing:

  1. Say what you’re going to do.
  2. Have the second person confirm that it’s the right thing and you’re on the right one.
  3. Perform the action.

It may feel pedantic, but it will keep you focused on what you’re doing and ensure you don’t have to explain why you deactivated the perfectly good node of the cluster. The principle works whenever you’re doing something that has the potential to impact your availability.  It also provides good cross-training experience with the less-experienced person driving and the more-experienced person looking ahead to the larger tasks.  Unlike pair programming, it really isn’t necessary to switch roles through the process.  Instead, consider it more like pilot and navigator with the navigator referencing checklists, procedures, and verifying selections and the pilot performing each action.
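
If you script any of your operational work, the same say / confirm / perform structure can be built into the script. The sketch below is a made-up illustration (two_person_action and its parameters are not from any real tool): the action only runs once a second party has confirmed both the action and the target.

```python
from typing import Callable


def two_person_action(description: str, target: str,
                      confirm: Callable[[str, str], bool],
                      perform: Callable[[str], None]) -> bool:
    """Run perform(target) only after the second person confirms it."""
    # 1. Say what you're going to do.
    print(f"About to {description} on {target}")
    # 2. The second person confirms it's the right thing on the right one.
    if not confirm(description, target):
        print("Not confirmed; no action taken.")
        return False
    # 3. Perform the action.
    perform(target)
    return True


# Example: the "navigator" expects node-b, so node-a is refused.
taken_offline = []
navigator = lambda action, node: node == "node-b"
two_person_action("take offline", "node-a", navigator, taken_offline.append)
two_person_action("take offline", "node-b", navigator, taken_offline.append)
print(taken_offline)
```

In practice the confirm step would be a person reading the announcement and answering, not a function; the sketch only shows where the pause belongs.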


Posted in Infrastructure | No Comments »

Introduction to Clustering

Written by Kendall Miller on February 27, 2008 – 12:58 am

Clustering takes a group of like devices (often servers, but it applies equally to appliances) together so they act, at least in some respects, like one device. Generally clusters are created to provide greater scalability at a lower price point or better availability (or both). To simplify matters, we’re going to restrict our discussion to clustering for network appliances (like firewalls) and common IT uses such as web servers, database servers, etc. In particular, we’re going to exclude grid computing (also known as compute clusters) and some other boundary cases. If you’re working in one of them, you’re probably not reading this introduction to clustering.

First a little lingo…

To make the discussion below easier, let’s introduce a few terms and define how they’ll be used in the rest of this article.

The general term for each computer or appliance that is a member of a cluster is a node. In general, each node is identical with respect to the service being clustered (e.g. if a web site is being clustered, all nodes have the same opinion of what that web site is).

The two main types of clustering are High-availability (HA) or failover clusters and Load-balancing clusters. In both cases more than one system can handle a given service, but they differ in whether multiple systems can be active at the same time (they can for load-balancing clusters, they can’t for high-availability clusters). Because this is the primary distinction, I prefer to use the terms failover and load-balancing because both provide high availability. In broad strokes, load balancing clusters are generally preferable to failover clusters because you get value all of the time for your investment in high availability (additional throughput) and there is generally little or no delay in moving resources from a system that fails.
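
To make the distinction concrete, here is a toy sketch of how requests land on nodes under each model. It is not how any particular product works: the function names are invented, and round-robin is just one assumed way a load balancer might spread requests.

```python
from itertools import cycle


def failover_pick(nodes, healthy):
    """Failover: only one node is active, the first healthy one in priority order."""
    for node in nodes:
        if node in healthy:
            return node
    return None


def load_balanced_picks(nodes, healthy, requests):
    """Load balancing: every healthy node is active; requests are spread round-robin."""
    active = [n for n in nodes if n in healthy]
    if not active:
        return []
    rotation = cycle(active)
    return [next(rotation) for _ in range(requests)]


nodes = ["a", "b"]
print(failover_pick(nodes, {"a", "b"}))           # one active node
print(load_balanced_picks(nodes, {"a", "b"}, 4))  # both nodes share the work
print(load_balanced_picks(nodes, {"b"}, 3))       # survivor takes everything
```

Both models keep serving when a node is lost; only the load-balanced one gives you extra throughput while everything is healthy.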

Failover Clusters

Failover clusters…

  • Provide high availability only. At best they do not improve performance; there may even be a slight drop in performance depending on how the clustering is done.
  • Often have a short delay in transitioning resources from one active node to another. Requests that come during that time can fail.
  • Often require each node in the cluster to be absolutely identical for reliable operation.

Common Examples

Failover clustering is your best bet for clustering resources that due to technology constraints can’t be done in a load balanced cluster. This is usually anything that rapidly writes data (like databases) or anything with tight network-level performance constraints (because of how TCP/IP works, it’s very hard to make very low level load balancing work). In most companies, the key reason they implement this is for their firewall and their database server.

  • Microsoft Cluster Service (MSCS): This is the built-in Windows method of creating failover clusters. It supports Microsoft SQL Server, Exchange Server, file shares, and a range of other systems out of the box. It generally uses shared storage (a SAN is highly recommended, but it can be done with direct attach storage or anything else where you can replicate the storage absolutely) to keep each node’s data synchronized. For more information, see Why You Should Use MSCS.
  • Firewalls and Hardware Load Balancers: Most network-layer devices use this for high availability, such as firewalls from companies like Watchguard and Cisco and hardware load balancers from companies like Foundry and F5. Note that in this case we’re talking about the appliances themselves, even though they may be what performs load balancing for a cluster (see below).

Application Compatibility

Generally it is easier to ensure application compatibility with failover clustering than with load balancing because failover preserves the general characteristics of running without clustering: The application is only running in one place at a time, it has exclusive access to its storage, etc. For example, Microsoft Cluster Service (MSCS) can generally be used to cluster anything that’s a Windows service without the service being specifically designed for it. Validation is also generally simpler for custom applications because the outcome tends to be binary - either it works and fails back & forth correctly, or it fails pretty early in testing. Load balanced clusters conceptually have a much larger number of scenarios to test to exhaustively prove they work.

Load-balancing Clusters (aka server farms)

Load-balancing clusters:

  • Provide high availability and improve scalability. Each node is processing requests so you can process more requests at the same time.
  • Can be transparent or nearly so when a node fails.
  • Usually accommodate diverse nodes with different performance capabilities, software load, etc.

Common Examples

The most common load balanced cluster is a front-end web server. This is because of the natural tendency to separate state management (storage) from the web application (often into a database) removing the first, largest hurdle to load balancing. Additionally, web applications are often developed very quickly using technologies that are not optimized for performance. This tends to make them processor & memory intensive under load which can be very cost-effectively addressed with hardware instead of custom development.

  • Microsoft Windows Network Load Balancing (NLB): This performs basic load-balancing, typically for web servers but it can be used for other systems in certain cases. There are significant limitations in network scalability and management tools. The network scalability limitations depend highly on how sophisticated your network switching hardware is.
  • Load Balancing Appliance: F5 Networks BIG-IP appliances have long been considered the gold standard in hardware load balancing, but are difficult to spec and administer unless you’re used to old-school UNIX administration. They are also very expensive when all you need is web site load balancing. There is a range of options that generally fall into two price classes: vendors that believe they can accomplish anything for anyone (like Cisco, F5 Networks, etc.) and vendors focused just on web server requirements, whose products generally cost substantially less and are easier to configure. If you don’t have experience with the particular hardware appliance you’ve selected, you should get some expert assistance to select and set up your solution. Be sure to get sufficient knowledge transfer to perform routine support on your own.

Application Compatibility

Ideally, each application you want to cluster will have a section in its documentation describing its compatibility with load balanced clustering. It is typical to have slight configuration changes for clustering. For example, a clustered web application may need to be configured to store state within a database instead of the normal in-memory storage. If no such information is available, some basic validation can be done to see if it’s worth even attempting. If the application looks like it can be plausibly clustered, then a plan for carefully validating the clustering should be developed and carried out before it is put into production.

Testing Clusters

The Wire Never Lies

First, if you are not using an absolutely off-the-rack clustering scenario, you will need to get ready to inspect network traffic. While Microsoft has included a free tool to do so with Windows, I highly recommend Wireshark (formerly Ethereal) as the gold standard. It’s been said that “the wire never lies”, meaning that the physical network represents the real truth of what’s going on. Any senior server administrator should be able to do a network trace and understand what is communicating and why from the perspective of each server. The reason this is particularly important with clustering is that it will give you absolute proof of where traffic is going between each layer of your infrastructure, and can reveal surprises such as redirects you didn’t believe were happening. Web browsers, particularly IE, are designed for end users, so they tend to hide the true underlying network details or simplify what’s going on. Don’t trust what they present when validating a cluster or diagnosing an issue. Trust the actual packets on the wire. For more on how to do this, see The Wire Never Lies.

Failover Clusters

The big test whenever changing the configuration of your cluster is that it can successfully failover, work, and fail back. You want to be sure this works on command so that it’s ready to take over when called upon due to a real problem. It’s not good to discover that your redundant node won’t run the software correctly, automatically, when you have a failure in the active node.
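
The failover / work / fail back cycle can be written down as a repeatable check. The sketch below is only an outline under assumptions: move_to and service_ok stand in for whatever your clustering system and your own service test actually provide, and neither is a real API.

```python
def verify_failover_cycle(move_to, service_ok, primary, standby):
    """Fail over to the standby, check the service, fail back, check again.

    move_to(node) and service_ok(node) are hypothetical callables supplied
    by you: the first asks the cluster to make a node active, the second
    runs your own test that the service really works on that node.
    """
    results = {}
    move_to(standby)                           # fail over on command
    results["failover"] = service_ok(standby)  # does the service work there?
    move_to(primary)                           # fail back
    results["failback"] = service_ok(primary)  # and does it still work here?
    return results


# Example with stand-ins: the standby turns out not to run the service.
moves = []
print(verify_failover_cycle(moves.append, lambda node: node != "node-b",
                            "node-a", "node-b"))
```

Running the same check after every configuration change is what tells you the redundant node is ready before you need it.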

Network Test Points

Because clustering will tend to play some interesting tricks at the physical network layer, you should test your clustering installation from at least two places: on the same routed network segment as the clustered IP address and on another segment. It’s also useful to test on the same physical switch and a different switch. The reason for this is you want to know how quickly the transition will be considered effective by clients on the network, and this will vary depending on exactly how the clustering is done. For example, if the IP address is transferred but the MAC address isn’t, it can take a while before clients on the same network segment (that may have the old MAC address cached) will drop their cache and ARP again for the new address. In the case of Windows NLB, it needs a switch that correctly supports IGMP. If the switch doesn’t, what will tend to happen is that you will get alternating failures and successes as the switch incorrectly sends traffic to just one NLB node. This is just an example, but it highlights that you want to think about how your traffic travels from the client to the server and which devices along the way have to know about the clustered node. Typically this is limited to routers & switches on the same routed segment.
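
One way to measure how long the transition takes from a given test point is to poll the clustered address while you trigger a failover and record the longest stretch of failed attempts. The sketch below assumes the clustered service accepts TCP connections; tcp_probe and longest_outage are illustrative names, and the address and port in the usage comment are placeholders, not values from this article.

```python
import socket
import time


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def longest_outage(probe, attempts: int, interval: float = 0.5,
                   clock=time.monotonic, sleep=time.sleep) -> float:
    """Poll probe() and return the longest observed run of failures, in seconds.

    An outage is timed from the first failed probe to the next successful
    one (or to the end of the run if the service never comes back).
    """
    longest = 0.0
    down_since = None
    for _ in range(attempts):
        now = clock()
        up = probe()
        if not up and down_since is None:
            down_since = now
        elif up and down_since is not None:
            longest = max(longest, now - down_since)
            down_since = None
        sleep(interval)
    if down_since is not None:
        longest = max(longest, clock() - down_since)
    return longest


# Usage (placeholder address): run this from each test point during a failover.
# print(longest_outage(lambda: tcp_probe("192.0.2.10", 80), attempts=120))
```

Comparing the number you get on the same segment with the number from another segment shows how much of the delay comes from the cluster itself and how much from caches in between. The measurement is only as fine as the polling interval, so treat it as an estimate.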

How has clustering benefited you?

What types of clustering do you use? Has it made a material difference in your reliability? Post your comments or drop me a line to continue the conversation.


Posted in Clustering | No Comments »