Archive for the ‘Infrastructure’ Category
Reliability is a Mindset
Written by Kendall Miller on March 6, 2008 – 12:44 am
Last week I was attending a training course on sales from a company I really respect – EntreQuest. One of the things I love about their courses and consulting is they aren’t shy about getting right to the fundamental (and often fundamentally hard) human basis for problems. One of the things they emphasize is that results are driven by process (including technology) which is in turn driven by mindset. If you don’t have the right mindset, you won’t achieve the results regardless of how much technology you throw at it. This is the basic justification for why the success rates of telemarketing (and other sales efforts that are all process, no mindset) are so low.
What was interesting to me in particular about this was how well it relates to conversations I typically have about reliability. Depending on where someone is in their experience curve they may talk about a particular technology, software development practice, or problem they’ve had. If they are really experienced, they go directly to either processes or culture. The very best tend to just talk about culture and mindset. This is bad news
In engineering the terms vary slightly, but I believe the principles are still completely valid: results are driven by technology (which includes the processes, software, and hardware), and technology is driven by mindset. When a mindset is held by a company, it’s called the culture. Your culture will exert a constant pressure on your technology like the current in a river: either it will reinforce your goals or work against them.
You can make short term or localized improvements by focusing on just the results or technology, but to make a lasting change you need to be moving with the current.
Establish a Reliability Culture
Within your department or company (whatever scope you can influence), make reliability a fundamental aspect of who you are and how you solve problems. If you instill a mindset behind every discussion that your solutions will scale to a certain size, be continuously available, or other aspects of reliability, your technology choices will be imbued with this stance:
- Your development process will be designed to reduce or eliminate reliability risks. When your business partners ask for a change at the last minute, you won’t have to explain that all changes are risky.
- You won’t talk yourself into short-cutting testing. Instead, you’ll structure your development process to drive testing automation to reduce the cost of testing (allow you to run full tests more often) and ensure consistency.
- Your developers will naturally avoid low-reliability personal practices like being possessive about code, not commenting, incomplete or inconsistent error handling, and poor configuration management strategies.
- Your deployment environment will have appropriate hardware and software. You will be able to get proper monitoring tools and use hardware with sufficient redundancy and performance.
- Your business partners will be more receptive to conversations about schedules, knowing that under pressure they have to give on functionality instead of reliability.
As reliability becomes a core element of your culture, each individual will start to see the thousands of little decisions they make each day differently and unconsciously approach them from a perspective of reliability, as if they asked “what is the most reliable way to accomplish (whatever I’m doing now)”. At its best, it will shift things that happen as conflicts between people into corporate discussions – instead of your business partners feeling they have to convince you personally to add a new feature (viewing you as the roadblock) it will become how do we accommodate a business need within the context of our corporate goals for reliability. It is significantly easier to create a partnership in this scenario that has you understand their goals and them understand yours because you have a shared value and commitment to work within.
Reliability won’t always win out
Even in a reliability culture, there are very sound reasons to do certain things that entail risk. No one element of your culture is absolute, but it must always be respected and considered. For example, it could be that the system in question is an internal system that has a limited ROI. In this case it just isn’t appropriate to invest a great deal in reliability at the expense of ROI unless the system is unusable without it. Alternately, it is often appropriate during startup phases when the downside cost of a reliability problem is low (e.g. there are no or few existing users or no performance guarantee) or the mitigation cost is excessive (e.g. geodiverse hot sites).
Having reliability as a fundamental part of your mindset is still helpful in these situations because it ensures that a decision that impacts reliability is deliberately made and openly understood. As a company, you have to choose your battles and what risks you are going to mitigate. In some cases, it’s best to just run the risk and wait to see if it manifests before pouring energy into fixing it. Alternately, the risk may be scalability – if you are wildly successful, you’ll have to change your software to handle it. This is often called Technology Debt.
Taking on technology debt is often necessary when starting a product or venturing in new territory. The key is that the business and technology parties know that it’s a deliberate decision to take on that debt, instead of it being a quiet decision made just within the development team. That way if the risks turn out to become reality, the business doesn’t burn time arguing about how you got where you are and instead recognizes that it was a deliberate and well considered decision that now has a consequence that must be handled.
Reliability isn’t always Suitable
Not every company should have reliability as the defining element of its culture. It isn’t necessarily that these companies don’t want reliable results, it’s more that reliability isn’t their differentiator or important enough to be a core element of the culture. For example: Compare the Linksys and Cisco brands. Both can sell you a Wireless-G access point that on the mainline specifications are comparable: They support the same primary standards, offer comparable throughput and security features (for most people), and to many customers they would be indistinguishable. However, Linksys tends to produce a model, make a few essential firmware updates and move on. If the unit needs to be reset periodically or a new device shows up on the scene that causes a problem with it that’s potentially OK. Customers that pay $70 instead of $700 for a wireless access point aren’t expecting the same degree of reliability. If Linksys attempted to do all of the reliability testing that Cisco does, they wouldn’t be able to hit the price points or time to market that drives their brand. Their product must be reliable enough that customers will find it suitable for the target market, but it isn’t necessary to pursue ultimate reliability.
Take a hard look around you. What level of reliability is appropriate for each area of your business? What are the reliability goals of the company? What is the prevailing culture? If you find yourself out of sync with your company’s goals on reliability, it could be that it isn’t the company that needs to shift but rather you may need to explore other options.
Change Begins at Home
The next time you’re frustrated by the results your team is achieving, don’t leap on the technology bandwagon first. Back up and look at how you might incorporate a reliability mindset into your own work as a starting point for catalyzing broader change. Have a series of conversations in your team to ensure you establish a common understanding of what your principles are – not just with reliability but other guiding principles as well. From that it will become easier to know what technologies (software, hardware, processes) will support the results you want to achieve. Start with your team and move out through your company; the results can speak for themselves.
Tags: Mindset, Reliability, Technology Debt
Posted in Infrastructure, Software Development
Two Person Rule
Written by Kendall Miller on March 3, 2008 – 10:46 pm
Whenever working on the components of a high reliability system, remember that the biggest single cause of availability problems is people – generally through clicking the wrong thing, typing the wrong instruction, or not seeing the consequences of an action. A good procedure to minimize the risk of unintended harm while working on an important system (whether it’s clustered or not) is to have two people involved in the physical work. It’s the IT Operations equivalent of pair programming. For example, if you are taking a cluster node offline you want to be sure you take the right one offline. Even in aviation, where there are good procedures to avoid mistakes like this, it still happens and can cost lives. Your situation isn’t as dire, but the principle remains the same: when performing operations that can directly impair your availability, use an obvious two person structure to make sure you do the right thing:
- Say what you’re going to do.
- Have the second person confirm that it’s the right thing and you’re on the right one.
- Perform the action.
It may feel pedantic, but it will keep you focused on what you’re doing and ensure you don’t have to explain why you deactivated the perfectly good node of the cluster. The principle works whenever you’re doing something that has the potential to impact your availability. It also provides good cross-training experience with the less-experienced person driving and the more-experienced person looking ahead to the larger tasks. Unlike pair programming, it really isn’t necessary to switch roles through the process. Instead, consider it more like pilot and navigator with the navigator referencing checklists, procedures, and verifying selections and the pilot performing each action.
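The three steps can also be built into operational scripts. Here is a minimal sketch, assuming the second person's confirmation arrives through a callback; `two_person_action`, `confirm`, and the node name are hypothetical names made up for illustration, not part of any real tool.

```python
from typing import Callable


def two_person_action(description: str, target: str,
                      confirm: Callable[[str], bool],
                      action: Callable[[], str]) -> str:
    """Run `action` only after a second person confirms the announced intent."""
    # Step 1: say what you're going to do.
    announcement = f"About to {description} on {target}"
    # Step 2: the second person confirms it's the right thing on the right target.
    if not confirm(announcement):
        return f"ABORTED: {announcement}"
    # Step 3: perform the action.
    return action()


# Hypothetical usage: the navigator's answer is simulated with a lambda here;
# in practice it would come from the second operator.
result = two_person_action(
    "take the node offline", "node-b",
    confirm=lambda announcement: announcement.endswith("node-b"),
    action=lambda: "node-b taken offline",
)
print(result)
```

The point of the structure is that the action cannot run unless the announcement has been seen and accepted by someone other than the person performing it.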
Tags: two person rule
Posted in Infrastructure
Introduction to Clustering
Written by Kendall Miller on February 27, 2008 – 12:58 am
Clustering brings a group of like devices (often servers, but it applies equally to appliances) together so they act, at least in some respects, like one device. Generally clusters are created to provide greater scalability at a lower price point or better availability (or both). To simplify matters, we’re going to restrict our discussion to clustering for network appliances (like firewalls) and common IT uses such as web servers, database servers, etc. In particular, we’re going to exclude grid computing (also known as compute clusters) and some other boundary cases. If you’re working in one of them, you’re probably not reading this introduction to clustering.
First a little lingo…
To make the discussion below easier, let’s introduce a few terms and define how they’ll be used in the rest of this article.
The general term for each computer or appliance that is a member of a cluster is a node. In general, each node is identical with respect to the service being clustered (e.g. if a web site is being clustered, all nodes have the same opinion of what that web site is).
The two main types of clustering are High-availability (HA) or failover clusters and Load-balancing clusters. In both cases more than one system can handle a given service, but they differ in whether multiple systems can be active at the same time (they can for load-balancing clusters, they can’t for high-availability clusters). Because this is the primary distinction, I prefer to use the terms failover and load-balancing because both provide high availability. In broad strokes, load balancing clusters are generally preferable to failover clusters because you get value all of the time for your investment in high availability (additional throughput) and there is generally little or no delay in moving resources from a system that fails.
Failover Clusters
Failover clusters…
- Provide high availability only. At best they do not improve performance; there may even be a slight drop in performance depending on how the clustering is done.
- Often have a short delay in transitioning resources from one active node to another. Requests that come during that time can fail.
- Often require each node in the cluster to be absolutely identical for reliable operation.
Common Examples
Failover clustering is your best bet for clustering resources that due to technology constraints can’t be done in a load balanced cluster. This is usually anything that rapidly writes data (like databases) or anything with tight network-level performance constraints (because of how TCP/IP works, it’s very hard to make very low level load balancing work). In most companies, the key reason they implement this is for their firewall and their database server.
- Microsoft Cluster Service (MSCS): This is the built-in Windows method of creating failover clusters. It supports Microsoft SQL Server, Exchange Server, file shares, and a range of other systems out of the box. It generally uses shared storage (a SAN is highly recommended, but it can be done with direct attach storage or anything else where you can replicate the storage absolutely) to keep each node data synchronized. For more information, see Why You Should Use MSCS.
- Firewalls and Hardware Load Balancers: Most network-layer devices use this for high availability, such as firewalls from companies like Watchguard and Cisco and hardware load balancers from companies like Foundry and F5. Note that in this case we’re talking about the appliances themselves, even though they may be what performs load balancing for a cluster (see below).
Application Compatibility
Application compatibility is generally easier to ensure with failover clustering than with load balancing because it preserves the general characteristics of running without clustering: the application is only running in one place at a time, it has exclusive access to its storage, etc. For example, Microsoft Cluster Service (MSCS) can generally be used to cluster anything that’s a Windows service without the service being specifically designed for it. Validation is also generally simpler for custom applications because the outcome will tend to be binary – either it works and fails back & forth correctly, or it will fail pretty early in testing. Load balanced clusters conceptually have a much larger number of scenarios to test to exhaustively prove they work.
Load-balancing Clusters (aka server farms)
Load-balancing clusters:
- Provide high availability and improve scalability. Each node is processing requests so you can process more requests at the same time.
- Can be transparent or nearly so when a node fails.
- Usually accommodate diverse nodes with different performance capabilities, software load, etc.
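To make the distinction between the two cluster types concrete, here is a minimal sketch of how each might choose a node for the next request. The function names, node names, and health map are assumptions for illustration only; real clustering products make this decision with heartbeats and far more state.

```python
from itertools import cycle, islice
from typing import Dict, Iterator, List, Optional


def failover_pick(nodes: List[str], healthy: Dict[str, bool]) -> Optional[str]:
    """Failover: one node is active at a time; use the first healthy node in priority order."""
    for node in nodes:
        if healthy.get(node, False):
            return node
    return None


def round_robin(nodes: List[str], healthy: Dict[str, bool]) -> Iterator[str]:
    """Load balancing: every healthy node is active; rotate requests across them.

    Health is sampled once when the iterator is created, which keeps the sketch short.
    """
    return cycle([node for node in nodes if healthy.get(node, False)])


nodes = ["web1", "web2", "web3"]
healthy = {"web1": True, "web2": False, "web3": True}

print(failover_pick(nodes, healthy))                 # web1
print(list(islice(round_robin(nodes, healthy), 4)))  # ['web1', 'web3', 'web1', 'web3']
```

In the failover case the standby nodes do no work until the active one fails; in the load-balanced case the surviving nodes simply keep sharing the requests.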
Common Examples
The most common load balanced cluster is a front-end web server. This is because of the natural tendency to separate state management (storage) from the web application (often into a database) removing the first, largest hurdle to load balancing. Additionally, web applications are often developed very quickly using technologies that are not optimized for performance. This tends to make them processor & memory intensive under load which can be very cost-effectively addressed with hardware instead of custom development.
- Microsoft Windows Network Load Balancing (NLB): This performs basic load-balancing, typically for web servers but it can be used for other systems in certain cases. There are significant limitations in network scalability and management tools. The network scalability limitations depend highly on how sophisticated your network switching hardware is.
- Load Balancing Appliance: F5 Networks BIG-IP appliances have long been considered the gold standard in hardware load balancing, but they are difficult to spec and administer unless you’re used to old-school UNIX administration. They are also very expensive when all you need is web site load balancing. There is a range of options that generally fall into two price classes based on whether the vendor believes they can accomplish anything for anyone (like Cisco, F5 Networks, etc.) or is just focused on web server requirements; the latter generally cost substantially less and are easier to configure. If you don’t have experience with the particular hardware appliance you’ve selected, you should get some expert assistance to select and set up your solution. Be sure to get sufficient knowledge transfer to perform routine support on your own.
Application Compatibility
Ideally, the documentation for each application you want to cluster will have a section describing its compatibility with load balanced clustering. It is typical to have slight configuration changes for clustering. For example, a clustered web application may need to be configured to store state within a database instead of the normal in-memory storage. If no such information is available, some basic validation can be done to see if it’s worth even attempting. If the application looks like it can plausibly be clustered, then a plan for carefully validating the clustering should be drawn up and carried out before it is put into production.
Testing Clusters
The Wire Never Lies
First, if you are not using an absolutely off-the-rack clustering scenario, you will need to get ready to inspect network traffic. While Microsoft has included a free tool to do so with Windows, I highly recommend Wireshark (formerly Ethereal) as the gold standard. It’s been said that “the wire never lies”, meaning that the physical network represents the real truth of what’s going on. Any senior server administrator should be able to do a network trace and understand what is communicating and why from the perspective of each server. The reason this is particularly important with clustering is that it will give you absolute proof of where traffic is going between each layer of your infrastructure, and can reveal unexpected surprises such as redirects you didn’t believe were happening. Web browsers, particularly IE, are designed for end users, so they tend to hide the true underlying network details or simplify what’s going on. Don’t trust what they present when validating a cluster or diagnosing an issue. Trust the actual packets on the wire. For more on how to do this, see The Wire Never Lies.
Failover Clusters
The big test whenever changing the configuration of your cluster is that it can successfully failover, work, and fail back. You want to be sure this works on command so that it’s ready to take over when called upon due to a real problem. It’s not good to discover that your redundant node won’t run the software correctly, automatically, when you have a failure in the active node.
Network Test Points
Because clustering will tend to play some interesting tricks at the physical network layer, you should test your clustering installation from at least two places: on the same routed network segment as the clustered IP address and on another segment. It’s also useful to test on the same physical switch and a different switch. The reason for this is you want to know how quickly the transition will be considered effective by clients on the network, and this will vary depending on exactly how the clustering is done. For example, if the IP address is transferred but the MAC address isn’t, it can take a while before clients on the same network segment (that may have the MAC address cached) will drop their cache and ARP again for the new address. Windows NLB, as another example, requires a switch that correctly supports IGMP. If the switch doesn’t, what will tend to happen is that you will get alternating failures and successes as the switch incorrectly sends traffic to just one NLB node. These are just examples, but they highlight that you want to think about how your traffic travels from the client to the server and which devices along the way have to understand the clustered node. Typically this is limited to routers & switches on the same routed segment.
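One way to compare test points is to probe the clustered address at a fixed interval from each location during a failover and summarize the results afterwards. The sketch below assumes you already have the probe results as (time, succeeded) pairs; the function names and sample data are hypothetical.

```python
from typing import List, Optional, Tuple

Probe = Tuple[float, bool]  # (seconds since the test started, request succeeded?)


def outage_seconds(probes: List[Probe]) -> Optional[float]:
    """Time from the first failed probe to the next successful one.

    Returns None if nothing failed or the service never recovered.
    """
    first_fail = None
    for ts, ok in sorted(probes):
        if not ok and first_fail is None:
            first_fail = ts
        elif ok and first_fail is not None:
            return ts - first_fail
    return None


def failure_runs(probes: List[Probe]) -> int:
    """Number of separate runs of failures: one suggests a clean transition,
    several suggest the alternating success/failure pattern."""
    runs, previous_ok = 0, True
    for _, ok in sorted(probes):
        if not ok and previous_ok:
            runs += 1
        previous_ok = ok
    return runs


# Invented sample data for two vantage points, one probe per second.
same_segment = [(0.0, True), (1.0, False), (2.0, False), (3.0, False), (4.0, True)]
other_segment = [(0.0, True), (1.0, False), (2.0, True)]
alternating = [(0.0, True), (1.0, False), (2.0, True), (3.0, False), (4.0, True)]

print(outage_seconds(same_segment), outage_seconds(other_segment))  # 3.0 1.0
print(failure_runs(same_segment), failure_runs(alternating))        # 1 2
```

A longer outage on the same segment than on a remote one would be consistent with the cached MAC address situation described above, while multiple failure runs point toward something like the switch problem.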
How has clustering benefited you?
What types of clustering do you use? Has it made a material difference in your reliability? Post your comments or drop me a line to continue the conversation.
Tags: Clustering, failover, NLB, Wireshark
Posted in Clustering
The Wire Never Lies
Written by Kendall Miller on February 25, 2008 – 12:59 am
You need to find and resolve a problem with your web or multi-tier application, and you need to do it quickly. It may be happening in production or in a place where you can’t easily set up a test environment or get a traditional debugger involved. Here’s an approach that will help you narrow down and in many cases resolve the issue. The best part is that in most cases it won’t require specialized knowledge of the language the application is written in.
Don’t be afraid to pick up a packet sniffer and look at the actual Ethernet packets running back and forth between the parts of your system. You’ll probably find the issue much more quickly than you think, and you can do this with an application in production without the original source code, at least enough to know what your options are. The wire never lies – it tells you exactly what your application is really doing over the network.
For the purposes of this article, consider a basic web application. It most likely has a set of code that runs on the web server (which could be in any language) and then talks with a back-end database, probably located on a different system if this is a large web application. Now take two common categories of problems: A performance issue and an occasional web site error.
Our basic approach is consistently:
- Find the layer of the architecture where the problem is being introduced by tracing the network.
- Dissect what is happening in that layer down to the process that is introducing the problem.
- Review the implementation of just the affected commands in the suspect process to resolve the issue.
Our first goal is to narrow down what layer of the system is the most likely culprit – the web application or the database. When doing this, I’ve found that it pays to quickly pull out a tool that will tell me what’s going on across the network. This is where the wire never lies comes from: If you use a packet sniffer or some other tool to see what’s happening “on the wire”, you will know exactly what is going on between your network layers. Not what you think should happen or want to happen – what is actually happening. This is so important because we develop in a world with many layers of abstraction between what we write and the physical I/O commands that ultimately carry out our wishes.
Let’s start with an example of a performance problem: a user reports that a detail page in your web application takes several seconds to display, and they believe it is getting slower over time.
Find the problem layer
In our example, we have several possibilities: The user establishes a connection from their web browser to the web server which in turn makes database calls to the database server. If clustering is involved it is somewhat more complicated because with a cluster it likely goes web browser to load balancing appliance to web server to database. Regardless, our first goal is to narrow down in which layer of the architecture the problem is being introduced.
In the case of a performance problem, the layer that introduces the problem is the first layer that is taking up the majority of the time and not waiting on another layer. The quickest way to resolve this is to do some strategic network sniffing at key points in your infrastructure to watch the request be processed. This may not seem quick, but with practice it becomes very natural.
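As a rough sketch of that rule, suppose you have read two numbers off the trace for each layer: the total time the request spent there and how much of that was spent waiting on the next layer. The layer names and figures below are invented for illustration, and picking the largest "own time" is a simplification of the judgment you would apply when reading a real trace.

```python
from typing import Dict, Tuple


def problem_layer(timings: Dict[str, Tuple[float, float]]) -> str:
    """timings maps layer -> (total seconds in layer, seconds waiting on the next layer).

    The suspect is the layer with the most time that is its own,
    i.e. not explained by waiting on a layer further upstream.
    """
    own_time = {layer: total - waiting for layer, (total, waiting) in timings.items()}
    return max(own_time, key=lambda layer: own_time[layer])


# Invented measurements for one slow page request.
observed = {
    "browser to web server link": (4.2, 4.0),
    "web server": (4.0, 3.5),
    "database": (3.5, 0.0),
}
print(problem_layer(observed))  # database
```

Here the web server looks slow from the browser's point of view, but almost all of its time is spent waiting on the database, so the database is where to dig next.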
A good place to start is on the web server. In many cases sniffing the traffic at the web server alone is sufficient to find the entire problem because it sees the traffic to & from the web browser and upstream to the database server. You can use a variety of tools to do this, but I like Wireshark. It’s free, fast, and very capable. Microsoft also ships a basic network monitor, but it doesn’t have some of the neat-o features Wireshark has that make analysis quick. Until recently, Wireshark was called “Ethereal” but that name had to be changed due to trademark problems.
What we’re looking for is to compare the traffic to & from the web browser and what’s traveling off of the web server. We want to compare timings and volumes to understand what happens between when the web requests starts and when it completes. Do a complete packet capture of one problem web request, then get ready to spend some time understanding it.
The first thing you’ll likely notice is that there is a great deal more information here than you expected: even a simple HTTP GET request results in a lot of network traffic. If your site uses SSL, you’ll also discover that the traffic to and from your web browser is in fact encrypted – remember, we’re looking at what’s going on at Layer 2 of the network, so this is a good thing. If you’re using encryption within your own data center from the web server to the database server this is going to really get in your way (and you should ask yourself why you’re doing that as a general practice, but that’s another article). If your web site uses content compression the response will also look encrypted.
When analyzing a trace, do the following:
- Eliminate spurious client traffic: Filter out requests that aren’t from your test client. If they are part of the problem, it will generally still show up in calls the web server is making to the database or other systems, and you don’t need the volume.
- Narrow down the time window: You probably started the trace a few seconds before your hit, and ended it a few seconds after. Look for the first packet from your client’s IP address and eliminate everything before it. Likewise, look for the last packet to your client’s IP address and eliminate everything after.
- Look at timing: You want to survey the sequence of events to get a feel for what happened exactly in order. Your primary concern is going to be traffic you know could be related (such as to your SQL server) but don’t ignore authentication traffic, it can be a secret performance killer (time spent negotiating security between your web application and another server). Time spent on other servers will show as a quiet spot in the sequence – where a request has been sent off but the response hasn’t come back yet. Note that you need to be reading the timestamps to get a good feel for this; a lot of packets isn’t necessarily a bad thing – networks are very fast in general. If all the packets are happening in the space of 20 milliseconds, it isn’t your performance problem.
- Look at the volume: A quick way to get a feel for this is to use the ability for the packet sniffer to reassemble packets into a stream. This shows you the true conversation that is going on between the layers, and will show you how many bytes were moved to get it done. This is very helpful if you discover, for example, that you’re passing back very large recordsets you didn’t expect. Alternately, it could be that the data is simply inefficiently stored or packaged. For example, if a column in the database is configured for Unicode and the caller requests it in that format, it will take twice as much data across the wire to move it. XML data in the database can also get you in trouble by causing unexpectedly high volume.
- Look at the detail: If the problem isn’t apparent yet, look at the specific requests being made. For example, you may notice repeated requests that may indicate an error/retry cycle in the application.
It’s worth pointing out that a network volume problem like the one you would find in step four above will not generally show up if you’re looking at the network interface statistics in your monitoring system because it will only last a few seconds. It can still be the culprit.
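The first four steps are mechanical enough to sketch in code. The example below assumes the capture has already been exported to simple records of timestamp, source, destination, and length; the `Packet` type, function names, and IP addresses are all invented for illustration and are not the format of any particular sniffer.

```python
from collections import Counter
from typing import List, NamedTuple, Set, Tuple


class Packet(NamedTuple):
    ts: float    # capture timestamp in seconds
    src: str     # source IP address
    dst: str     # destination IP address
    length: int  # bytes on the wire


def relevant(packets: List[Packet], client: str, web: str, backends: Set[str]) -> List[Packet]:
    """Steps 1 and 2: keep the test client's traffic plus the web server's upstream
    calls, trimmed to the window between the client's first and last packets."""
    keep = [p for p in sorted(packets)
            if {p.src, p.dst} == {client, web}
            or (web in (p.src, p.dst) and (p.src in backends or p.dst in backends))]
    from_client = [p.ts for p in keep if p.src == client]
    to_client = [p.ts for p in keep if p.dst == client]
    if not from_client or not to_client:
        return []
    start, end = min(from_client), max(to_client)
    return [p for p in keep if start <= p.ts <= end]


def quiet_spots(packets: List[Packet], top: int = 3) -> List[Tuple[float, Packet, Packet]]:
    """Step 3: the longest gaps between consecutive packets, longest first."""
    gaps = [(round(b.ts - a.ts, 6), a, b) for a, b in zip(packets, packets[1:])]
    return sorted(gaps, key=lambda gap: gap[0], reverse=True)[:top]


def volume(packets: List[Packet]) -> Counter:
    """Step 4: total bytes per (source, destination) direction."""
    totals: Counter = Counter()
    for p in packets:
        totals[(p.src, p.dst)] += p.length
    return totals


CLIENT, WEB, DB, OTHER = "10.0.0.5", "10.0.1.10", "10.0.2.20", "10.0.0.9"
trace = [
    Packet(0.000, OTHER, WEB, 300),     # another user's request
    Packet(0.100, CLIENT, WEB, 400),    # our test request
    Packet(0.105, WEB, DB, 200),        # query sent to the database
    Packet(2.905, DB, WEB, 90000),      # large result set comes back
    Packet(2.950, WEB, CLIENT, 5000),   # page returned to our client
    Packet(3.100, WEB, OTHER, 1200),
]
kept = relevant(trace, CLIENT, WEB, {DB})
gap, before, after = quiet_spots(kept, top=1)[0]
print(len(kept), gap, before.dst, volume(kept)[(DB, WEB)])
```

In this invented trace the longest quiet spot sits between the query going to the database and its answer coming back, and the answer is also by far the largest transfer, which is exactly the kind of pattern steps three and four are meant to surface.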
Alternatives to Packet Sniffing
Experience with a packet sniffer is handy because it always works, regardless of the application’s specific technology. Unfortunately, like any generic tool that also means it can’t take advantage of a lot of domain knowledge. If you have a good reason to suspect you know what layer the problem is in or don’t feel comfortable jumping down to the wire right away, you can take advantage of a few other tools in specific circumstances:
Web Server Traffic Logs
If you have the optional extra information being captured into the web server logs beyond the original NCSA spec it should include the time it takes to transfer the data to the client and the number of bytes transferred. This should be enough to either validate or exclude the link between the web browser and the server. You’re looking in the logs for just a few things:
- Repeated Request Patterns: Web browsers try really hard to not fail. They will automatically and quietly respond to redirect requests and some other HTTP status codes and attempt to authenticate before throwing in the towel. This will show up in the logs as a pattern of hits in rapid succession from the same client IP address. You may have a situation where a client is being sent through several redirects, or is getting a retryable HTTP error on the first hit.
- Response Time: Look at the number of bytes transferred and the time to transfer to the client as well as the total request time. Compare the time to transfer with the total request time to exclude the client to web server link as being the performance problem.
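Spotting repeated request patterns by eye gets tedious in a busy log. Here is a small sketch of the idea, assuming the log lines have already been parsed into records; the `Hit` fields, thresholds, and sample entries are assumptions for illustration, not a real log format.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    ts: float    # request time in seconds
    client: str  # client IP address
    status: int  # HTTP status code returned
    url: str     # requested path


def bursts(hits: List[Hit], window: float = 1.0, min_hits: int = 3) -> Dict[str, List[List[Hit]]]:
    """Group each client's hits into runs where successive requests are at most
    `window` seconds apart, keeping runs of at least `min_hits` requests."""
    by_client: Dict[str, List[Hit]] = defaultdict(list)
    for hit in sorted(hits):
        by_client[hit.client].append(hit)
    found: Dict[str, List[List[Hit]]] = {}
    for client, seq in by_client.items():
        runs, run = [], [seq[0]]
        for prev, cur in zip(seq, seq[1:]):
            if cur.ts - prev.ts <= window:
                run.append(cur)
            else:
                if len(run) >= min_hits:
                    runs.append(run)
                run = [cur]
        if len(run) >= min_hits:
            runs.append(run)
        if runs:
            found[client] = runs
    return found


log = [
    Hit(10.0, "10.0.0.5", 302, "/detail"),
    Hit(10.1, "10.0.0.5", 302, "/login"),
    Hit(10.2, "10.0.0.5", 401, "/detail"),
    Hit(10.3, "10.0.0.5", 200, "/detail"),
    Hit(10.0, "10.0.0.9", 200, "/home"),
    Hit(15.0, "10.0.0.9", 200, "/home"),
]
suspicious = bursts(log)
print({client: [[h.status for h in run] for run in runs] for client, runs in suspicious.items()})
```

For the invented log above, only the first client is reported, with the status sequence 302, 302, 401, 200: two redirects and an authentication challenge before the page the user actually asked for.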
SQL Server Profiler
An alternate approach if your application is a heavy SQL application is to use the SQL Profiler to get a nicer view of what is happening at the SQL level. This is worth it if you have high confidence that the problem is going to be in evidence by inspecting the SQL commands executed by your software. If you aren’t sure, start with a network trace anyway because you can establish some degree of confidence quickly whether or not it’s a lower-level problem.
Side note for Developers
If you’re writing code that makes calls to the database, it’s worth it to run through your main use cases and use SQL Profiler to verify what is happening at the database level. I guarantee it’ll be an eye opener. In particular, watch for events that don’t necessarily cause your code to break but are signs things aren’t entirely right. For example:
- Excessive database connect/disconnect: You’d be surprised how expensive this can be. From a pure performance standpoint, you ideally want to see it reuse a pooled connection, make all of its calls for that request, and then be done. If you see a lot of pooled connection resets or, even worse, real database connect and disconnect events, this should be investigated.
- Database deadlocks: Many developers automatically retry database exceptions to handle the wide range of use cases where a temporary issue (such as a missing or unusable database connection) occurs. This can also generally recover from deadlocks, but deadlocks are a performance killer. You should investigate them every time.
- Unexpected calls: You should have a mental picture of what database calls are going to be made and how many rows should be returned from the queries (at least approximately). If you can optimize your code to reduce the number of calls, it’s most likely worth it. Each call will add linear time to your application which will tend to create performance issues. You can’t beat the speed of light.
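For the unexpected calls case, one simple check is to take the statements captured for a single request and count how often the same query shape repeats. This is a minimal sketch that assumes you have the statement text as plain strings; the normalisation is deliberately crude, and the function names and sample queries are invented.

```python
import re
from collections import Counter
from typing import Dict, List


def normalise(sql: str) -> str:
    """Reduce a statement to its shape by replacing literal values with '?'."""
    sql = re.sub(r"'[^']*'", "?", sql)   # string literals
    sql = re.sub(r"\b\d+\b", "?", sql)   # numeric literals
    return re.sub(r"\s+", " ", sql).strip().lower()


def repeated_calls(statements: List[str], threshold: int = 3) -> Dict[str, int]:
    """Query shapes that were executed at least `threshold` times."""
    counts = Counter(normalise(s) for s in statements)
    return {query: n for query, n in counts.items() if n >= threshold}


captured = [
    "SELECT Name FROM Customers WHERE Id = 7",
    "SELECT * FROM Orders WHERE CustomerId = 7",
    "SELECT * FROM Orders WHERE CustomerId = 8",
    "SELECT * FROM Orders  WHERE CustomerId = 9",
]
print(repeated_calls(captured))  # {'select * from orders where customerid = ?': 3}
```

A shape that repeats once per row of an earlier result is a common sign that a loop in the application is issuing calls that a single query could replace.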
Side note for DB Administrators
If configured correctly, the profiler has a very low load on the SQL Server. In the past, my IT team has gotten value out of periodically setting up a trace to run for a day in production looking for particularly problematic events like database deadlocks.
You have monitoring, right?
You can often eliminate a lot of possibilities if you have routine system metrics being captured. Check out our list of recommended metrics to capture. If you’ve done this, you can immediately rule out problems such as network interface saturation or high processor or memory utilization on the different servers and appliances in the system. If you’re having a performance problem and have one of these issues, you’ll want to start there. In the case of a network problem you will want to either do a network capture to see what is using the bandwidth or, if you have it, take advantage of sFlow or NetFlow to get to the same results without having to look at the packets.
Installing Wireshark
Wireshark works by installing a packet capture service and using it to intercept all of the traffic as it comes off the wire and goes onto it. On Windows it uses a separate service to do this – WinPcap. Because it installs as a system driver, you may legitimately be uncomfortable installing it on a live production system. In my past experience, we’ve often left Ethereal (Wireshark’s former name) installed locally on one node in each web cluster (and indeed in several cases just integrated it into our standard server installation) with no ill effects.
If you don’t want to run Wireshark & WinPcap on your production server, there is another option but it’s somewhat tricky to set up: Most commercial network switches support port mirroring that will allow you to configure one switch port to get all of the traffic received & sent by another port. You can use this then to set up an interface on your test system that is mirroring the network port used by the server you want to monitor. There are several downsides with this in my eyes: First, you’re changing the configuration of your production switches and possibly moving cables around (there are often restrictions on using port mirroring across switches, so depending on your physical hardware you will usually have to plug into a free port on the same switch as the server you want to monitor) which in many ways invites more human error potential than leaving another system driver installed. Second, if you forget to change the configuration back when you’re done and someone else plugs into that switch port they’re in for a surprise.
Credit where it’s due
I have to give Ingo Hammer credit for introducing the phrase the wire never lies in a presentation he gave at Tech Ed 2005. His presentation led me to have my team reorder our troubleshooting process to take what had been a late step and move it way up in the process – using a packet sniffer to see what was going on at the physical network level when troubleshooting system problems. Much of this article is an elaboration on what he talked about with my team’s experience added in. I can’t find a copy of his original presentation online, but if someone knows where it is posted (legally) let me know.
Tags: Ethereal, Packet sniffer, performance, SQL Server Profiler, wire never lies, Wireshark
Posted in Monitoring | No Comments »
How do you know? IT Monitoring for small & medium businesses.
Written by Kendall Miller on February 23, 2008 – 1:00 am
Sit back for a minute and ask yourself this question: How do you know?
- How do you know that your users are able to get to the services you provide, right now?
- How do you know that all of the hardware you’re responsible for is working, right now?
- How do you know what you should be working on right now? Project work, or event-driven work? (Substitute in trouble ticket, help desk ticket, whatever for event-driven work)
The last question is in the family of questions centered on how to balance workload, which is another article. The first two questions are ones that you should be able to answer and, just as important, they are hard to explain after the fact if you can’t. The key to these questions is being in the know – having mechanisms to ensure you know what’s working without requiring your active involvement. You need to have a comprehensive monitoring strategy, and most likely you can’t spend a great deal to get it done. The good news is that at a moderate scale (say up to 200 monitored devices) you shouldn’t have to.
When deciding how to monitor, we start with the questions we want to be sure we can answer. You want to go through this exercise to avoid being swept up in cool visuals, dashboards, and other golly gee whiz stuff that most smart monitoring vendors put in their systems. It’s not that these things are bad – far from it – but they don’t change the fact that you need to be sure your monitoring answers the essential questions. It is very unlikely that you’ll find a tool that answers all of the questions you have. This isn’t inherently a problem, but you are going to want to minimize the number of tools you have to work with because each has an operational cost.
The Essential Questions
In order, you want to be sure you know:
- Is everything working right now, according to the users? There is a distance between knowing that a server is running and knowing that your users can access the services hosted on the server.
- Is anything about to go wrong that will cause an interruption in service? It could be a server about to run out of disk space, a non-redundant drive that’s reporting soft errors, etc.
- Are we using our resources effectively? You can’t count on users to report occasional glitches or when things are just slow. Can you balance resources or shift load to provide better performance with what you already have?
Ideally, you want to set up a system (which may be a collection of different pieces of software, all working together) to make sure you know the answers to these three questions without relying on the active participation of you or your team. To answer these questions you’ll need a combination of event monitoring (for the first question and part of the second) and metrics for part of the second and the third.
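As a small illustration of the second question, here is a rough sketch of spotting a disk that is about to fill up. It assumes you already record one free-space sample per day; the function names and the straight-line projection are my own simplifications, not part of any particular monitoring product.

```python
import shutil


def free_bytes(path):
    """Current free space, in bytes, on the volume that holds path."""
    return shutil.disk_usage(path).free


def days_until_full(daily_free):
    """Straight-line estimate of days left, from one free-space sample per day.

    Returns None when there are too few samples or free space is not shrinking.
    """
    if len(daily_free) < 2:
        return None
    change_per_day = (daily_free[-1] - daily_free[0]) / (len(daily_free) - 1)
    if change_per_day >= 0:
        return None
    return daily_free[-1] / -change_per_day
```

A check like this belongs with your notifications rather than your alerts until the estimate gets short, and real disk usage is rarely this linear, so treat the number as a prompt to go look rather than a forecast.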
There is a significant sticking point to the first item above – are your services working in the eyes of your users? If a server is running along fine, but disappeared from DNS so no one can find it, it’s down. The service being able to respond is necessary, but not sufficient: Users will not give you credit because the problem was somewhere else, they really just care about outcomes. If they need to access a service and it doesn’t work when they try to access it, it’s down. This means when you’re looking at monitoring you want to think of how you’re going to cover the distance from where the users are all the way back to the servers that ultimately host the data. If you’re a small business, some of this you might get for free: You’re dependent on the same set of services your users are, so you’re interactively running the same basic set of validations they are. As your business scales up, you will need to think progressively more about how you verify service delivery from the standpoint of the end-users. The most common way of doing this is through setting up probes of some type – software that acts like a user from the point of presence where users are and does a basic test of availability. This could be as simple as a ping from across the Internet (or, hopefully, something more substantial like getting a page and comparing it against a reference) or reading a file off a network share. If you can set up a probe to go from where your users are to your servers then you can answer question #1 by saying that if your probes show things work, you’re good. It isn’t 100%, but in most small and mid-sized shops it’s close enough.
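To make the "get a page and compare it against a reference" idea concrete, here is a minimal probe sketch using only the Python standard library. The function names, the timeout, and the choice to compare against a known snippet of text (rather than the whole page) are assumptions for illustration. Run it from wherever your users are, not from the server itself.

```python
import urllib.request


def fetch(url, timeout=10):
    """Fetch url and return (HTTP status, body text)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


def probe(url, expected_text, fetcher=fetch):
    """Return (ok, reason) for a single user-style availability check."""
    try:
        status, body = fetcher(url)
    except Exception as exc:  # DNS failure, timeout, refused connection, HTTP error...
        return False, f"unreachable: {exc}"
    if status != 200:
        return False, f"unexpected HTTP status {status}"
    if expected_text not in body:
        return False, "page loaded but reference text is missing"
    return True, "ok"
```

Because any failure along the way (including a missing DNS entry) comes back as "not ok", the probe sees the service the way a user would. The fetcher parameter is only there so the check can be exercised without a network.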
A working approach – Alerts, Notification, and Diagnostics
When laying out your monitoring strategy, think about which alerts, notifications, and diagnostics you need in order to be sure you can answer the essential questions.
- Alerts: Also known as alarms, Alerts are designed to inform you whenever a business critical service isn’t working or will imminently fail. Alerts should go to your on-call staff, 24×7. If it isn’t something you’d resolve outside of business hours, it isn’t alert-worthy.
- Notification: Like an alert, your monitoring system should reach out and inform you about these events, but either the information is less severe or it isn’t a business critical service. Notifications generally don’t go to your on-call staff but instead to a regular queue to be reviewed during business hours.
- Diagnostics: Diagnostic monitoring helps you resolve problems quickly, avoid them if possible, and provide business optimization. This gets to question #3 on our list and can help with question #2.
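The three categories above can be sketched as a simple routing rule. This is only one way to encode the distinction; the Event fields and the three severity labels are hypothetical names I’ve picked for the example.

```python
from dataclasses import dataclass


@dataclass
class Event:
    service: str
    business_critical: bool
    severity: str  # assumed labels: "failing" (down or about to be), "degraded", "info"


def route(event):
    """Decide which of the three channels an event belongs to."""
    if event.business_critical and event.severity == "failing":
        return "alert"  # goes to on-call staff, 24x7
    if event.severity in ("failing", "degraded"):
        return "notification"  # regular queue, reviewed during business hours
    return "diagnostic"  # recorded for troubleshooting and capacity planning
```

The useful test of a rule like this is the one from the list: if an event lands in "alert" but nobody would fix it outside business hours, the rule (or the business_critical flag) is wrong.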
One problem with most tools is that achieving a useful alert configuration is very difficult. They either generate alerts at the drop of a hat or don’t notify you of the most important things. A main reason for this challenge is that most don’t monitor your environment from the standpoint of your services. They instead look for events at the OS level and presume to know what they indicate at the service level. This is considerably simpler for the product to do because it doesn’t require any particular information about your business or environment, but it doesn’t give you a user’s view of your services.
In the end, with rare exception, alerting based on generic operating system information doesn’t work well: the signal-to-noise ratio is generally not good enough. Instead, for the best quality alerts focus on service-based probing. The goal of alerts is to trigger your on-call staff to investigate and resolve an issue. They don’t need to be perfect in what they tell you; the goal isn’t to have the alert provide a detailed diagnostic, but rather to get a person engaged when and only when it is necessary. You should make every effort to ensure that alerts are delivered. Ideally, you want them to depend on the least amount of infrastructure to work. If possible, avoid using any internal mail relay to send alerts so that a local email outage doesn’t prevent you from receiving them. For example, you may want to get an external email account that notifies the on-call cell phone/blackberry/whatever and also sends a notification back to your internal email system for archival purposes.
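Here is a rough sketch of that arrangement using Python’s standard smtplib and email modules. The host, port, credentials, and addresses are placeholders you would supply for whatever external mail account you choose; nothing here is specific to any provider.

```python
import smtplib
from email.message import EmailMessage


def build_alert(service, detail, sender, on_call, archive):
    """Build an alert addressed to the on-call device, copied to the internal archive."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = on_call  # e.g. the on-call phone's email-to-text address
    msg["Cc"] = archive  # internal mailbox, for the record only
    msg["Subject"] = f"ALERT: {service}"
    msg.set_content(detail)
    return msg


def send_alert(msg, host, port, user, password):
    """Submit the alert directly to the external provider, bypassing any local relay."""
    with smtplib.SMTP_SSL(host, port, timeout=15) as smtp:
        smtp.login(user, password)
        smtp.send_message(msg)
```

The point of the design is the dependency list: the alert path needs only the monitoring host, its Internet connection, and the outside mail account. If the internal copy fails to arrive because your own mail is down, the on-call person still gets paged.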
The same constraint doesn’t apply to notifications. Since these are not expected to be handled outside of normal hours, they don’t need to be resilient to email and other infrastructure failures. After all – your alert monitoring will tell you if your infrastructure services fail. Notification can be accomplished through a simple email distribution list or the like. The most important part of a notification mechanism is that it reaches out and gets your attention without any users having to take action on their own.
For diagnostic monitoring you want to be able to capture and preserve a record of important events and metrics in your environment. For a discussion of recommended events and metrics, see Key Infrastructure Information to Capture. Of particular note, graphical metrics are great at helping diagnose problems that involve multiple systems, memory leaks, and capacity. For example, if you are tracking the free memory of each server then you can check if a particular problem corresponds to a time when the server had very little free memory. If the available memory on your servers forms a saw-tooth pattern, with steady depletion and then a spike back to normal, you probably have a memory leak that will cause a range of issues, most of which will look like something else.
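The saw-tooth is usually obvious on a graph, but if you want a rough automated check, something like the sketch below can flag it from the raw samples. The thresholds (at least three declining samples followed by a jump of 500 MB or more) are arbitrary assumptions to tune for your environment, and noisy data will need smoothing first.

```python
def sawtooth_resets(free_mb, min_decline_steps=3, min_jump_mb=500):
    """Return the sample indexes where free memory jumps back up after a steady decline."""
    resets = []
    decline_run = 0  # consecutive samples that did not increase
    for i in range(1, len(free_mb)):
        delta = free_mb[i] - free_mb[i - 1]
        if delta <= 0:
            decline_run += 1
        else:
            if decline_run >= min_decline_steps and delta >= min_jump_mb:
                resets.append(i)
            decline_run = 0
    return resets


def looks_like_leak(free_mb):
    """Two or more decline-then-recover cycles is treated here as a likely leak."""
    return len(sawtooth_resets(free_mb)) >= 2
```

Each index returned marks a recovery, which in practice is often a process restart or recycle. Lining those times up against your event log is a quick way to find out which process is doing the leaking.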
Monitoring Products
There are many, many products out there. If you’re interested in what I would recommend for a specific situation, please drop me a line. I’ve used a few products that have ranged from free to not very expensive to very expensive. If you want the best results, you are likely to spend some money – either writing some glue yourself or in purchasing a product or two. There is such a tremendous ROI on this that you shouldn’t be afraid to spend a little money even if your company has no history of purchasing IT tools. This is a good place to start a new tradition.
Event Monitoring
If you’ve got a homogeneous hardware environment from a major player (like Dell, IBM, or HP), each offers a vendor-specific monitoring solution that will do a credible job of capturing events. In my experience, the products are not great at metrics. For the Windows environment, I’ve had better experience using Microsoft Operations Manager (now System Center Operations Manager 2007, because it needed a longer title to get better) and then the vendor-specific management pack for hardware events. On the downside, whatever monitoring solution you pick you should invest in de-linting it: Hunting down and resolving each issue to keep the list of open items clean, ensuring you’ll react to the important items. In fairness, my team at a prior company found that this took so much time with MOM 2005 that it was only worth looking at when they already believed there was a problem. That’s not a ringing endorsement. On the bright side, each management pack includes a lot of built-in knowledge from the developers that designed each product, and that knowledge can save you a great deal of time.
Capturing Metrics
If you don’t want to spend anything but have some time on your hands, you can do everything listed above with MRTG. On the downside, it takes some time and patience to set up, so for commercial environments I recommend other options as more cost effective. We use PRTG, which is extremely cost effective and works particularly well in a Windows environment. But if you want to get it done and just can’t get anyone to fund a software purchase, MRTG can get to anything exposed via SNMP.
Using MRTG
If you want to use MRTG, you’ll end up setting it up to collect the sensor data from all the items you want to monitor. It outputs graphics and basic web pages that summarize these graphics. You’ll then want to create a few dashboard pages to be your overall summary. There are some helpful tools for creating your MRTG configuration file. Once you’ve done this, and had your first experience of Windows changing the exact SNMP target of a network interface because the driver was reinstalled, you’ll be looking for another option – like PRTG, which includes an SNMP helper class.
Using PRTG
We normally avoid directly mentioning commercial products and we never accept any form of compensation for our references, but this one is very cost effective and does a great job in small and medium sized companies. PRTG from Paessler can offer a wealth of information rolled up in a user interface that’s fast on its feet so you can know at a glance from your iPhone if your network is really down or not, do capacity planning, and a range of other tasks. It lets you easily monitor all of the trends in your environment – free disk on each server, network volume on each interface, stability of your wide area network, firewall stats… whatever you can get at with SNMP.
All of the configuration is done through an easy to use web interface, and it’s pretty light on the server as well. You could get the functional results of PRTG on your own with MRTG and other open source tools… but where’s your time best spent? It’s probably easier to explain a few hundred dollars in software than spending days setting up and testing a monitoring solution, and then wondering if it’s working.
Prepare Now for the Long Run
Installing monitoring now, particularly to capture metrics of your environment, will pay off substantially down the road when you’re trying to understand a problem (by giving you baseline information to know what’s changed and what’s normal). It’s worth enough in time savings each day to be worth making the time – evenings and weekends if necessary – to get it running. You can start small by monitoring a few things and expand as it proves its value. If you don’t have experience with a particular software tool, I definitely recommend evaluating it in your shop for a period of at least 30 days but preferably a few months before laying down a lot of coin. In my experience, every ISV I’ve worked with has been willing to provide up to a 90 day evaluation key for their product to give you time. During the evaluation recognize that you aren’t a paying customer yet, so don’t go crazy with their tech support. Your goal is to identify if the product achieves the goals you identified at the beginning, and many won’t – they may be pretty but too slow to update or interact with for your comfort, or not be reliable under load, or require too much time to maintain. Find out before you part with your cash.
What tools and techniques have you used? How have they worked out? Post your comments or drop me a line to continue the conversation.
Tags: Metrics, Monitoring, MRTG, Operations Manager, PRTG
Posted in Monitoring | No Comments »