
Walking the Walk - Gibraltar Moves You Down the Path

Written by Kendall Miller on June 19, 2009 – 3:29 am

If you’ve read more than one or two articles from Reliable Systems you probably have gotten the sense that we worry a lot about how to make things just work.   It’s that quality of anything where you get what you expect and what you need every time.  It can be in an experience (like a fun drive down a country road) or a product.  As a company if you can do this over and over you create a brand people develop a strong emotional connection to:  Apple, John Deere, Starbucks…

When you want to create a product that just works, you need to get all of the details right - from packaging through to maintenance and upkeep.  It’s not one thing that’s important, it’s all the things.  We are often engaged by senior management within a client when things aren’t working, and there are conflicting opinions on why.  Usually along the path technology is being blamed: Not enough, not the latest thing, not someone’s favorite thing, not working.  As we dig into the situation, rarely is the technology the dominant factor:  More often, it’s how the technology is being integrated with the people and processes that all have to work together.

One of the first things we have to do in these engagements is to establish the real facts on the ground:  What exactly are the problems in the system, who’s doing what with it, how many times.  It comes down to establishing metrics to make sure time and attention are paid to the parts that make the biggest difference in the outcome.  Armed with these facts in a form the business can consume it’s possible to create plans of action that deliver virtually regardless of budget.

So let’s make this easier

The biggest trick is then getting the facts you need on an ongoing basis, easily, and in a form that the business can consume.  For over a decade we’ve been building instrumentation right into the systems we’ve worked on.  We’ve created a variety of toolkits to make this easier over the years, refining them as technology and our experience have changed.

About 18 months ago we decided it was time to really invest down this path.  We believe in routinely capturing key computer metrics along with whatever logging the application can do on its own.  We won’t do a project without using a great logging system that includes a strategy for managing runtime exceptions.   Now that we’re collecting all this data, we need to have a way of managing the raw data and turning it into valuable business data.

The challenge is that businesses don’t get up in the morning and say “what our customers want us to do is have great internal tools”, so you’re nearly always doing this on the cheap:  Borrowing time from development projects internally to cobble together various free or cheap solutions.  Frankly, we got tired of having to create new solutions with each client out of the margins of each project.  So, we pooled our best thinking from all of the work we’ve done (including a previous product that we did license to our clients over the past decade called CLAS) and started creating Gibraltar.

Rock Solid from Initial Release

With Gibraltar we wanted much more than a log system.  Of course, it had to be a log system too - and a really easy to use one that could work with each of our client applications.  More than that, it had to:

  • Automatically capture all of the performance metrics we wanted.
  • Integrate with existing logging available on the platform, including whatever a client might already be doing (like custom in-house options)
  • Be absolutely, positively, for sure safe to run in production no matter what.   That means it can’t ever use too much disk space or disk throughput or block the application.
  • Not use more than 5% of the performance of the app
  • Include all of the tools necessary to get data from where it was collected to the people that could get value out of it
  • Include the ability to look at the detailed session data up to high level analysis:  What’s the error rate?  What’s it correlate to?  Are we doing better or worse in this version?

From this initial sketch of everything we wanted, we’ve spent 18 months, including four beta periods (from 2-4 months each), refining the vision with real customers and real scenarios.  It was essential to us that this not be just a tool for techies but be ready for use by people with a wide range of skills.  It had to be pretty and just do what you wanted, when you wanted it to.

We’ve added a lot of capabilities along the way:  It can generate print-ready reports about application reliability that can communicate with senior management, and you can define all kinds of custom metrics to easily track how your application is used and by whom.  We ran a number of betas to be sure that we had hit every goal listed above.  We’re happy to report that Gibraltar is in use within large deployments of custom applications, commercial applications, and small deployments right down to our corporate web site.

This tool isn’t for everyone - Our clients are nearly all Windows shops, and if they do any custom development it’s almost invariably in .NET, so that’s what we’ve targeted.   But, if you’re interested in easily getting real data on not just infrastructure (how well the application is running) but whether or not it just works, have we got an easy path for you.  You can see a quick demo video of how it works technically at Gibraltar Software.

You also don’t have to take my word for it at all, you can hear what one of our beta users did with it, which is really a more compelling story than what we might say.

I think you’ll find that we’ve sweated a lot of little details, from the exact design of the API and making sure the documentation was complete to rewriting our own licensing system to be very IT Admin friendly.  If we didn’t get a detail right, we want to know.  And the great news is that we’ve just begun:  We’re obsessed with the little things, and you can bet we’ll keep listening and watching to make it better.  Of course, this is made a lot easier because we’re using Gibraltar to monitor itself, and a select group of our users is sending that information back to us so we can make sure it just works in the field for real people.

It’s easy to start your journey

If you do development for Microsoft .NET, I’d encourage you to go over and download our commercial release of Gibraltar.  You’ll get great documentation, a free agent you can use like a flight recorder “black box” in every application you create, and a trial for a tool that will make you seem wise beyond your years.  And if you pay us the ultimate honor and purchase a permanent license, I can assure you that you won’t find anyone more committed to your satisfaction than we are.


Posted in Infrastructure, Monitoring, Software Development | No Comments »

Aviate, Navigate, Communicate

Written by Kendall Miller on March 27, 2008 – 12:29 am

If you’re involved in IT operations or even in business long enough, you’re going to experience some emergencies. During these emergencies, you’re going to have to balance several conflicting things that will demand your attention simultaneously:

  1. Cause of the problem: What is really happening? What device is at the root of the problem? (network switch died because an admin configured a loop in the fabric and misconfigured the port)
  2. Scope of the problem: Just how bad is it? Problems usually show up in one place (users can’t access Exchange) but those symptoms often represent a larger problem (network switch died)
  3. Communicate with users: First, people will be coming in the door to report the problem (do you know that Exchange is down?) and will be expecting updates on what’s going on and when it’ll be resolved (I really need to tell my friend about a party tonight, when will email be back up?)

Even in a shop with healthy staffing, this can be a lot to handle at once, particularly because your impulse is going to be to move between the root cause and communication. The first because it’s the real high value item - fix the problem. The last because whenever someone walks in, you’ll want to tell them what’s going on. The higher up the chain of command, the better you’ll want it to sound.

Whenever I’m wondering how to look at an IT Operations problem from a different perspective to gain insight, aviation is the first place I go. Think about the modern air transport system in the United States not from your usual perspective (a passenger on a plane) but from the standpoint of the people that live within it and operate it. For example, the life of a flight deck crew isn’t that different than system support in the sense that you have long periods of routine punctuated by periods of high stress activity. A classic rule taught to pilots when they’re first being trained is Aviate, Navigate, and Communicate - in that order.

  1. First, fly the plane. (Be in the middle of the air, not the bottom)
  2. Figure out where you are. (Over the White House)
  3. Then communicate. (Sorry Tower, would you like us to land?)

To make things easier on commercial planes, you have a pilot and co-pilot that divide these responsibilities by having clear designation of one being the Pilot Flying and the other (called the Pilot Not Flying or Pilot Monitoring) responsible for navigation and communication. This is practiced carefully during training with different parts of each emergency checklist assigned to either the Pilot Flying or Pilot Monitoring.

Now apply this back to a system problem:

  1. Create Clear Roles: Have your team know who is going to take on the role of Admin Flying and Admin Monitoring. This shouldn’t always be the same - it may be based simply on rotation (who is “up”) or who gets the trouble ticket or whatever within your shop. The team should declare their role in a situation so everyone knows their role.
  2. Perform in Order: If you have an Admin Monitoring, it’s their role to intercept external communication while the Admin Flying is working on the problem.
  3. Make a Checklist: When there is an emergency isn’t the time to be winging it. During quiet moments, talk as a team about what you would do in a hypothetical situation and work to distill out a basic checklist of things you’re going to run through. Focus on having it be the shortest list that verifies the largest set of items. When a problem shows up, use the checklist.

Problem Checklists

There are a few great advantages to using a checklist for problems:

  • Reduce Solution Focus: When diagnosing a problem, the general process is to propose a theory then test it to either prove or disprove it. This creates cycles where you create theories you have to believe in, then your job is to prove yourself wrong. It turns out that people tend to naturally bias towards information that proves themselves right and away from information that’s inconsistent with that diagnosis. Checklists for diagnostics can ensure that a significant breadth of information is available at the start of this process to enable the best theories to be created quickly.
  • Creates a Pace: It’s easy to get caught up in an emergency and start working at a pace that really isn’t necessary, but degrades your accuracy and effectiveness. Checklists stop the emotional cycle that reinforces the early stages of emergencies and instead create a steadily paced environment of gathering and verifying facts.
  • Establish a Baseline for Improvement: One of the most important parts of any emergency, and the least frequently used effectively, is an after action review. After you’re back up and everyone has calmed down, you want to learn as much as you can from what happened. The existence of a checklist creates a baseline for systematic (As opposed to random or by chance) improvement to your team’s ability to handle future problems. This is true even if the checklist wasn’t used; the fact it wasn’t used is itself an indictment of either the checklist itself or the team’s training.

While initially it may feel corny or even overly dramatic or bureaucratic to create checklists, there is real evidence to back up using them in environments where the downside cost (crash and death) is very steep, and if pressed to admit it most engineers will confess they have a mental checklist they use for standard problems.

Plans are Useless, Planning is Priceless.

Just by creating the checklists (even if they were never used) your team can get a lot of value:

  • Cooperative learning: This is a great tool for the team to learn from each other. Each admin will share their best tips and tricks from their mental checklist and be surprised that they don’t line up. Where they don’t, the discussion on which approach is better and why is gold. It’s hard to get the same result with a contrived exercise, so use this opportunity to build the checklist and maintain it as a team.
  • Clarifies Automation: While creating the checklist, it will naturally precipitate ideas for how to automatically identify and possibly solve steps in the checklist itself. For example, if a step in the checklist is to verify Internet connectivity, how are you going to accomplish that? Instead of having an ad-hoc mechanism, can an automated mechanism be put in place so that you now can quickly check that data point without variation?
  • Encourages Collaboration: If the team collaborates to create the checklist, when a problem occurs they will be more likely to collaborate on resolving the problem because they already have had the experience of working together as a team. This will tend to replace individual ego with group esprit de corps.
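As a rough illustration of the automation point above, here is a minimal Python sketch of a checklist whose steps are small functions rather than ad-hoc manual checks. The names `tcp_reachable` and `run_checklist` are made up for this example, and a plain TCP connection test is only one assumed way to "verify Internet connectivity"; your team's checklist will have its own steps.

```python
import socket
from typing import Callable, List, Tuple


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False


def run_checklist(checks: List[Tuple[str, Callable[[], bool]]]) -> List[Tuple[str, bool]]:
    """Run every check in order and record pass/fail.

    A check that raises is recorded as failed so one broken step
    never stops the rest of the checklist from running.
    """
    results = []
    for name, check in checks:
        try:
            ok = bool(check())
        except Exception:
            ok = False
        results.append((name, ok))
    return results
```

A checklist is then just a list such as `[("Internet reachable", lambda: tcp_reachable("example.com", 443))]` (the host is a placeholder). Because every step runs the same way each time, the result is a data point you can check quickly and without variation.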

An Exercise Left to the Interested Student

A friend of mine also pointed out the principle that if you have a checklist that always ends in the same action, why not automate the action in response to the checklist? In other words, if you can automate the detection steps that lead up to the action, then find a way to automate the resolution. You will often find you get here in inches: You progressively improve your monitoring so that you can find problems faster. Once this is reliable, you start just hooking up alarms to the monitoring so you don’t wait for a call from a real user or a higher level system. Once that’s working well enough, you get tired of performing the resolution manually so you write a script that takes a few arguments to perform the resolution. Now, just connect them together.
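The last step, connecting detection to resolution, might look something like the following Python sketch. Everything here is an assumption for illustration: `detect` stands in for whatever monitoring check you already trust, `resolve` for the script you got tired of running by hand, and the threshold of three consecutive detections is an arbitrary guard against acting on a single blip.

```python
from typing import Callable


def make_auto_resolver(detect: Callable[[], bool],
                       resolve: Callable[[], None],
                       threshold: int = 3) -> Callable[[], bool]:
    """Build a poll function that links a detection check to a resolution.

    Each call to the returned function runs detect(). After `threshold`
    consecutive detections it runs resolve() once and resets the count.
    The poll function returns True only when resolve() was run.
    """
    consecutive = 0

    def poll() -> bool:
        nonlocal consecutive
        if detect():
            consecutive += 1
        else:
            consecutive = 0  # a healthy reading clears the streak
        if consecutive >= threshold:
            resolve()
            consecutive = 0
            return True
        return False

    return poll
```

You would call `poll()` from whatever scheduler you already have. Keeping the threshold and the two callables separate means you can tighten the detection or swap the resolution script without touching the glue.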

Move Forward One Step Today

The best part about this is that you can get there in small steps that even the busiest team can fit into their schedule with a confidence that they will pay back in time saved in the future. With practice, it will become second nature and make it easier for your team to accommodate new processes and service requirements with ease. In the end, isn’t that what you need to ensure your team is viewed as a vital part of your organization?


Posted in Management, Monitoring | No Comments »

The Wire Never Lies

Written by Kendall Miller on February 25, 2008 – 12:59 am

You need to find and resolve a problem with your web or multi-tier application, and you need to do it quickly. It may be happening in production or in a place where you can’t easily set up a test environment or get a traditional debugger involved. Here’s an approach that will help you narrow down and in many cases resolve the issue. The best part is that in most cases it won’t require specialized knowledge of the language the application is written in.

Don’t be afraid to pick up a packet sniffer and look at the actual Ethernet packets running back and forth between the parts of your system. You’ll probably find the issue much more quickly than you think, and you can do this with an application in production without the original source code, at least enough to know what your options are. The wire never lies - it tells you exactly what your application is really doing over the network.

For the purposes of this article, consider a basic web application. It most likely has a set of code that runs on the web server (which could be in any language) and then talks with a back-end database, probably located on a different system if this is a large web application. Now take two common categories of problems: A performance issue and an occasional web site error.

Our basic approach is consistently:

  1. Find the layer of the architecture where the problem is being introduced by tracing the network
  2. Dissect what is happening in that layer down to the process that is introducing the problem.
  3. Review the implementation of just the affected commands in the suspect process to resolve the issue.

Our first goal is to narrow down what layer of the system is the most likely culprit - the web application or the database. When doing this, I’ve found that it pays to quickly pull out a tool that will tell me what’s going on across the network. This is where the wire never lies comes from: If you use a packet sniffer or some other tool to see what’s happening “on the wire”, you will know exactly what is going on between your network layers. Not what you think should happen or want to happen - what is actually happening. This is so important because we develop in a world with many layers of abstraction between what we write and the physical I/O commands that ultimately carry out our wishes.

Let’s start with an example of a performance problem: a user viewing a detail page in your web application reports that it takes several seconds to display, and they believe it is getting slower over time.

Find the problem layer

In our example, we have several possibilities: The user establishes a connection from their web browser to the web server which in turn makes database calls to the database server. If clustering is involved it is somewhat more complicated because with a cluster it likely goes web browser to load balancing appliance to web server to database. Regardless, our first goal is to narrow down in which layer of the architecture the problem is being introduced.

In the case of a performance problem, the layer that introduces the problem is the first layer that is taking up the majority of the time and not waiting on another layer. The quickest way to resolve this is to do some strategic network sniffing at key points in your infrastructure to watch the request be processed. This may not seem quick, but with practice it becomes very natural.

A good place to start is on the web server. In many cases sniffing the traffic at the web server alone is sufficient to find the entire problem because it sees the traffic to & from the web browser and upstream to the database server. You can use a variety of tools to do this, but I like Wireshark. It’s free, fast, and very capable. Microsoft also ships a basic network monitor, but it doesn’t have some of the neat-o features Wireshark has that make analysis quick. Until recently, Wireshark was called “Ethereal” but that name had to be changed due to trademark problems.

What we’re looking for is to compare the traffic to & from the web browser and what’s traveling off of the web server. We want to compare timings and volumes to understand what happens between when the web request starts and when it completes. Do a complete packet capture of one problem web request, then get ready to spend some time understanding it.

The first thing you’ll likely notice is that there is a great deal more information here than you likely expected. Even a simple HTTP GET request results in a lot more network traffic than you might expect. If your site uses SSL, you’ll also discover that in fact the traffic to and from your web browser is encrypted - remember, we’re looking at what’s going on at Layer 2 of the network, so this is a good thing. If you’re using encryption within your own data center from the web server to the database server this is going to really get in your way (and you should ask yourself why you’re doing that as a general practice, but that’s another article). If your web site uses content compression the response will also look encrypted.

When analyzing a trace, do the following:

  1. Eliminate spurious client traffic: Filter out requests that aren’t from your test client. If they are part of the problem, it will generally still show up in calls the web server is making to the database or other systems, and you don’t need the volume.
  2. Narrow down the time window: You probably started the trace a few seconds before your hit, and ended it a few seconds after. Look for the first packet from your client’s IP address and eliminate everything before it; likewise look for the last packet to your client’s IP address and eliminate everything after.
  3. Look at timing: You want to survey the sequence of events to get a feel for what happened exactly in order. Your primary concern is going to be traffic you know could be related (such as to your SQL server) but don’t ignore authentication traffic, it can be a secret performance killer (time spent negotiating security between your web application and another server). Time spent on other servers will show as a quiet spot in the sequence - where a request has been sent off but the response hasn’t come back yet. Note that you need to be reading the timestamps to get a good feel for this; a lot of packets isn’t necessarily a bad thing - networks are very fast in general. If all the packets are happening in the space of 20 milliseconds, it isn’t your performance problem.
  4. Look at the volume: A quick way to get a feel for this is to use the ability for the packet sniffer to reassemble packets into a stream. This shows you the true conversation that is going on between the layers, and will show you how many bytes were moved to get it done. This is very helpful if you discover, for example, that you’re passing back very large recordsets you didn’t expect. Alternately, it could be that the data is simply inefficiently stored or packaged. For example, if a column in the database is configured for Unicode and the caller requests it in that format, it will take twice as much data across the wire to move it. XML data in the database can also get you in trouble by causing unexpectedly high volume.
  5. Look at the detail: If the problem isn’t apparent yet, look at the specific requests being made. For example, you may notice repeated requests that may indicate an error/retry cycle in the application.
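To make the timing and volume steps above concrete, here is a small Python sketch that works on a packet summary you have already exported from your sniffer. The `Packet` tuple and both function names are assumptions for this example, not part of any tool; it simply ranks the quiet spots in the sequence and totals the bytes moved between each pair of hosts.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Packet(NamedTuple):
    time: float   # seconds since the start of the capture
    src: str      # sender address
    dst: str      # receiver address
    length: int   # bytes on the wire


def largest_gaps(packets: List[Packet], top: int = 3) -> List[Tuple[float, Packet, Packet]]:
    """Return the `top` longest silences as (gap_seconds, packet_before, packet_after).

    A long silence right after a request leaves a layer suggests the time
    is being spent on the system that request was sent to.
    """
    ordered = sorted(packets, key=lambda p: p.time)
    gaps = [(after.time - before.time, before, after)
            for before, after in zip(ordered, ordered[1:])]
    return sorted(gaps, key=lambda g: g[0], reverse=True)[:top]


def bytes_per_conversation(packets: List[Packet]) -> Dict[Tuple[str, str], int]:
    """Total bytes exchanged between each pair of hosts, in either direction."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for p in packets:
        pair = tuple(sorted((p.src, p.dst)))
        totals[pair] += p.length
    return dict(totals)
```

If the biggest gap sits between a packet going to the database and the first packet coming back, the web tier was waiting rather than working. If one conversation carries far more bytes than you expected, that points at the volume problems described in step 4.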

It’s worth pointing out that a network volume problem like you would find in step four above will not generally show up if you’re looking at the network interface statistics in your monitoring system because it will only last a few seconds; however, it can still be the culprit.

Alternatives to Packet Sniffing

Experience with a packet sniffer is handy because it always works, regardless of the application’s specific technology. Unfortunately, like any generic tool that also means it can’t take advantage of a lot of domain knowledge. If you have a good reason to suspect you know what layer the problem is in or don’t feel comfortable jumping down to the wire right away, you can take advantage of a few other tools in specific circumstances.

Web Server Traffic Logs

If you have the optional extra information being captured into the web server logs beyond the original NCSA spec it should include the time it takes to transfer the data to the client and the number of bytes transferred. This should be enough to either validate or exclude the link between the web browser and the server. You’re looking in the logs for just a few things:

  1. Repeated Request Patterns: Web browsers try really hard to not fail. They will automatically and quietly respond to redirect requests and some other HTTP status codes and attempt to authenticate before throwing in the towel. This will show up in the logs as a pattern of hits in rapid succession from the same client IP address. You may have a situation where a client is being sent through several redirects, or is getting a retryable HTTP error on the first hit.
  2. Response Time: Look at the number of bytes transferred and the time to transfer to the client as well as the total request time. Compare the time to transfer with the total request time to exclude the client to web server link as being the performance problem.
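As a hedged sketch of the two log checks above, the following Python reads W3C extended format lines and looks for bursts of hits from one client and for the slowest requests. It assumes your server is configured to log the `date`, `time`, `c-ip` and `time-taken` fields and that a `#Fields:` line is present; the function names, the two second window and the three hit minimum are illustrative choices, not recommendations from any vendor.

```python
from collections import defaultdict
from datetime import datetime


def parse_w3c(lines):
    """Yield one dict per request, using the #Fields: directive for column names."""
    fields = []
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#") and fields:
            yield dict(zip(fields, line.split()))


def rapid_repeats(records, window_seconds=2.0, min_hits=3):
    """Return the client IPs that made at least min_hits requests within window_seconds."""
    times = defaultdict(list)
    for r in records:
        stamp = datetime.strptime(r["date"] + " " + r["time"], "%Y-%m-%d %H:%M:%S")
        times[r["c-ip"]].append(stamp)
    flagged = set()
    for ip, stamps in times.items():
        stamps.sort()
        for i in range(len(stamps) - min_hits + 1):
            if (stamps[i + min_hits - 1] - stamps[i]).total_seconds() <= window_seconds:
                flagged.add(ip)
                break
    return flagged


def slowest(records, top=5):
    """Return the `top` requests with the largest time-taken value."""
    return sorted(records, key=lambda r: int(r["time-taken"]), reverse=True)[:top]
```

Load the log once with `records = list(parse_w3c(open_file))` so both checks can walk the same data. A flagged client whose burst contains redirects or authentication challenges is a candidate for the repeated request pattern described in item 1.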

SQL Server Profiler

An alternate approach if your application is a heavy SQL application is to use the SQL Profiler to get a nicer view of what is happening at the SQL level. This is worth it if you have high confidence that the problem is going to be in evidence by inspecting the SQL commands executed by your software. If you aren’t sure, start with a network trace anyway because you can establish some degree of confidence quickly whether or not it’s a lower-level problem.

Side note for Developers

If you’re writing code that makes calls to the database, it’s worth it to run through your main use cases and use SQL Profiler to verify what is happening at the database level. I guarantee it’ll be an eye opener. In particular, watch for events that don’t necessarily cause your code to break but are signs things aren’t entirely right. For example:

  • Excessive database connect/disconnect: You’d be surprised how expensive this can be. From a pure performance standpoint, you ideally want to see it reuse a pooled connection, make all of its calls for that request, and then be done. If you see a lot of pooled connection resets or, even worse, real database connect and disconnect events this should be investigated.
  • Database deadlocks: Many developers automatically retry database exceptions to handle the wide range of use cases where a temporary issue (such as a missing or unusable database connection) occurs. This can also generally recover from deadlocks, but deadlocks are a performance killer. You should investigate them every time.
  • Unexpected calls: You should have a mental picture of what database calls are going to be made and how many rows should be returned from the queries (at least approximately). If you can optimize your code to reduce the number of calls, it’s most likely worth it. Each call will add linear time to your application which will tend to create performance issues. You can’t beat the speed of light.
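On the deadlock point above: if your code does retry database errors, one way to keep deadlocks from disappearing silently is to record every retry. This is a minimal Python sketch under stated assumptions. `DeadlockError` is a hypothetical stand-in for whatever exception your database driver actually raises, and `call_with_retry` and its `on_deadlock` hook are names invented for this example.

```python
import logging
import time

log = logging.getLogger("db.retry")


class DeadlockError(Exception):
    """Hypothetical stand-in for the deadlock exception your database driver raises."""


def call_with_retry(operation, attempts=3, delay_seconds=0.0, on_deadlock=None):
    """Run operation(), retrying on DeadlockError up to `attempts` total tries.

    Every deadlock is logged (and passed to on_deadlock, if given) so the
    retry hides the failure from the user but not from the team.
    The final failure is re-raised.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except DeadlockError as exc:
            log.warning("deadlock on attempt %d of %d: %s", attempt, attempts, exc)
            if on_deadlock is not None:
                on_deadlock(attempt, exc)
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)
```

The point of the hook is that a count of deadlocks per day becomes one more metric you can review, rather than something you only discover in a profiler trace.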

Side note for DB Administrators

If configured correctly, the profiler has a very low load on the SQL Server. In the past, my IT team has gotten value out of periodically setting up a trace to run for a day in production looking for particularly problematic events like database deadlocks.

You have monitoring, right?

You can often eliminate a lot of possibilities if you have routine system metrics being captured. Check out our list of recommended metrics to capture. If you’ve done this, you can out of the box eliminate problems such as network interface saturation, processor, or memory utilization on the different servers and appliances in the system. If you’re having a performance problem and have one of these issues, you’ll want to start there. In the case of a network problem you will want to either do a network capture to see what is using the bandwidth or, if you have it, take advantage of sFlow or NetFlow to get to the same results without having to look at the packets.

Installing Wireshark

Wireshark works by installing a packet capture service and using it to intercept all of the traffic as it’s coming off the wire and going on. It uses a separate service on Windows to do this - WinPcap. Because it installs as a system driver, you may legitimately be uncomfortable installing it on a production system that’s live. In my past experience, we’ve often left Ethereal installed locally on one node in each web cluster (and indeed in several cases just integrated it into our standard server installation) with no ill effects.

If you don’t want to run Wireshark & WinPcap on your production server, there is another option but it’s somewhat tricky to set up: Most commercial network switches support port mirroring that will allow you to configure one switch port to get all of the traffic received & sent by another port. You can use this then to set up an interface on your test system that is mirroring the network port used by the server you want to monitor. There are several downsides with this in my eyes: First, you’re changing the configuration of your production switches and possibly moving cables around (there are often restrictions on using port mirroring across switches, so depending on your physical hardware you will usually have to plug into a free port on the same switch as the server you want to monitor) which in many ways invites more human error potential than leaving another system driver installed. Second, if you forget to change the configuration back when you’re done and someone else plugs into that switch port they’re in for a surprise.

Credit where it’s due

I have to give Ingo Hammer credit for introducing the phrase the wire never lies in a presentation he gave at Tech Ed 2005. His presentation led me to have my team reorder our troubleshooting process to take what had been a late step and move it way up in the process - using a packet sniffer to see what was going on at the physical network level when troubleshooting system problems. Much of this article is an elaboration on what he talked about with my team’s experience added in. I can’t find a copy of his original presentation online, but if someone knows where it is posted (legally) let me know.


Posted in Monitoring | No Comments »

How do you know? IT Monitoring for small & medium businesses.

Written by Kendall Miller on February 23, 2008 – 1:00 am

Sit back for a minute and ask yourself this question: How do you know?

  • How do you know that your users are able to get to the services you provide, right now?
  • How do you know that all of the hardware you’re responsible for is working, right now?
  • How do you know what you should be working on right now? Project work, or event-driven work? (Substitute in trouble ticket, help desk ticket, whatever for event-driven work)

The last question is in the family of questions centered on how to balance workload, which is another article. The first two questions are ones that you should be able to answer and, just as importantly, ones that are hard to explain after the fact if you can’t. The key to these questions is being in the know - having mechanisms to ensure you know what’s working without requiring your active involvement. You need to have a comprehensive monitoring strategy, and most likely you can’t spend a great deal to get it done. The good news is that at a moderate scale (say up to 200 monitored devices) you shouldn’t have to.

When deciding how to monitor, we start with the questions we want to be sure we can answer. You want to go through this exercise to avoid being swept up in cool visuals, dashboards, and other golly gee whiz stuff that most smart monitoring vendors put in their systems. It’s not that these things are bad - far from it - but they don’t change the fact that you need to be sure your monitoring answers the essential questions. It is very unlikely that you’ll find a tool that answers all of the questions you have. This isn’t inherently a problem, but you are going to want to minimize the number of tools you have to work with because each has an operational cost.

The Essential Questions

In order, you want to be sure you know:

  1. Is everything working right now, as far as the users are concerned? There is a distance between knowing that a server is running and knowing that your users can access the services hosted on the server.
  2. Is anything about to go wrong that will cause an interruption in service? It could be a server about to run out of disk space, a non-redundant drive that’s reporting soft errors, etc.
  3. Are we using our resources effectively? You can’t count on users to report occasional glitches or when things are just slow. Can you balance resources or shift load to provide better performance with what you already have?

Ideally, you want to set up a system (which may be a collection of different pieces of software, all working together) to make sure you know the answers to these three questions without relying on the active participation of you or your team. To answer these questions you’ll need a combination of event monitoring (for the first question and part of the second) and metrics for part of the second and the third.

There is a significant sticking point to the first item above - are your services working in the eyes of your users? If a server is running along fine, but disappeared from DNS so no one can find it, it’s down. The service being able to respond is necessary, but not sufficient: Users will not give you credit because the problem was somewhere else; they really just care about outcomes. If they need to access a service and it doesn’t work when they try to access it, it’s down. This means when you’re looking at monitoring you want to think of how you’re going to cover the distance from where the users are all the way back to the servers that ultimately host the data. If you’re a small business, some of this you might get for free: You’re dependent on the same set of services your users are, so you’re interactively running the same basic set of validations they are. As your business scales up, you will need to think progressively more about how you verify service delivery from the standpoint of the end users. The most common way of doing this is through setting up probes of some type - software that acts like a user from the point of presence where users are and does a basic test of availability. This could be as simple as a ping from across the Internet (or, hopefully, something more substantial like getting a page and comparing it against a reference) or reading a file off a network share. If you can set up a probe to go from where your users are to your servers then you can answer question #1 by saying that if your probes show things work, you’re good. It isn’t 100%, but in most small and mid-sized shops it’s close enough.
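As a rough illustration of the "get a page and compare it against a reference" style of probe, here is a minimal Python sketch. It assumes you have a static health page whose content never changes, so a SHA-256 hash of a known-good copy can serve as the reference; the function names and the hash comparison are my own illustrative choices, not part of any particular monitoring product.

```python
import hashlib
import urllib.request


def fetch_body(url, timeout=10):
    """Fetch a page over HTTP(S) and return the raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def probe(url, reference_sha256, fetch=fetch_body):
    """Return (ok, detail). The service counts as up only if the page
    can be fetched AND its content matches the known-good reference."""
    try:
        body = fetch(url)
    except Exception as exc:  # DNS failure, timeout, refused connection, HTTP error
        return False, f"fetch failed: {exc}"
    digest = hashlib.sha256(body).hexdigest()
    if digest != reference_sha256:
        return False, "page content differs from reference"
    return True, "ok"
```

Run something like this from where your users are (not from the server itself), so that DNS, the firewall, and the network path are all exercised along with the service. The `fetch` parameter is there only so the check can be tried out without touching the network.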

A working approach - Alerts, Notification, and Diagnostics

When laying out your monitoring strategy, think about which alerts, notifications, and diagnostics you need to be sure you can answer the essential questions.

  • Alerts: Also known as alarms, Alerts are designed to inform you whenever a business critical service isn’t working or will imminently fail. Alerts should go to your on-call staff, 24×7. If it isn’t something you’d resolve outside of business hours, it isn’t alert-worthy.
  • Notification: Like an alert, your monitoring system should reach out and inform you about these events, but either the information is less severe or it isn’t a business critical service. Notifications generally don’t go to your on-call staff but instead to a regular queue to be reviewed during business hours.
  • Diagnostics: Diagnostic monitoring helps you resolve problems quickly, avoid them if possible, and provide business optimization. This gets to question #3 on our list and can help with question #2.
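One way to make the three categories above concrete is a small routing rule. The sketch below is only an illustration: the severity names ("failure", "warning", "info") and the idea of tagging each service as business critical are assumptions of mine, and your own thresholds will differ.

```python
def route(severity, business_critical):
    """Decide where an event goes.

    severity: "failure" (down now or about to fail), "warning", or "info"
    business_critical: True if you would fix this outside business hours
    """
    if severity == "failure" and business_critical:
        return "alert"          # page the on-call staff, 24x7
    if severity in ("failure", "warning"):
        return "notification"   # regular queue, reviewed during business hours
    return "diagnostic"         # recorded for later analysis only
```

For example, `route("failure", True)` returns `"alert"`, while the same failure on a non-critical service, or a mere warning on a critical one, lands in the notification queue.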

One problem with most tools is that achieving a useful alert configuration is very difficult. They either generate alerts at the drop of a hat or don’t notify you of the most important things. A main reason for this challenge is that most don’t monitor your environment from the standpoint of your services. They instead look for events at the OS level and presume to know what they indicate at the service level. This is considerably simpler for the product to do because it doesn’t require any particular information about your business or environment, but it doesn’t give you a user’s view of your services.

In the end, with rare exception, alerting based on generic operating system information doesn’t work well: the signal-to-noise ratio is generally not good enough. Instead, for the best quality alerts focus on service-based probing. The goal of alerts is to trigger your on-call staff to investigate and resolve an issue. They don’t need to be perfect in what they tell you; the goal isn’t to have the alert provide a detailed diagnostic, but rather to get a person engaged when and only when it is necessary. You should make every effort to ensure that alerts get through. Ideally, you want them to depend on the least amount of infrastructure to work. If possible, avoid using any local mail relay to send alerts, so that a local email outage doesn’t prevent you from receiving them. For example, you may want to get an external email account that notifies the on-call cell phone/blackberry/whatever and also sends a notification back to your internal email system for archival purposes.

The same constraint doesn’t apply to notifications. Since these are not expected to be handled outside of normal hours, they don’t need to be resilient to email and other infrastructure failures. After all - your alert monitoring will tell you if your infrastructure services fail. Notification can be accomplished through a simple email distribution list or the like. The most important part of a notification mechanism is that it reaches out and gets your attention without any users having to take action on their own.

For diagnostic monitoring you want to be able to capture and preserve a record of important events and metrics in your environment. For a discussion of recommended events and metrics, see Key Infrastructure Information to Capture. Of particular note, graphical metrics are great at helping diagnose problems that involve multiple systems, memory leaks, and capacity. For example, if you are tracking the free memory of each server then you can check if a particular problem corresponds to a time when the server had very little free memory. If the available memory on your servers forms a saw-tooth pattern, with steady depletion and then a spike back to normal, you probably have a memory leak that will cause a range of issues, most of which will look like something else.
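You will usually spot the saw-tooth by eye on a graph, but the pattern can also be checked for in the raw samples. The following Python sketch is a simple heuristic of my own, not a feature of any tool mentioned here: it looks for a run of consecutive declines in free memory followed by a jump that recovers at least half of what was lost. The parameter names and default thresholds are assumptions to tune for your environment.

```python
def find_sawtooth_resets(samples, min_decline_steps=3, recovery_fraction=0.5):
    """Return the indexes where free memory jumps back up after a
    steady decline (the 'teeth' of a saw-tooth pattern).

    samples: free-memory readings in time order (any consistent unit)
    """
    resets = []
    decline_steps = 0
    run_start = samples[0] if samples else 0
    for i in range(1, len(samples)):
        prev, curr = samples[i - 1], samples[i]
        if curr < prev:
            if decline_steps == 0:
                run_start = prev      # remember where this decline began
            decline_steps += 1
        else:
            lost = run_start - prev   # memory lost during the decline
            recovered = curr - prev
            if (decline_steps >= min_decline_steps and lost > 0
                    and recovered >= recovery_fraction * lost):
                resets.append(i)
            decline_steps = 0
    return resets
```

Two or more resets at roughly regular intervals are a strong hint that something is leaking and being restarted (or crashing). Real data is noisy, so in practice you would smooth the samples or use hourly averages before applying a check like this.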

Monitoring Products

There are many, many products out there. If you’re interested in what I would recommend for a specific situation, please drop me a line. I’ve used a few products that have ranged from free to not very expensive to very expensive. If you want the best results, you are likely to spend some money - either writing some glue yourself or in purchasing a product or two. There is such a tremendous ROI on this that you shouldn’t be afraid to spend a little money even if your company has no history of purchasing IT tools. This is a good place to start a new tradition.

Event Monitoring

If you’ve got a homogeneous hardware environment with a major player (like Dell, IBM, or HP) they each offer a vendor-specific monitoring solution that will do a credible job of capturing events. In my experience, the products are not great at metrics. For the Windows environment, I’ve had better experience using Microsoft Operations Manager (now System Center Operations Manager 2007, because it needed a longer title to get better) and then the vendor-specific management pack for hardware events. On the downside, whatever monitoring solution you pick, you should invest in de-linting it: Hunting down and resolving each issue to keep the list of open items clean, ensuring you’ll react to the important items. In fairness, my team at a prior company found that this took so much time with MOM 2005 that it was only worth looking at when they already believed there was a problem. That’s not a ringing endorsement. On the bright side, each management pack includes a lot of built-in knowledge from the developers that designed each product, and that knowledge can save you a great deal of time.

Capturing Metrics

If you don’t want to spend anything but you have some time on your hands, you can do everything listed above with MRTG. On the downside, it takes some time and patience to set up, so for commercial environments I recommend other options as more cost effective.  We use PRTG, which is extremely cost effective and works particularly well in a Windows environment.  But if you want to get it done and just can’t get anyone to fund a software purchase, MRTG can get to anything exposed via SNMP.

Using MRTG

If you want to use MRTG, you’ll end up setting it up to collect the sensor data from all the items you want to monitor. It outputs graphics and basic web pages that summarize these graphics. You’ll then want to create a few summary dashboard pages to be your overall summary. There are some helpful tools for creating your MRTG configuration file. Once you’ve done this and had your first experience where Windows changes the exact SNMP target of a network interface because the driver was reinstalled, you’ll be looking for another option - like PRTG, which includes an SNMP helper class.

Using PRTG

We normally avoid directly mentioning commercial products and we never accept any form of compensation for our references, but this one is very cost effective and does a great job in small and medium sized companies.  PRTG from Paessler can offer a wealth of information rolled up in a user interface that’s fast on its feet so you can know at a glance from your iPhone if your network is really down or not, do capacity planning, and a range of other tasks.  It lets you easily monitor all of the trends in your environment - free disk on each server, network volume on each interface, stability of your wide area network, firewall stats… whatever you can get at with SNMP. 

All of the configuration is done through an easy to use web interface, and it’s pretty light on the server as well.  You could get the functional results of PRTG on your own with MRTG and other open source tools…  but where’s your time best spent?  It’s probably easier to explain a few hundred dollars in software than spending days setting up and testing a monitoring solution, and then wondering if it’s working. 

Prepare Now for the Long Run

Installing monitoring now, particularly to capture metrics of your environment, will pay off substantially down the road when you’re trying to understand a problem (by giving you baseline information to know what’s changed and what’s normal). The time savings each day are enough to justify making the time - evenings and weekends if necessary - to get it running. You can start small by monitoring a few things and expand as it proves its value. If you don’t have experience with a particular software tool, I definitely recommend evaluating it in your shop for a period of at least 30 days but preferably a few months before laying down a lot of coin. In my experience, every ISV I’ve worked with has been willing to provide up to a 90 day evaluation key for their product to give you time. During the evaluation, recognize that you aren’t a paying customer yet, so don’t go crazy with their tech support. Your goal is to identify if the product achieves the goals you identified at the beginning, and many won’t - they may be pretty but too slow to update or interact with for your comfort, or not be reliable under load, or require too much time to maintain. Find out before you part with your cash.

What tools and techniques have you used? How have they worked out? Post your comments or drop me a line to continue the conversation.


Posted in Monitoring | No Comments »

Key Infrastructure Information to Capture

Written by Kendall Miller on February 21, 2008 – 11:28 pm

This article is a background reference for the important things to monitor in a small to mid-sized IT infrastructure. This information is largely independent of the tool or technology you use to capture it.

Server Monitoring

While this article specifically talks about what’s typical for a Windows environment, Linux and other variants of UNIX will have equivalent metrics that are generally useful as well. Note that application monitoring is distinct from server monitoring; each application will tend to have its own strengths and weaknesses for monitoring and should be considered separately. In this section, we’re looking at the operating system and hardware.

Metrics

  • For each network interface: Note: If you can capture this at the network switch, then you don’t need to do it here.
    • bps in and out: This is often captured as bytes transferred, and the collector then has to work that back into rates.
    • Interface speed: The current connection speed of the interface
  • For each disk volume
    • Free bytes: The number of bytes currently available.
    • Queue Length: The number of pending IO requests.
  • Memory
    • Free bytes: The number of bytes currently available.
    • Total bytes: The memory capacity in the system.
  • Processor
    • Utilization: Total processor utilization. On a UNIX system, substitute Load (which works entirely differently)

You can capture more metrics than this if you want; if your capturing system can handle it without stress, then for the most part go wild. That said, put the above on a dashboard where you can easily see them because the goal of the dashboard is to give you a quick sense of overall good & bad and a benchmark for comparison when there are problems.
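To illustrate the "work that back into rates" step for the network interface metrics, here is a minimal Python sketch. It assumes the collector has two readings of a byte counter and knows the time between polls; the function names are mine, and counter wrap-around is deliberately ignored here.

```python
def rate_bps(prev_bytes, curr_bytes, interval_seconds):
    """Convert two byte-counter readings into an average bits-per-second rate."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (curr_bytes - prev_bytes) * 8 / interval_seconds


def utilization(bps, interface_speed_bps):
    """Fraction of the interface's current connection speed in use (0.0 to 1.0)."""
    return bps / interface_speed_bps
```

For instance, a counter that moves from 1,000,000 to 4,750,000 bytes over a 300 second poll interval works out to 100,000 bps, which is a tiny fraction of a 100 Mbps interface. This is also why interface speed is worth capturing alongside the rates: a raw bps figure means little without it.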

If the server is connected to a SAN, consider each Fibre Channel interface to be a network interface.

Note: If monitoring interfaces that can run at gigabit or greater speeds, you will want 64-bit counters under SNMP v2 or better to prevent counter overflow creating erratic, irrational readings.
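The arithmetic behind this note is easy to check. The Python sketch below assumes the counter counts bytes and that the link runs flat out; the function names are illustrative only. It also shows the usual modulo trick for computing a delta across a single wrap, which only works if you poll more often than the counter can wrap.

```python
def seconds_until_wrap(counter_bits, link_bps):
    """Time for a byte counter of the given width to wrap at full line rate."""
    return (2 ** counter_bits) * 8 / link_bps


def counter_delta(prev, curr, counter_bits):
    """Bytes transferred between two polls, assuming at most one wrap."""
    return (curr - prev) % (2 ** counter_bits)
```

A 32-bit byte counter on a saturated gigabit link wraps in roughly 34 seconds, so a typical five minute poll interval can miss several wraps and produce nonsense. A 64-bit counter at the same rate takes thousands of years to wrap, which is why the problem effectively disappears.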

Event Monitoring

The most important thing to be able to do is capture hardware events, particularly from redundant hardware like RAID controllers, redundant power supplies, etc. If it’s redundant, it has to be monitored so you will know when it fails. Virtually all vendors will provide a mechanism for monitoring their hardware, but this is one area where the tier 1 server vendors do the best job. In particular, some like Dell and HP can integrate their hardware monitoring into the more common general monitoring solutions (like Microsoft Operations Manager) which gives you fewer pieces of infrastructure software to maintain.

Firewall Monitoring

Most firewalls are based on a UNIX derivative, very commonly Linux. There are several reasons for this, but the most salient typically are that you want something you can strip down to the bare minimum necessary to do the job and you don’t need or even want a user interface. This should be a dedicated appliance, and you don’t want to have hard disks in it either since they are a major point of failure and there just shouldn’t be a need. Additionally, if you’re an all-Windows shop there is value in having a small bit of heterogeneity in your environment: If your firewall is Linux and your web servers are Windows, it’s extraordinarily unlikely that a particular software defect exploit can work at both layers.

Metrics

Different firewalls support different detailed events; however, if your firewall supports SNMP then you can probably combine its metrics with the server and network metrics. If your firewall doesn’t support SNMP, you’ll want to have that on your feature list for the next one. There’s high value in having all of the basic infrastructure metrics in one place.

  • For each network interface:
    • bps in and out: This is often captured as bytes transferred, and the collector then has to work that back into rates.
    • Interface speed: The current connection speed of the interface
  • Processor
    • Utilization: Total processor utilization. On a UNIX system, substitute Load (which works entirely differently). Under the covers, your firewall probably runs a variant of UNIX.

Most firewalls will also have counters available for key firewall-specific security metrics such as connections and connection denies. However, for the purposes of a dashboard it’s generally easier to drop into the firewall’s specific administrative tool to review what’s going on. Again, our purpose here is to create a dashboard with information that has the most value when looked at over time and is used to help isolate problems to specific nodes.

Note: If monitoring interfaces that can run at gigabit or greater speeds, you will want 64-bit counters under SNMP v2 or better to prevent counter overflow creating erratic, irrational readings.

Event Monitoring

Most firewalls are based on UNIX (Linux in particular) so they tend to use the conventional UNIX logging facility: Syslog. If your firewall vendor doesn’t provide a dedicated logging collector and it supports syslog, purchase a syslog server package and install it on one of your servers. You should have at least one server (physical or virtual) that you have set aside for IT administrative purposes like this.

At a minimum, you want to collect a log message for every socket attempt that is denied by the firewall. This is very useful in diagnosing odd problems that don’t seem to have other explanations. I don’t recommend collecting each valid socket attempt because of the volume of information that represents.

Even if your firewall supports it and you have the capability to do so, I don’t recommend using SNMP for collecting these events. The volume can be very high at times as Internet worms and the like attempt to seek a hole in your firewall.

Network Switch Monitoring

There are several situations where you will want to be able to collect information directly from your network switches. Not every switch in your environment needs to support SNMP for collection; just the switch ports that are handling switch-to-switch traffic and ports where you have a server or network appliance that you can’t otherwise monitor. Depending on your switch hardware you will want to send these events to a Syslog server (if you have one) or as SNMP traps to an SNMP monitor. I don’t particularly recommend the latter because some events (like physical layer events) can get voluminous if you have switches that serve desktops.

Metrics

You want to be able to gather metrics on at least one side of each Inter-switch link to be able to troubleshoot capacity issues between switches and you want to be able to gather metrics on each shared device that you haven’t already covered directly via SNMP. Remember that when gathering statistics they will be “reversed” from the perspective of the switch compared to the devices: What is OUT from the device will be IN to the switch and vice versa. When monitoring a server or appliance at the switch side, I recommend labeling it as the device instead of the switch and reversing the direction labels so it is consistent with the rest of your devices.
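The relabeling described above amounts to swapping the two directions and naming the series after the device. A minimal Python sketch, with a made-up dictionary layout and function name purely for illustration:

```python
def as_device_view(device_name, switch_in_bps, switch_out_bps):
    """Present switch-port readings from the attached device's perspective.

    What the switch sees coming IN was sent OUT by the device, and vice versa.
    """
    return {
        "label": device_name,
        "in_bps": switch_out_bps,   # switch OUT -> device IN
        "out_bps": switch_in_bps,   # switch IN  -> device OUT
    }
```

With this in place, a backup appliance monitored at its switch port reads the same way on the dashboard as a server monitored directly.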

You want to capture:

  • For each monitored network interface:
    • bps in and out: This is often captured as bytes transferred, and the collector then has to work that back into rates.
    • Interface speed: The current connection speed of the interface

Note: If monitoring interfaces that can run at gigabit or greater speeds, you will want 64-bit counters under SNMP v2 or better to prevent counter overflow creating erratic, irrational readings.

Event Monitoring

When monitoring switches, I’ve found the most important events to capture are:

  • Physical layer connect/disconnect: This will often highlight flaky cables and drivers, and situations where the switch and the server are auto-negotiating a port speed and failing. You did set your servers from auto-negotiate to manual for each port, right?
  • Spanning Tree: Many problems in switches, particularly if you have a number of small switches interconnected, come down to spanning tree kicking in at unexpected times. If you can capture these events, it can help you correlate problems back to them.

Power Monitoring

If you are using an APC UPS or other similar device, get the SNMP network interface card. With this you can generally capture events and metrics back into your monitoring system for power events.

Metrics

You want to capture:

  • Line Voltage In: The input voltage to the UPS
  • Line Voltage Out: The voltage being fed to your servers. If this starts fluctuating, you have a problem with the UPS.
  • Amps In: How many Amps of current the UPS is taking. If not available, look for a Watts or VA (Volt Amps) counter.
  • Amps Out: How many Amps of current the UPS is delivering to its load. If not available, look for a Watts or VA (Volt Amps) counter.
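If your UPS only exposes a VA or Watts counter, you can work back to Amps yourself. The Python sketch below uses the standard relationships (VA = Volts × Amps, and Watts = Volts × Amps × power factor); the function names are mine, and the power factor is something you have to supply for your own load rather than a value the UPS will necessarily report.

```python
def amps_from_va(volt_amps, volts):
    """Current in Amps from apparent power (VA) and voltage."""
    return volt_amps / volts


def amps_from_watts(watts, volts, power_factor):
    """Current in Amps from real power (Watts), voltage, and load power factor."""
    return watts / (volts * power_factor)
```

For example, 1200 VA at 120 V is 10 Amps, and so is 1080 Watts at 120 V with a power factor of 0.9.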

If available, also capture:

  • Runtime available or Battery Capacity: Useful if you have power events to see how quickly your batteries are draining under a real load
  • Battery temperature: A UPS can experience high temperature swings when in use, and temperature is a killer of the lead acid batteries they typically use.

Event Monitoring

You want to monitor the following events into either Syslog or your SNMP monitor. The volume should be very low, so the SNMP monitor system is likely a better choice. You should also configure the UPS to email you when these events occur, particularly if you don’t have that set up for your SNMP monitor.

  • Line Undervoltage: This captures a power outage (voltage goes to zero) and undervoltage due to line sags (typically too much load on the utility feed)
  • On Battery/Off Battery: Each time it transitions to and from a battery for any reason. This may or may not be due to a utility problem.

How do we do it?

We follow our own advice - both at my previous startups and at eSymmetrix.  Initially we set up MRTG to do our basic monitoring, but it was difficult to keep operating effectively, particularly with Windows, which had a habit of changing the SNMP IDs of network interfaces.  After working with Microsoft Operations Manager, we found it was just too slow at displaying useful metric information.  We eventually found PRTG from a German company (www.paessler.com) and we’ve used it ourselves and recommended it to our clients.  It’s pretty cheap, and most importantly includes an SNMP helper for Windows that gets around a range of issues we had with MRTG.  If you’re willing to trade a little money for a good savings in time, it’s a great tool. 

Have another tool that’s worked well for you?  Some other metrics that you think are must haves?  Drop me a line or leave a comment to let us know.


Posted in Monitoring | No Comments »