How do you know? IT Monitoring for small & medium businesses.
Written by Kendall Miller on February 23, 2008 – 1:00 am
Sit back for a minute and ask yourself this question: How do you know?
- How do you know that your users are able to get to the services you provide, right now?
- How do you know that all of the hardware you’re responsible for is working, right now?
- How do you know what you should be working on right now? Project work, or event-driven work? (Substitute in trouble ticket, help desk ticket, whatever for event-driven work)
The last question belongs to the family of questions about how to balance workload, which is a topic for another article. The first two questions are ones that you should be able to answer and, just as important, ones that are hard to explain after the fact if you can’t. The key to these questions is being in the know - having mechanisms to ensure you know what’s working without requiring your active involvement. You need to have a comprehensive monitoring strategy, and most likely you can’t spend a great deal to get it done. The good news is that at a moderate scale (say up to 200 monitored devices) you shouldn’t have to.
When deciding how to monitor, we start with the questions we want to be sure we can answer. You want to go through this exercise to avoid being swept up in cool visuals, dashboards, and other golly gee whiz stuff that most smart monitoring vendors put in their systems. It’s not that these things are bad - far from it - but they don’t change the fact that you need to be sure your monitoring answers the essential questions. It is very unlikely that you’ll find a tool that answers all of the questions you have. This isn’t inherently a problem, but you are going to want to minimize the number of tools you have to work with because each has an operational cost.
The Essential Questions
In order, you want to be sure you know:
- Is everything working right now, as far as the users are concerned? There is a distance between knowing that a server is running and knowing that your users can access the services hosted on that server.
- Is anything about to go wrong that will cause an interruption in service? It could be a server about to run out of disk space, a non-redundant drive that’s reporting soft errors, etc.
- Are we using our resources effectively? You can’t count on users to report occasional glitches or when things are just slow. Can you balance resources or shift load to provide better performance with what you already have?
Ideally, you want to set up a system (which may be a collection of different pieces of software, all working together) to make sure you know the answers to these three questions without relying on the active participation of you or your team. To answer these questions you’ll need a combination of event monitoring (for the first question and part of the second) and metrics for part of the second and the third.
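As a rough illustration of how a metric can help answer the second question, the sketch below fits a straight line to a series of free disk space samples and estimates how long until the disk fills. The function name, the sample values, and the assumption that usage grows roughly linearly are all mine for illustration; real disk usage is often burstier than this.

```python
from typing import Optional, Sequence


def hours_until_full(free_gb: Sequence[float],
                     hours_between_samples: float = 1.0) -> Optional[float]:
    """Estimate hours until free space hits zero, using a least-squares line.

    Returns None if there are fewer than two samples or if free space
    is not shrinking (nothing to forecast).
    """
    n = len(free_gb)
    if n < 2:
        return None
    xs = [i * hours_between_samples for i in range(n)]
    mean_x = sum(xs) / n
    mean_y = sum(free_gb) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, free_gb))
    slope = cov_xy / var_x  # GB gained (or lost, if negative) per hour
    if slope >= 0:
        return None
    intercept = mean_y - slope * mean_x
    zero_at = -intercept / slope  # hour at which the fitted line reaches zero
    # Report time remaining measured from the most recent sample.
    return max(0.0, zero_at - xs[-1])


if __name__ == "__main__":
    samples = [100.0, 90.0, 80.0, 70.0]  # hypothetical hourly readings
    print(hours_until_full(samples))
```

A forecast like this is a natural source for a notification ("disk fills in under a week") well before it becomes an alert.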
There is a significant sticking point to the first item above - are your services working in the eyes of your users? If a server is running along fine, but disappeared from DNS so no one can find it, it’s down. The service being able to respond is necessary, but not sufficient: users will not give you credit because the problem was somewhere else; they really just care about outcomes. If they need to access a service and it doesn’t work when they try to access it, it’s down. This means when you’re looking at monitoring you want to think of how you’re going to cover the distance from where the users are all the way back to the servers that ultimately host the data.
If you’re a small business, some of this you might get for free: you’re dependent on the same set of services your users are, so you’re interactively running the same basic set of validations they are. As your business scales up, you will need to think progressively more about how you verify service delivery from the standpoint of the end users.
The most common way of doing this is by setting up probes of some type - software that acts like a user from the point of presence where users are and does a basic test of availability. This could be as simple as a ping from across the Internet (or, hopefully, something more substantial like getting a page and comparing it against a reference) or reading a file off a network share. If you can set up a probe to go from where your users are to your servers, then you can answer question #1 by saying that if your probes show things work, you’re good. It isn’t 100%, but in most small and mid-sized shops it’s close enough.
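To make the idea of a probe concrete, here is a minimal sketch in Python of the "get a page and compare it against a reference" check. The URL and reference text are hypothetical, and the `probe` and `fetch_page` names are mine; a real probe would run on a schedule from a machine located where your users are.

```python
import urllib.request
from typing import Callable


def fetch_page(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL and return its body as text; raises on network or HTTP errors."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def probe(url: str, reference_text: str,
          fetch: Callable[[str], str] = fetch_page) -> bool:
    """Return True only if the page loads AND contains the reference text."""
    try:
        body = fetch(url)
    except Exception:
        # DNS failure, timeout, refused connection, HTTP error:
        # from the user's point of view these all mean "down".
        return False
    return reference_text in body


if __name__ == "__main__":
    # Hypothetical intranet page and a phrase that should always appear on it.
    ok = probe("http://intranet.example.com/", "Welcome")
    print("service up" if ok else "service DOWN")
```

Checking for known content rather than just a successful connection catches the case where the web server answers but is serving an error page. The `fetch` parameter is there only so the check can be exercised without a network.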
A working approach - Alerts, Notification, and Diagnostics
When laying out your monitoring strategy, think about which alerts, notifications, and diagnostics you need to be sure you can answer the essential questions.
- Alerts: Also known as alarms, Alerts are designed to inform you whenever a business critical service isn’t working or will imminently fail. Alerts should go to your on-call staff, 24×7. If it isn’t something you’d resolve outside of business hours, it isn’t alert-worthy.
- Notification: Like an alert, your monitoring system should reach out and inform you about these events, but either the information is less severe or it isn’t a business critical service. Notifications generally don’t go to your on-call staff but instead to a regular queue to be reviewed during business hours.
- Diagnostics: Diagnostic monitoring helps you resolve problems quickly, avoid them if possible, and provide business optimization. This gets to question #3 on our list and can help with question #2.
One problem with most tools is that achieving a useful alert configuration is very difficult. They either generate alerts at the drop of a hat or don’t notify you of the most important things. A main reason for this challenge is that most don’t monitor your environment from the standpoint of your services. They instead look for events at the OS level and presume to know what they indicate at the service level. This is considerably simpler for the product to do because it doesn’t require any particular information about your business or environment, but it doesn’t give you a user’s view of your services.
In the end, with rare exceptions, alerting based on generic operating system information doesn’t work well: the signal-to-noise ratio is generally not good enough. Instead, for the best quality alerts, focus on service-based probing. The goal of alerts is to trigger your on-call staff to investigate and resolve an issue. They don’t need to be perfect in what they tell you; the goal isn’t to have the alert provide a detailed diagnostic, but rather to get a person engaged when and only when it is necessary. You should make every effort to ensure that alerts actually get delivered. Ideally, you want them to depend on the least amount of infrastructure to work. If possible, avoid routing alerts through your own mail relay, so that a local email outage doesn’t prevent you from receiving them. For example, you may want to get an external email account that notifies the on-call cell phone/blackberry/whatever and also sends a notification back to your internal email system for archival purposes.
The same constraint doesn’t apply to notifications. Since these are not expected to be handled outside of normal hours, they don’t need to be resilient to email and other infrastructure failures. After all - your alert monitoring will tell you if your infrastructure services fail. Notification can be accomplished through a simple email distribution list or the like. The most important part of a notification mechanism is that it reaches out and gets your attention without any users having to take action on their own.
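The split between alerts, notifications, and diagnostics can be written down as a small routing rule. The sketch below is one possible reading of the definitions above, not a feature of any particular product; the `Event` fields and the example service names are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    ALERT = "page on-call staff, 24x7"
    NOTIFICATION = "queue for review during business hours"
    DIAGNOSTIC = "record only, for later analysis"


@dataclass
class Event:
    service: str
    business_critical: bool    # would you fix it outside business hours?
    failing_or_imminent: bool  # down now, or about to fail
    needs_review: bool = False # less severe, but someone should look at it


def route(event: Event) -> Route:
    """Decide where an event goes, following the three categories above."""
    if event.business_critical and event.failing_or_imminent:
        return Route.ALERT
    if event.failing_or_imminent or event.needs_review:
        return Route.NOTIFICATION
    return Route.DIAGNOSTIC


if __name__ == "__main__":
    # Hypothetical events.
    print(route(Event("order-entry web site", True, True)))
    print(route(Event("test lab file share", False, True)))
    print(route(Event("nightly CPU sample", False, False)))
```

The useful discipline is in the first rule: if nobody would get out of bed for it, it should never reach the pager.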
For diagnostic monitoring you want to be able to capture and preserve a record of important events and metrics in your environment. For a discussion of recommended events and metrics, see Key Infrastructure Information to Capture. Of particular note, graphical metrics are great at helping diagnose problems that involve multiple systems, memory leaks, and capacity. For example, if you are tracking the free memory of each server then you can check if a particular problem corresponds to a time when the server had very little free memory. If the available memory on your servers forms a saw-tooth pattern, with steady depletion followed by a spike back to normal, you probably have a memory leak that will cause a range of issues, most of which will look like something else.
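The saw-tooth pattern is usually obvious on a graph, but it can also be spotted in the raw numbers. The following is a deliberately simple sketch: it counts stretches of declining free memory that end in a sharp recovery. The function name, the threshold, and the sample data are illustrative assumptions, and noisy real-world data would need smoothing first.

```python
from typing import Sequence


def count_sawtooth_cycles(free_mb: Sequence[float],
                          recovery_fraction: float = 0.5) -> int:
    """Count cycles of gradual depletion followed by a sharp recovery.

    A "sharp recovery" is a single-step rise of at least recovery_fraction
    of the overall range of the samples. It only counts as a cycle if at
    least two declining steps came before it.
    """
    if len(free_mb) < 3:
        return 0
    span = max(free_mb) - min(free_mb)
    if span <= 0:
        return 0
    cycles = 0
    declining_steps = 0
    for previous, current in zip(free_mb, free_mb[1:]):
        change = current - previous
        if change >= recovery_fraction * span:
            if declining_steps >= 2:
                cycles += 1
            declining_steps = 0
        elif change < 0:
            declining_steps += 1
    return cycles


if __name__ == "__main__":
    # Hypothetical free-memory readings (MB): leak, restart, leak, restart.
    readings = [1000, 800, 600, 400, 1000, 800, 600, 400, 1000]
    print(count_sawtooth_cycles(readings))
```

Two or more cycles on the same server is a reasonable prompt to go and look at the graph.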
Monitoring Products
There are many, many products out there. If you’re interested in what I would recommend for a specific situation, please drop me a line. I’ve used a few products that have ranged from free to not very expensive to very expensive. If you want the best results, you are likely to spend some money - either writing some glue yourself or in purchasing a product or two. There is such a tremendous ROI on this that you shouldn’t be afraid to spend a little money even if your company has no history of purchasing IT tools. This is a good place to start a new tradition.
Event Monitoring
If you’ve got a homogeneous hardware environment from a major player (like Dell, IBM, or HP), each of them offers a vendor-specific monitoring solution that will do a credible job of capturing events. In my experience, these products are not great at metrics. For the Windows environment, I’ve had better experience using Microsoft Operations Manager (now System Center Operations Manager 2007, because it needed a longer title to get better) together with the vendor-specific management pack for hardware events. On the downside, whatever monitoring solution you pick, you should invest in de-linting it: hunting down and resolving each issue to keep the list of open items clean, ensuring you’ll react to the important items. In fairness, my team at a prior company found that this took so much time with MOM 2005 that the console was only worth looking at when they already believed there was a problem. That’s not a ringing endorsement. On the bright side, each management pack includes a lot of built-in knowledge from the developers that designed each product, and that knowledge can save you a great deal of time.
Capturing Metrics
If you don’t want to spend anything but have some time on your hands, you can do everything listed above with MRTG. On the downside, it takes time and patience to set up, so for commercial environments I recommend other options as more cost effective. We use PRTG, which is extremely cost effective and works particularly well in a Windows environment. But if you want to get it done and just can’t get anyone to fund a software purchase, MRTG can get to anything exposed via SNMP.
Using MRTG
If you want to use MRTG, you’ll set it up to collect the sensor data from all the items you want to monitor. It outputs graphs and basic web pages that summarize them. You’ll then want to create a few dashboard pages to serve as your overall summary. There are some helpful tools for creating your MRTG configuration file. Once you’ve done this and had your first experience of Windows changing the exact SNMP target of a network interface because the driver was reinstalled, you’ll be looking for another option - like PRTG, which includes an SNMP helper class.
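If you do go the MRTG route and script your own configuration, one way to soften the changing-interface problem is to reference the interface by its IP address rather than by its index number. The sketch below generates a target stanza using the leading-slash "interface by IP" form as I understand it from the MRTG reference documentation; check it against your MRTG version before relying on it. The host name, community string, and addresses are hypothetical, and `mrtg_target` is my own helper, not part of MRTG.

```python
def mrtg_target(name: str, host: str, community: str, interface_ip: str,
                max_bytes_per_sec: int, title: str) -> str:
    """Render one MRTG target stanza that selects the interface by IP address.

    Assumption: MRTG accepts "/<ip>:<community>@<host>" to pick an interface
    by the IP bound to it, instead of a numeric interface index.
    """
    lines = [
        f"Target[{name}]: /{interface_ip}:{community}@{host}",
        f"MaxBytes[{name}]: {max_bytes_per_sec}",  # bytes per second
        f"Title[{name}]: {title}",
        f"PageTop[{name}]: <h1>{title}</h1>",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical Windows file server with a 100 Mbit/s interface.
    print(mrtg_target(
        name="fileserver1",
        host="fileserver1.example.com",
        community="public",
        interface_ip="192.168.1.10",
        max_bytes_per_sec=12_500_000,
        title="fileserver1 network traffic",
    ))
```

Generating stanzas from a short list of servers also makes it much less painful to rebuild the configuration when something changes.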
Using PRTG
We normally avoid directly mentioning commercial products and we never accept any form of compensation for our references, but this one is very cost effective and does a great job in small and medium sized companies. PRTG from Paessler can offer a wealth of information rolled up in a user interface that’s fast on its feet so you can know at a glance from your iPhone if your network is really down or not, do capacity planning, and a range of other tasks. It lets you easily monitor all of the trends in your environment - free disk on each server, network volume on each interface, stability of your wide area network, firewall stats… whatever you can get at with SNMP.
All of the configuration is done through an easy to use web interface, and it’s pretty light on the server as well. You could get the functional results of PRTG on your own with MRTG and other open source tools… but where’s your time best spent? It’s probably easier to explain a few hundred dollars in software than spending days setting up and testing a monitoring solution, and then wondering if it’s working.
Prepare Now for the Long Run
Installing monitoring now, particularly to capture metrics of your environment, will pay off substantially down the road when you’re trying to understand a problem (by giving you baseline information to know what’s changed and what’s normal). It’s worth enough in time savings each day to be worth making the time - evenings and weekends if necessary - to get it running. You can start small by monitoring a few things and expand as it proves its value. If you don’t have experience with a particular software tool, I definitely recommend evaluating it in your shop for a period of at least 30 days but preferably a few months before laying down a lot of coin. In my experience, every ISV I’ve worked with has been willing to provide up to a 90 day evaluation key for their product to give you time. During the evaluation recognize that you aren’t a paying customer yet, so don’t go crazy with their tech support. Your goal is to identify if the product achieves the goals you identified at the beginning, and many won’t - they may be pretty but too slow to update or interact with for your comfort, or not be reliable under load, or require too much time to maintain. Find out before you part with your cash.
What tools and techniques have you used? How have they worked out? Post your comments or drop me a line to continue the conversation.
Tags: Metrics, Monitoring, MRTG, Operations Manager, PRTG