Key Infrastructure Information to Capture
Written by Kendall Miller on February 21, 2008 – 11:28 pm
This article is a background reference for the important things to monitor in a small to mid-sized IT infrastructure. This information is largely independent of the tool or technology you use to capture it.
Server Monitoring
While this article specifically talks about what’s typical for a Windows environment, Linux and other variants of UNIX will have equivalent metrics that are generally useful as well. Note that application monitoring is distinct from server monitoring; each application will tend to have its own strengths and weaknesses for monitoring and should be considered separately. In this section, we’re looking at the operating system and hardware.
Metrics
- For each network interface (note: if you can capture this at the network switch, you don’t need to do it here):
- bps in and out: This is often captured as a count of bytes transferred, which the collector then has to convert into rates.
- Interface speed: The current connection speed of the interface
- For each disk volume
- Free bytes: The number of bytes currently available.
- Queue Length: The number of pending IO requests.
- Memory
- Free bytes: The number of bytes currently available.
- Total bytes: The memory capacity in the system.
- Processor
- Utilization: Total processor utilization. On a UNIX system, substitute Load (which works entirely differently).
You can capture more metrics than this if you want, and if your capturing system can handle it without stress, go wild. That said, put the metrics above on a dashboard where you can easily see them: the goal of the dashboard is to give you a quick sense of overall good and bad, and a benchmark for comparison when there are problems.
If the server is connected to a SAN, consider each Fibre Channel interface to be a network interface.
Note: If you are monitoring interfaces that can run at gigabit or greater speeds, you will want 64-bit counters under SNMP v2 or better to prevent counter overflow from creating erratic, irrational readings.
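The step from byte counters to rates, and the reason for the 64-bit note above, can be sketched in a few lines of Python. This is a minimal illustration, not any particular collector’s implementation: the function names are made up for this example, and it assumes the counter wrapped at most once between two samples.

```python
def bits_per_second(prev_count, curr_count, seconds, counter_bits=64):
    """Convert two readings of a byte counter into an average bit rate.

    Assumes the counter wrapped at most once between the two readings.
    """
    if seconds <= 0:
        raise ValueError("samples must be taken at increasing times")
    modulus = 2 ** counter_bits
    # Modular subtraction recovers the true delta across a single wrap.
    delta_bytes = (curr_count - prev_count) % modulus
    return delta_bytes * 8 / seconds


def seconds_to_wrap(link_bps, counter_bits=32):
    """Time for a byte counter to wrap at a sustained line rate."""
    return (2 ** counter_bits) * 8 / link_bps


if __name__ == "__main__":
    # 1000 bytes in 8 seconds is 1000 bits per second.
    print(bits_per_second(1000, 2000, 8))
    # How long a 32-bit byte counter lasts at a sustained 1 Gbps.
    print(round(seconds_to_wrap(1_000_000_000), 1))
```

At a sustained 1 Gbps a 32-bit byte counter wraps roughly every 34 seconds, so a collector that polls once a minute can miss a wrap entirely. That is where the erratic readings come from.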
Event Monitoring
The most important thing to be able to do is capture hardware events, particularly from redundant hardware like RAID controllers, redundant power supplies, etc. If it’s redundant, it has to be monitored so you will know when it fails. Virtually all vendors will provide a mechanism for monitoring their hardware, but this is one area where the tier 1 server vendors do the best job. In particular, some like Dell and HP can integrate their hardware monitoring into the more common general monitoring solutions (like Microsoft Operations Manager) which gives you fewer pieces of infrastructure software to maintain.
Firewall Monitoring
Most firewalls are based on a UNIX derivative, very commonly Linux. There are several reasons for this, but the most salient typically are that you want something you can strip down to the bare minimum necessary to do the job and you don’t need or even want a user interface. This should be a dedicated appliance, and you don’t want to have hard disks in it either since they are a major point of failure and there just shouldn’t be a need. Additionally, if you’re an all-Windows shop there is value in having a small bit of heterogeneity in your environment: If your firewall is Linux and your web servers are Windows, it’s extraordinarily unlikely that a particular software defect exploit can work at both layers.
Metrics
Different firewalls support different detailed events; however, if your firewall supports SNMP then you can probably combine its metrics with the server and network metrics. If your firewall doesn’t support SNMP, put that on your feature list for the next one. There’s high value in having all of the basic infrastructure metrics in one place.
- For each network interface:
- bps in and out: This is often captured as a count of bytes transferred, which the collector then has to convert into rates.
- Interface speed: The current connection speed of the interface
- Processor
- Utilization: Total processor utilization. On a UNIX system, substitute Load (which works entirely differently). Under the covers, your firewall probably runs a variant of UNIX.
Most firewalls will also have counters available for key firewall-specific security metrics such as connections and connection denies; however, for the purposes of a dashboard it’s generally easier to drop into the firewall’s own administrative tool to review what’s going on. Again, our purpose here is to create a dashboard with information that has the most value when looked at over time and is used to help isolate problems to specific nodes.
Note: If you are monitoring interfaces that can run at gigabit or greater speeds, you will want 64-bit counters under SNMP v2 or better to prevent counter overflow from creating erratic, irrational readings.
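On the Processor item above: Load is not a percentage. It is roughly the number of tasks running or waiting to run, so it only means something relative to the number of cores. A small Python sketch of that normalization follows. The helper name is hypothetical, and on a real firewall you would more likely read the load value over SNMP than run Python on the box.

```python
import os


def load_per_core(load_average, cores):
    """Scale a UNIX load average by core count.

    A result near 1.0 means the processors are roughly fully busy;
    well above 1.0 means work is queuing.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_average / cores


if __name__ == "__main__" and hasattr(os, "getloadavg"):
    # os.getloadavg() is only available on UNIX-like systems.
    try:
        one_min, five_min, fifteen_min = os.getloadavg()
        print(load_per_core(five_min, os.cpu_count() or 1))
    except OSError:
        print("load average not available on this system")
```

A load of 4.0 is a busy four-core machine but a badly overloaded single-core appliance, which is why the raw number is hard to compare across devices on one dashboard.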
Event Monitoring
Most firewalls are based on UNIX (Linux in particular) so they tend to use the conventional UNIX logging facility: Syslog. If your firewall vendor doesn’t provide a dedicated logging collector and it supports syslog, purchase a syslog server package and install it on one of your servers. You should have at least one server (physical or virtual) that you have set aside for IT administrative purposes like this.
At a minimum, you want to collect a log message for every socket attempt that is denied by the firewall. This is very useful in diagnosing odd problems that don’t seem to have other explanations. I don’t recommend collecting each valid socket attempt because of the volume of information that represents.
Even if your firewall supports it and you have the capability to do so, I don’t recommend using SNMP for collecting these events. The volume can be very high at times as Internet worms and the like attempt to seek a hole in your firewall.
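If you end up filtering the collected messages yourself, the one part of a syslog line that is standardized is the leading priority value: facility times 8 plus severity. Here is a rough Python sketch of decoding it and keeping only denied attempts. The sample message text and the "DENY" keyword are assumptions for illustration; every firewall words its log lines differently.

```python
import re

# A syslog line starts with <PRI>, where PRI = facility * 8 + severity.
PRI_PATTERN = re.compile(r"^<(\d{1,3})>(.*)$", re.DOTALL)


def split_priority(raw_line):
    """Return (facility, severity, rest_of_message) for one syslog line."""
    match = PRI_PATTERN.match(raw_line)
    if match is None:
        raise ValueError("missing syslog <PRI> header")
    facility, severity = divmod(int(match.group(1)), 8)
    return facility, severity, match.group(2)


def is_deny(message, keyword="DENY"):
    """Hypothetical filter: the keyword depends entirely on your firewall."""
    return keyword.upper() in message.upper()


if __name__ == "__main__":
    # Made-up example line using documentation-reserved IP addresses.
    line = "<134>fw01 DENY tcp 203.0.113.9:4431 -> 192.0.2.10:445"
    facility, severity, message = split_priority(line)
    print(facility, severity, is_deny(message))
```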
Network Switch Monitoring
There are several situations where you will want to be able to collect information directly from your network switches. Not every switch in your environment needs to support SNMP for collection; just the switches whose ports handle switch-to-switch traffic, or that connect a server or network appliance you can’t otherwise monitor. Depending on your switch hardware, you will want to send switch events to a Syslog server (if you have one) or as SNMP traps to an SNMP monitor. I don’t particularly recommend the latter because some events (like physical layer events) can get voluminous if you have switches that serve desktops.
Metrics
You want to be able to gather metrics on at least one side of each inter-switch link so you can troubleshoot capacity issues between switches, and on each shared device that you haven’t already covered directly via SNMP. Remember that these statistics will be “reversed” from the perspective of the switch compared to the device: what is OUT from the device will be IN to the switch, and vice versa. When monitoring a server or appliance at the switch side, I recommend labeling it as the device instead of the switch and reversing the direction labels so it is consistent with the rest of your devices.
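The label reversal is simple, but it is easy to get backwards, so here is a small Python sketch of it. The PortSample type and its field names are invented for this example and don’t correspond to any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortSample:
    """One reading for a port; field names are illustrative only."""
    label: str
    bps_in: float
    bps_out: float


def as_device_view(switch_sample, device_label):
    """Relabel a switch-port reading from the attached device's side.

    Traffic the switch sent OUT of the port is what the device took IN,
    and vice versa.
    """
    return PortSample(
        label=device_label,
        bps_in=switch_sample.bps_out,
        bps_out=switch_sample.bps_in,
    )


if __name__ == "__main__":
    port = PortSample(label="switch1 port 12", bps_in=2_000_000.0, bps_out=500_000.0)
    print(as_device_view(port, "backup appliance"))
```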
You want to capture:
- For each monitored network interface:
- bps in and out: This is often captured as a count of bytes transferred, which the collector then has to convert into rates.
- Interface speed: The current connection speed of the interface
Note: If you are monitoring interfaces that can run at gigabit or greater speeds, you will want 64-bit counters under SNMP v2 or better to prevent counter overflow from creating erratic, irrational readings.
Event Monitoring
When monitoring switches, I’ve found the most important events to capture are:
- Physical layer connect/disconnect: This will often highlight flaky cables and drivers, and situations where the switch and the server are auto-negotiating a port speed and failing. You did set your servers from auto-negotiate to manual for each port, right?
- Spanning Tree: Many problems in switches, particularly if you have a number of small switches interconnected, come down to spanning tree kicking in at unexpected times. If you can capture these events, it can help you correlate problems back to them.
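Once connect/disconnect events are being collected, one way to spot a flaky cable or a failing auto-negotiation is to count how often each port changes state in a short window. The Python sketch below assumes events have already been parsed into (timestamp, port, state) tuples; that shape, the function name, and the thresholds are all assumptions for illustration.

```python
from collections import Counter


def flapping_ports(events, window_seconds, min_transitions, now):
    """Return ports with at least min_transitions link events in the window.

    events: iterable of (timestamp_seconds, port_name, state) tuples,
    where state is a label such as "up" or "down".
    """
    window_start = now - window_seconds
    recent = Counter(
        port for timestamp, port, _state in events
        if window_start <= timestamp <= now
    )
    return sorted(port for port, count in recent.items() if count >= min_transitions)


if __name__ == "__main__":
    sample = [
        (100, "port 3", "down"), (104, "port 3", "up"),
        (130, "port 3", "down"), (133, "port 3", "up"),
        (140, "port 7", "down"),
    ]
    # Ports with four or more link events in the last five minutes.
    print(flapping_ports(sample, window_seconds=300, min_transitions=4, now=200))
```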
Power Monitoring
If you are using an APC UPS or other similar device, get the SNMP network interface card. With this you can generally capture events and metrics back into your monitoring system for power events.
Metrics
You want to capture:
- Line Voltage In: The input voltage to the UPS
- Line Voltage Out: The voltage being fed to your servers. If this starts fluctuating, you have a problem with the UPS.
- Amps In: How many Amps of current the UPS is taking. If not available, look for a Watts or VA (Volt-Amps) counter.
- Amps Out: How many Amps of current the UPS is delivering to your servers. If not available, look for a Watts or VA (Volt-Amps) counter.
If available, also capture:
- Runtime available or Battery Capacity: Useful during power events to see how quickly your batteries drain under a real load.
- Battery temperature: A UPS can experience large temperature swings when in use, and heat is a killer of the lead-acid batteries they typically use.
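If your UPS only exposes a VA or Watts counter, you can work back to Amps so the numbers line up with the rest of the list. The Python sketch below uses the standard relationships: VA is volts times amps, and Watts is VA times the power factor. The function names are made up for this example, and the power factor is something you have to supply or estimate because not every UPS reports it.

```python
def amps_from_va(volt_amps, volts):
    """Current implied by an apparent-power (VA) reading."""
    if volts <= 0:
        raise ValueError("volts must be positive")
    return volt_amps / volts


def amps_from_watts(watts, volts, power_factor):
    """Current implied by a real-power (Watts) reading.

    power_factor is between 0 and 1 and is an assumption unless
    your UPS reports it.
    """
    if volts <= 0:
        raise ValueError("volts must be positive")
    if not 0 < power_factor <= 1:
        raise ValueError("power_factor must be in (0, 1]")
    return watts / (volts * power_factor)


if __name__ == "__main__":
    print(amps_from_va(1200, 120))           # 1200 VA at 120 V
    print(amps_from_watts(900, 120, 0.75))   # 900 W at 120 V, PF 0.75
```

Both example calls describe the same 10 Amp load, which is a handy sanity check when a device reports both counters.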
Event Monitoring
You want to send the following events to either Syslog or your SNMP monitor. The volume should be very low, so the SNMP monitor system is likely a better choice. You should also configure the UPS to email you when these events occur, particularly if you don’t have that set up for your SNMP monitor.
- Line Undervoltage: This captures a power outage (voltage goes to zero) and undervoltage due to line sags (typically too much load on the utility feed).
- On Battery/Off Battery: Each time the UPS transitions to or from battery power for any reason. This may or may not be due to a utility problem.
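With the On Battery/Off Battery events collected, pairing them up tells you how long each episode lasted, which is usually more useful than the raw events. A minimal Python sketch follows, assuming the events arrive in time order as (timestamp, name) tuples. The event names used here are placeholders, not what any specific UPS emits.

```python
def battery_intervals(events):
    """Pair on/off battery events into (start_time, duration_seconds) tuples.

    events: time-ordered (timestamp_seconds, name) tuples where name is
    "on_battery" or "off_battery". Unmatched events are ignored.
    """
    intervals = []
    started = None
    for timestamp, name in events:
        if name == "on_battery" and started is None:
            started = timestamp
        elif name == "off_battery" and started is not None:
            intervals.append((started, timestamp - started))
            started = None
    return intervals


if __name__ == "__main__":
    log = [
        (100, "on_battery"), (160, "off_battery"),
        (500, "on_battery"), (505, "off_battery"),
    ]
    print(battery_intervals(log))
```

Lots of very short intervals tend to point at sags or a sensitive transfer threshold rather than real outages.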
How do we do it?
We follow our own advice, both at my previous startups and at eSymmetrix. Initially we set up MRTG to do our basic monitoring, but it was difficult to keep operating effectively, particularly with Windows, which had a habit of changing the SNMP IDs of network interfaces. We also worked with Microsoft Operations Manager, but it was just too slow at displaying useful metric information. We eventually found PRTG from a German company (www.paessler.com) and we’ve used it ourselves and recommended it to our clients. It’s pretty cheap, and most importantly includes an SNMP helper for Windows that gets around a range of issues we had with MRTG. If you’re willing to trade a little money for a good savings in time, it’s a great tool.
Have another tool that’s worked well for you? Some other metrics that you think are must-haves? Drop me a line or leave a comment to let us know.
Tags: Metrics, Monitoring, PRTG, SNMP, Syslog