Archive for the ‘Infrastructure’ Category
Key Infrastructure Information to Capture
Written by Kendall Miller on February 21, 2008 – 11:28 pm
This article is a background reference for the important things to monitor in a small to mid-sized IT infrastructure. This information is largely independent of the tool or technology you use to capture it.
Server Monitoring
While this article specifically talks about what’s typical for a Windows environment, Linux and other variants of UNIX will have equivalent metrics that are generally useful as well. Note that application monitoring is distinct from server monitoring; each application will tend to have its own strengths and weaknesses for monitoring and should be considered separately. In this section, we’re looking at the operating system and hardware.
Metrics
- For each network interface: Note: If you can capture this at the network switch, then you don’t need to do it here.
- bps in and out: This is often captured as bytes transferred, and the collector then has to work that back into rates.
- Interface speed: The current connection speed of the interface
- For each disk volume
- Free bytes: The number of bytes currently available.
- Queue Length: The number of pending IO requests.
- Memory
- Free bytes: The number of bytes currently available.
- Total bytes: The memory capacity in the system.
- Processor
- Utilization: Total processor utilization. On a UNIX system, substitute Load (which works entirely differently)
You can capture more metrics than this if you want; if your capturing system can handle it without stress, go wild. That said, put the metrics above on a dashboard where you can easily see them, because the goal of the dashboard is to give you a quick sense of overall good and bad and a baseline for comparison when there are problems.
If the server is connected to a SAN, consider each Fibre Channel interface to be a network interface.
Note: If monitoring interfaces that can run at gigabit or greater speeds, you will want 64-bit counters under SNMP v2 or better to prevent counter overflow creating erratic, irrational readings.
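As a rough illustration of working byte counters back into rates, here is a minimal Python sketch. It assumes you already have two readings of a cumulative byte counter (for example SNMP's ifInOctets, or its 64-bit counterpart ifHCInOctets) taken a known number of seconds apart. The function names are made up for this example, and how you fetch the readings is up to your collector.

```python
def rate_bps(prev_octets, curr_octets, interval_seconds, counter_bits=64):
    """Convert two readings of a cumulative byte counter into bits per second."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    modulus = 1 << counter_bits
    # Modular subtraction copes with the counter wrapping once between samples.
    delta_octets = (curr_octets - prev_octets) % modulus
    return delta_octets * 8 / interval_seconds


def seconds_until_wrap(link_bps, counter_bits=32):
    """How long a byte counter lasts at a sustained line rate before it wraps."""
    return (1 << counter_bits) * 8 / link_bps


if __name__ == "__main__":
    print(rate_bps(1_000, 2_000, 10))        # 1,000 bytes in 10 seconds
    print(seconds_until_wrap(1_000_000_000))  # 32-bit counter on a saturated gigabit link
```

On a saturated gigabit link a 32-bit byte counter wraps roughly every 34 seconds. If you poll once a minute it can wrap more than once between samples, which no amount of arithmetic in the collector can detect, and that is where the erratic readings come from.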
Event Monitoring
The most important thing to be able to do is capture hardware events, particularly from redundant hardware like RAID controllers, redundant power supplies, etc. If it’s redundant, it has to be monitored so you will know when it fails. Virtually all vendors will provide a mechanism for monitoring their hardware, but this is one area where the tier 1 server vendors do the best job. In particular, some like Dell and HP can integrate their hardware monitoring into the more common general monitoring solutions (like Microsoft Operations Manager) which gives you fewer pieces of infrastructure software to maintain.
Firewall Monitoring
Most firewalls are based on a UNIX derivative, very commonly Linux. There are several reasons for this, but the most salient typically are that you want something you can strip down to the bare minimum necessary to do the job and you don’t need or even want a user interface. This should be a dedicated appliance, and you don’t want to have hard disks in it either since they are a major point of failure and there just shouldn’t be a need. Additionally, if you’re an all-Windows shop there is value in having a small bit of heterogeneity in your environment: If your firewall is Linux and your web servers are Windows, it’s extraordinarily unlikely that a particular software defect exploit can work at both layers.
Metrics
Different firewalls support different detailed events; however, if your firewall supports SNMP then you can probably combine its metrics with your server and network metrics. If your firewall doesn’t support SNMP, you’ll want to have that on your feature list for the next one. There’s high value in having all of the basic infrastructure metrics in one place.
- For each network interface:
- bps in and out: This is often captured as bytes transferred, and the collector then has to work that back into rates.
- Interface speed: The current connection speed of the interface
- Processor
- Utilization: Total processor utilization. On a UNIX system, substitute Load (which works entirely differently). Under the covers, your firewall probably runs a variant of UNIX.
Most firewalls also will have counters available for key firewall-specific security metrics such as connections and connection denies, however for the purposes of a dashboard it’s generally easier to drop into the firewall’s specific administrative tool to review what’s going on. Again, our purpose here is to create a dashboard with information that has the most value when looked at over time and is used to help isolate problems to specific nodes.
Note: If monitoring interfaces that can run at gigabit or greater speeds, you will want 64-bit counters under SNMP v2 or better to prevent counter overflow creating erratic, irrational readings.
Event Monitoring
Most firewalls are based on UNIX (Linux in particular) so they tend to use the conventional UNIX logging facility: Syslog. If your firewall vendor doesn’t provide a dedicated logging collector and it supports syslog, purchase a syslog server package and install it on one of your servers. You should have at least one server (physical or virtual) that you have set aside for IT administrative purposes like this.
At a minimum, you want to collect a log message for every socket attempt that is denied by the firewall. This is very useful in diagnosing odd problems that don’t seem to have other explanations. I don’t recommend collecting each valid socket attempt because of the volume of information that represents.
Even if your firewall supports it and you have the capability to do so, I don’t recommend using SNMP for collecting these events. The volume can be very high at times as Internet worms and the like attempt to seek a hole in your firewall.
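To give a feel for what lands on the syslog server, here is a small Python sketch. Traditional syslog messages start with a priority value in angle brackets that encodes facility times eight plus severity, and that part is standard. Everything after it is vendor specific, so the keyword check in `looks_like_deny` is purely a hypothetical stand-in for whatever wording your firewall actually uses.

```python
import re

# Severity names in numeric order (0 is the most severe), per the syslog convention.
SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]

_PRI = re.compile(r"^<(\d{1,3})>")


def split_priority(line):
    """Return (facility, severity_name, rest_of_message), or None if there is no valid PRI."""
    match = _PRI.match(line)
    if match is None:
        return None
    pri = int(match.group(1))
    if pri > 191:  # facilities run 0-23, so 23 * 8 + 7 is the largest valid value
        return None
    return pri // 8, SEVERITIES[pri % 8], line[match.end():]


def looks_like_deny(message, keywords=("deny", "denied", "drop")):
    """Hypothetical filter: the real wording depends entirely on your firewall."""
    lowered = message.lower()
    return any(word in lowered for word in keywords)


if __name__ == "__main__":
    parsed = split_priority("<134>fw01 denied tcp 10.0.0.5")
    print(parsed, looks_like_deny(parsed[2]))
```

A filter like this is only useful for counting or alerting; keep the raw messages as well, since they are what you will want when diagnosing an odd problem.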
Network Switch Monitoring
There are several situations where you will want to collect information directly from your network switches. Not every switch in your environment needs to support SNMP for collection; you only need it for the switch ports that handle switch-to-switch traffic and for ports where you have a server or network appliance that you can’t otherwise monitor. For switch events, depending on your switch hardware, you will want to send them to a Syslog server (if you have one) or as SNMP traps to an SNMP monitor. I don’t particularly recommend the latter because some events (like physical layer events) can get voluminous if you have switches that serve desktops.
Metrics
You want to be able to gather metrics on at least one side of each Inter-switch link to be able to troubleshoot capacity issues between switches and you want to be able to gather metrics on each shared device that you haven’t already covered directly via SNMP. Remember that when gathering statistics they will be “reversed” from the perspective of the switch compared to the devices: What is OUT from the device will be IN to the switch and vice versa. When monitoring a server or appliance at the switch side, I recommend labeling it as the device instead of the switch and reversing the direction labels so it is consistent with the rest of your devices.
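The relabeling described above is simple enough to sketch in a few lines of Python. The dictionary keys and the `as_device_view` name are assumptions made for this example; your collector will have its own record format.

```python
def as_device_view(switch_port_sample, device_name):
    """Relabel a switch-port sample so it reads from the attached device's side."""
    return {
        "node": device_name,
        # Traffic the switch sends OUT of the port is traffic coming IN to the device.
        "bps_in": switch_port_sample["bps_out"],
        "bps_out": switch_port_sample["bps_in"],
    }


if __name__ == "__main__":
    sample = {"node": "switch1 port 12", "bps_in": 4_000_000, "bps_out": 250_000}
    print(as_device_view(sample, "storage-appliance"))
```

Doing the swap once, at collection time, means every graph on the dashboard reads the same way regardless of where the numbers were gathered.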
You want to capture:
- For each monitored network interface:
- bps in and out: This is often captured as bytes transferred, and the collector then has to work that back into rates.
- Interface speed: The current connection speed of the interface
Note: If monitoring interfaces that can run at gigabit or greater speeds, you will want 64-bit counters under SNMP v2 or better to prevent counter overflow creating erratic, irrational readings.
Event Monitoring
When monitoring switches, I’ve found the most important events to capture are:
- Physical layer connect/disconnect: This will often highlight flaky cables and drivers, and situations where the switch and the server are auto-negotiating a port speed and failing. You did set your servers from auto-negotiate to manual for each port, right?
- Spanning Tree: Many problems in switches, particularly if you have a number of small switches interconnected, come down to spanning tree kicking in at unexpected times. If you can capture these events, it can help you correlate problems back to them.
Power Monitoring
If you are using an APC UPS or other similar device, get the SNMP network interface card. With this you can generally capture events and metrics back into your monitoring system for power events.
Metrics
You want to capture:
- Line Voltage In: The input voltage to the UPS
- Line Voltage Out: The voltage being fed to your servers. If this starts fluctuating, you have a problem with the UPS.
- Amps In: How many amps of current the UPS is drawing. If not available, look for a Watts or VA (volt-amps) counter.
- Amps Out: How many amps of current the UPS is supplying to its load. If not available, look for a Watts or VA (volt-amps) counter.
If available, also capture:
- Runtime available or Battery Capacity: Useful during power events to see how quickly your batteries are draining under a real load.
- Battery temperature: A UPS can experience large temperature swings when in use, and high temperature is a killer of the lead-acid batteries they typically use.
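If your UPS reports amps but not VA (or the other way around), the conversion is simple arithmetic. This Python sketch assumes a single-phase UPS, where apparent power in volt-amps is just volts times amps. The function names and the rated capacity used in the example are made up for illustration.

```python
def apparent_power_va(volts, amps):
    """Apparent power in volt-amps for a single-phase circuit."""
    return volts * amps


def load_fraction(volts_out, amps_out, rated_va):
    """Fraction of the UPS's rated VA capacity currently in use."""
    if rated_va <= 0:
        raise ValueError("rated_va must be positive")
    return apparent_power_va(volts_out, amps_out) / rated_va


if __name__ == "__main__":
    # Hypothetical readings: 120 V out at 10 A on a UPS rated for 3000 VA.
    print(apparent_power_va(120, 10))
    print(load_fraction(120, 10, 3000))
```

Watts will be lower than VA by the power factor of the load, so treat a VA figure as the conservative number when judging how close you are to the UPS's rating.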
Event Monitoring
You want to send the following events to either Syslog or your SNMP monitor. The volume should be very low, so the SNMP monitor is likely the better choice. You should also configure the UPS to email you when these events occur, particularly if you don’t have that set up for your SNMP monitor.
- Line Undervoltage: This captures a power outage (voltage goes to zero) and undervoltage due to line sags (typically too much load on the utility feed)
- On Battery/Off Battery: Each time it transitions to and from a battery for any reason. This may or may not be due to a utility problem.
How do we do it?
We follow our own advice – both at my previous startups and at eSymmetrix. Initially we set up MRTG to do our basic monitoring, but it was difficult to keep operating effectively, particularly with Windows, which had a habit of changing the SNMP IDs of network interfaces. We then tried Microsoft Operations Manager, but it was just too slow at displaying useful metric information. We eventually found PRTG from a German company (www.paessler.com) and we’ve used it ourselves and recommended it to our clients. It’s pretty cheap and, most importantly, includes an SNMP helper for Windows that gets around a range of issues we had with MRTG. If you’re willing to trade a little money for a good savings in time, it’s a great tool.
Have another tool that’s worked well for you? Some other metrics that you think are must-haves? Drop me a line and/or leave a comment to let us know.
Tags: Metrics, Monitoring, PRTG, SNMP, Syslog
Posted in Monitoring
Memory still matters, at least until you go 64-bit
Written by Kendall Miller on February 19, 2008 – 1:23 am
The price of memory has continued to go down over the past several years, making it very cost effective to solve many memory utilization problems by purchasing more memory instead of optimizing your application. Consider that if your application runs in 500 MB, it will cost you more to have the developers reduce that footprint to 256 MB than the memory you would save is worth. Now, this won’t be an option in the case of a COTS application you’re shipping to customers’ desktops, but it can be very effective on web servers.
Your developers probably already know this – they gave up bothering about optimizing memory until they run into a real problem; it’s all about writing more features faster now, right? Everything’s grand up to a point… the inherent limitations of the 32-bit memory space. Most people are familiar with the issue that you basically can’t put more than 4GB of memory in a server under 32-bit (because who’s going to invest in programming for AWE now that 64-bit operating systems are readily available?), and that means each process can only get 4GB of address space as well, even if it’s a 32-bit process on a 64-bit OS. This is where the problem comes in.
In practice, if you see your 32-bit process using more than about 1.5GB of memory then it is probably close to running out. Why? It has to do with two issues: memory fragmentation and CPU design.
A bit of CPU design and NT history…
Way back in the day, Microsoft decided that Windows NT was going to support multiple processor architectures. It originally shipped on three processor types – DEC Alpha, MIPS, and Intel 386. One more, PowerPC, was added in NT 3.51. These processors all differed in a number of respects, and the NT design team had to make some concessions to commonality. One such concession was how to split up the 32-bit address space of a process. Each process’s address space is divided between a region reserved for the operating system kernel and a region available to the application itself (user mode). This separation is fundamental to how modern processors and operating systems provide performance and process safety. Some of the chips, x86 among them, could have supported a lopsided split such as 1GB for the kernel and 3GB for the application. The one odd man out was the MIPS – it was designed around an equal split of 2GB for each, and it couldn’t switch modes. Accordingly, Windows NT was designed to accommodate that model, and to make it easier to port code between the various processor types it was decided that every architecture would use the same split.
Practical process limits for 32-bit
Out of the box, a 32-bit process is configured with 2GB for the kernel and 2GB for the application. Why then should you be worried if you see utilization higher than 1.5GB? Two main reasons:
- Everything your application allocates, code and data alike, counts against 2GB instead of 4GB. The other half of the address space is reserved for the kernel and isn’t available to you.
- Things have to fit into memory in contiguous chunks, and can’t be moved once they’re created. If the runtime can’t find a big enough space to meet your request, you’re out of memory.
In practice, it’s the second issue – memory fragmentation – that is going to kill you. In production it’s a big problem for two reasons: First, it’ll cause bizarre low level problems that will report in interesting ways because when it runs out of memory, who knows if it can get enough memory free to nicely report the problem intelligently. The problem tends to be worse in highly abstracted environments (like .NET) because there will be a lot more distance between your code and the raw memory of the system. For example – if you’re using relatively small objects by the time it runs out of enough contiguous memory to create them, will it have enough to create that nice exception object with a stack trace?
Another reason memory fragmentation is very problematic is that it will often take time to show up. For example, your application may be just peachy keen when you run your unit tests against the whole production data set because even though it allocates lots of objects and uses up memory, everything is nice and contiguous. As time goes on and objects are freed and created, you’ll tend to get pockets of memory that get progressively smaller. This particularly happens if your application is casual (and nearly all are) about how it allocates objects that are going to hang around and objects that are very short lived. In short, it is very hard to predict what will finally start causing it to fail to allocate objects. It will also start with a few failures and then progressively escalate as the process continues to be used.
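A toy model makes the fragmentation problem concrete. The Python sketch below is not how any real allocator or runtime works; it just treats memory as a row of slots, fills it with back-to-back blocks that alternate between long-lived and short-lived, and then frees the short-lived ones. All names and sizes are invented for the illustration.

```python
def largest_free_run(slots):
    """Length of the longest run of consecutive free (False) slots."""
    best = run = 0
    for used in slots:
        run = 0 if used else run + 1
        best = max(best, run)
    return best


def simulate(total_slots=100, block=5):
    """Return (total free slots, largest contiguous free run) after the churn."""
    slots = [False] * total_slots
    starts = list(range(0, total_slots, block))
    # Fill memory completely with fixed-size blocks.
    for start in starts:
        for i in range(start, start + block):
            slots[i] = True
    # Free every other block, as if those objects were short-lived.
    for n, start in enumerate(starts):
        if n % 2 == 1:
            for i in range(start, start + block):
                slots[i] = False
    return slots.count(False), largest_free_run(slots)


if __name__ == "__main__":
    free_total, biggest = simulate()
    print(f"{free_total} slots free, largest contiguous run is {biggest}")
```

With the defaults, half of the space is free but no single gap is bigger than one block, so a request for anything larger fails even though the process looks like it has plenty of memory left.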
Microsoft has your back with IIS
For web applications, there are some safety features built into IIS that can really help out, and this is why they’re there. You can configure it to automatically reset the process after a certain amount of time, a certain number of requests, and most importantly a maximum amount of used memory. You’ll want to set this to a value less than 1.5GB. It will then nicely halt the request pipeline, restart your process, and resume processing. This will cause a notable delay in processing, but that’s much better than the alternative.
COM+ has similar capabilities.
Services are for serious developers
Sure, anyone can whip one out now with the wizard in Visual Studio, but if you really have your development chops together then you can write a service that will run reliably without the coddling safety net of IIS and COM+. The point isn’t getting your service to run or even execute its functional tests – it’s that it can operate for months at a time doing all of the work it should without leaking memory or fragmenting RAM to the point that the process starts experiencing problems.
If you aren’t sure quite how to get that right, you can use COM+ to cheat a bit – it can similarly automatically reset itself under high memory utilization. This is often going to be a lot easier than the cost of finding a minor memory leak that adds up over time. I’m not advocating that it’s OK to leak memory, just that you may need to pick your battles.
Memory leaks and garbage collectors
If you’ve been reading through the above and thinking smugly, “I use .NET/Java/VB6/Whatever and it’s garbage collected, we never leak memory!”, well, I’ve got news for you. First, you can leak memory if your fancy runtime leaks memory. It makes no difference whether technically it was your code or the runtime in response to your code; if your memory utilization goes up with each request then somebody’s leaking something. Second, while it isn’t fashionable to refer to it as a memory leak if you create an object structure that isn’t cleaned up when you intend it to be, that’s a memory leak too. This was pretty easy in VB6 and somewhat in Java, somewhat harder in .NET because it detects multiple step circular references that are the only reason objects are still around, but the bottom line is that if you aren’t clear in your data structures you can consume more memory with every request, and that’s effectively a memory leak.
Finally, less memory is still better
Keep in mind it takes time to free memory and allocate memory. If the OS has paged parts of your memory to disk (which it will do even when it has free memory, to save time later when under memory stress), it has to release that as part of freeing up your process. This all hits disk, and disk is much, much slower than RAM. Bottom line – controlled, reasonable memory usage is best for ultimate performance.
What’s your experience?
How do you make the design decisions to balance memory usage with caching and other requirements? Post your comments or drop me a line to continue the conversation.
Tags: COM+, IIS, Memory leak
Posted in Infrastructure, Software Development
Why you should use Microsoft Cluster Service (MSCS)
Written by Kendall Miller on February 18, 2008 – 2:15 am
If you go through the web and do as much research as you can, you’ll find very polarized opinions about MSCS. I’ve been using it since 2002 and have found it to be outstanding, but I can see some pitfalls that could create a bad rap for it.
Why are you clustering?
First, I think Microsoft does it a disservice in how they market it. Instinctively, most people focus on using MSCS in case a given computer’s hardware or operating system spontaneously fails. I’d say that in operating a number of clusters over six years, this was a very rare event for us. In fact, it only happened when we had some brand new hardware fail within its burn-in period. Instead, we’ve found that its great value is in reducing downtime due to maintenance activities.
Example Server Update
Consider the scenario of needing to install the latest patches from Windows Update on your database server. Below are the steps you could go through without clustering:
1. Wait until your maintenance window (let’s assume it’s 1:00 AM on Sunday morning, the low time of your load profile).
2. Take the applications that use your database server offline (to be nice to your users and ensure everything closes).
3. Install the patches on your database server.
4. Reboot your database server.
5. Verify that the server works (that the patches haven’t introduced a problem).
6. Bring all applications back online.
What’s noteworthy in the list above is which items have a variable duration (they may take a different amount of time each time you do maintenance and may not be particularly predictable) vs. a fixed amount of time. In particular, #3 and #5 are variable (and #4 may be).
Now let’s play that again if you have MSCS installed:
1. Install patches on the offline database server node.
2. Reboot the offline server.
3. Wait until your maintenance window.
4. Take the applications that use your database server offline (to be nice to your users and ensure everything closes).
5. Fail over to the offline server.
6. Verify that the server works (that the patches haven’t introduced a problem).
7. Bring all applications back online.
8. Wait a reasonable period of time (like a few days) and install patches on the server that’s now offline.
9. Reboot the offline server.
It is more steps (because there are two servers involved), but what we’ve done is move the things that take variable time outside of the critical window when the system is in maintenance mode. Everything that happens during the maintenance window (steps 4-7) is predictable. Additionally, I consider any server reboot to be risky. Problems tend to show up during a reboot that show up at no other time: hardware problems, for one, and even in a reasonably tight environment it’s possible that a configuration change was made that hasn’t taken effect yet, and that will take effect on reboot and cause a problem. With an MSCS cluster, this risky event happens while the server is offline and won’t affect the production use of your application. You’ve also verified the basic integrity of the patches (after all, the server booted and you can monitor its event log to know it’s basically healthy) before even scheduling your maintenance period.
The comparison gets even better when you consider what happens in the first scenario above if you need to roll back a patch. With a cluster, you just fail back to the original node and you’re good to go. Without a cluster, you have to uninstall the patch, reboot, and re-certify.
Benefits Summary
- Clustering makes system maintenance predictable and short.
- Clustering lets you do risky things during main business hours instead of the middle of the night
- Clustering lets you roll back a change very quickly and easily
If you’re clustering for these reasons, you’ll get great value out of it.
How are you clustering?
Shared Storage – The Traditional Approach
To their credit, Microsoft has worked to make MSCS work with a pretty broad range of hardware. Traditionally, MSCS depends on being able to expose disks to more than one server at the same time. This can be done with traditional server direct attach storage (DAS) technology, SCSI (and now SAS); however, it relies on a set of very intricate hardware: RAID controllers in each server, special cutover terminators in the storage enclosure, etc. There is a lot that can go wrong, and when it does you may lose all of your data. For example, the configuration in the RAID controllers has to agree on what the virtual disks are. The shared storage is used at least for a special drive (called the Quorum drive) that stores central cluster configuration data and defines which node is currently the active node of the cluster. Additionally, any clustered service (like Microsoft SQL Server or Exchange) typically has its disks shared between the nodes in the cluster as well. If you don’t need to split your clustered nodes into different data centers (to create a geodiverse or “stretch” cluster) then shared storage is a solid and straightforward way to go.
What I recommend is that you use a storage technology that encapsulates all of the RAID technology separate from the servers and is based on a technology that is fundamentally oriented towards sharing disks with multiple servers. This way you minimize the configuration on each server and the probability that a difference between servers will lose data. The traditional way of doing that is with a Storage Area Network (SAN). If you consider the two primary SAN technologies (Fibre Channel and iSCSI) both are fundamentally about sharing storage with multiple servers.
If you are only installing a shared storage array for one cluster, you can technically do without the hardware that makes a SAN a SAN – you can have a shared array directly attached to two servers. Most storage arrays support this, and it’s a very cost effective way to get started with separate storage arrays and be able to build later on this foundation to make a full size SAN down the road to optimize your operating costs. You’ll realize another benefit which is that these arrays are almost universally much faster and more scalable than direct attach storage is, for a range of reasons. You’ll be amazed at how much scalability it adds to your database server.
Shared Nothing Approach
Possible in Windows Server 2003 R2 Enterprise, and significantly improved in Windows Server 2008, is the ability to set up a cluster that doesn’t rely on the quorum drive being a single physical resource. Instead, the cluster either employs a third server (called the Witness server, which can’t actually host the clustered processes) that each node in the cluster can talk to across the network, or, when there are three or more nodes in the cluster itself, uses voting between the nodes. Eliminating the requirement that the quorum be physically accessible to every node in the cluster means that services that don’t rely on shared storage (such as a simple Windows service) can be easily clustered. This can even extend to Microsoft SQL Server and Microsoft Exchange in their latest versions because they are capable of replicating their own content through log shipping. The sheer number of options here can be a lot to sift through the first time, but the results are worth it.
My Personal Experience
I’ve always used a SAN from a major vendor that certified the SAN for use with MSCS, and never experienced problems with MSCS. Use a certified SAN, or don’t use MSCS based on shared storage.
The most important factor to being successful with failover clustering is to use high quality hardware for the server and storage system. Look for vendors that have certified their systems for use as part of an MSCS cluster to ensure they got all of the little details right.
Where should you use MSCS?
MSCS is a failover cluster system. Use it when you can’t use a load-balanced clustering option. In general, this is when there’s a natural requirement to have just one of something at a time, most commonly databases (because to be performant they need exclusive access to their files). If you have a load-balanced clustering option, it’s probably going to be less expensive to set up and maintain than MSCS.
If your organization is a solid user of Microsoft SQL Server, I highly recommend investing in at least one MSCS cluster to host your SQL database servers. You can use a single physical cluster to host multiple SQL database servers, an option that makes it particularly cost effective. You can set server affinity so that two instances of SQL Server prefer to run on different physical servers within the cluster, giving you the best utilization of hardware while preserving redundancy. It is somewhat more complicated to set up because you have to use logical servers from the start with SQL Server, which you don’t have to do if there is just one; however, the cost savings can help justify clustering. You might, for example, have both a certification and a production SQL Server on one pair of physical servers in an MSCS cluster. This makes it somewhat easier to ensure that your certification and production environments are absolutely identical, and lets you generally keep certification and production from interfering with each other without having to purchase two separate clusters.
Advanced clustering scenarios
Remember that while most articles and documentation talk about the basic clustering case of two servers & a SAN or other shared storage, as of Windows Server 2003 you can have more than two nodes and can have them use separate shared storage, provided that you have a means to synchronize it. This can be used in a few great scenarios:
- Geodiversity: You can have two separate facilities, each with one or more servers and fail over between the facilities.
- Upgrades and Maintenance: You can use the ability to have additional nodes and separate storage to allow you to take the shared storage system entirely offline in the event of disruptive maintenance or upgrades. I’ve actually used this method to incrementally upgrade and replace cluster systems before where taking the risk of a complete switchover was considered too high.
Moving from basic clustering with a single shared storage array to separate storage arrays is a significant jump in complexity and typically cost because you have to have a highly reliable means to keep the arrays in sync. High end storage vendors typically have this capability for their arrays, and there are third party options that can work with anyone’s SAN. Remember that you will need significant network capacity between your sites. Suffice it to say that if you’re going to go down this road, you’ll want help from someone that’s done it before. I recommend engaging storage professionals because this tends to be the most difficult part of the process.
What’s your experience?
Have you used MSCS? How has it worked out for you? Post your comments or drop me a line to continue the conversation.
Tags: Clustering, High Availability, Infrastructure, MSCS, SAN
Posted in Clustering
Top three things to improve reliability
Written by Kendall Miller on February 9, 2008 – 2:03 am
Quick – what are the three things you should do to make the greatest improvement in the reliability and availability of the systems you’re responsible for?
Marketing for IT products and the general media tend to emphasize opportunities to purchase reliability. This makes sense because they’re in the business of selling things. A classic example is the emphasis on extraordinarily redundant server hardware. A modern server can be purchased with redundant disks, redundant power supplies, redundant memory, and even in some extraordinary cases redundant processors. This is designed to let vendors prove that their server hardware has a staggeringly high mean time between failures, and who wants to be the IT manager that takes an outage because they didn’t purchase a reliability option they could have?
Before charging down the road of buying ever more elaborate hardware redundancy, let’s sit back and look at the big picture of where failures are coming from.
- A well trained person will make a mistake on the order of one time for every one hundred opportunities. Not all of those mistakes will result in an outage, but many will.
- If your solution employs any custom software, it is far more likely to have a problem that would cause an outage than widely-used off-the-shelf software. As a general rule of thumb, the longer a piece of software has been used, the more reliable it has become because the logic errors in it have been found & resolved.
- Hardware fails along a well-established bathtub-shaped curve, with most failures occurring while the hardware is very young (typically in the first 60 days it is operating); the failure rate then starts picking up again at approximately five years for enterprise hardware and three or so for consumer-grade hardware. Even then, the failure slope is typically very gentle.
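To put the one-in-a-hundred figure in perspective, here is a tiny Python calculation of how quickly those odds add up. It assumes each opportunity is independent and carries the same chance of error, which is a simplification, and the function name is invented for this example.

```python
def chance_of_at_least_one_mistake(error_rate, opportunities):
    """Probability of one or more mistakes across independent opportunities."""
    return 1 - (1 - error_rate) ** opportunities


if __name__ == "__main__":
    for steps in (1, 10, 50, 100):
        chance = chance_of_at_least_one_mistake(0.01, steps)
        print(f"{steps:3d} manual steps: {chance:.1%} chance of at least one mistake")
```

A maintenance procedure with fifty manual steps has well over a one-in-three chance of containing a mistake somewhere, which is why fewer and more repeatable steps matter so much.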
From basic reliability monitoring (link to detail) we get the following points about improving the availability of a given system:
- To improve the reliability of the whole system, focus on the worst item. Nothing else will have a useful impact.
- Reliability only gets worse when you add new components to the system that have to function for the system to function.
- When you employ load balanced clustering, controlling how long it takes to fix a down system is a significant driver in the effective availability of the system. This is often referred to as the Mean Time To Recovery (MTTR). This means you must employ monitoring to detect when a redundant item isn’t working so you can restore redundancy as soon as possible.
- Failover clustering is primarily for having predictable, controlled downtime which ideally is during maintenance periods that do not count against your availability. Its primary benefit is consistency and scheduling.
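The points above fall out of some standard availability arithmetic, sketched here in Python. It assumes steady-state availability of MTBF / (MTBF + MTTR) and independent failures, both of which are textbook simplifications. The function names and the figures in the example are illustrative only.

```python
from math import prod  # requires Python 3.8 or later


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability of one component."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def in_series(*availabilities):
    """Every component must work, so the availabilities multiply."""
    return prod(availabilities)


def redundant(single, copies=2):
    """Load-balanced copies: the system is down only if every copy is down."""
    return 1 - (1 - single) ** copies


if __name__ == "__main__":
    server = availability(mtbf_hours=990, mttr_hours=10)
    print(f"one server:            {server:.4f}")
    print(f"two required servers:  {in_series(server, server):.4f}")
    print(f"two redundant servers: {redundant(server):.4f}")
```

Adding a required component always multiplies in a number below one, so the total gets worse, while the redundant case only delivers its benefit if a failed copy is noticed and repaired quickly, which is the MTTR term.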
Now that we’ve gone through that groundwork, let’s go back to the original question: What can we do that will have the most effect on the reliability and availability of our system?
- It’s the people & processes: Human error is the single greatest cause of downtime. In nearly all cases, you can get your best overall improvements by reviewing the people factors that drove your availability.
- Make new systems prove themselves: Whether it’s hardware or software, give it some time running where it ultimately will live before you trust it. About 60 days for most server-grade hardware will identify the hard drives that are going to fail (by far the most likely failure) and even less (10 days) will typically illuminate electronic demons such as memory, network cards, etc.
- Install Monitoring: However you do it, make sure you have monitoring so you know positively that things are healthy, and that you’ll get alarms when they are not. Having a RAID array doesn’t help you if no one notices the first disk die.
What’s your experience?
Have a great story to share? Disagree with this approach? Post your comments or drop me a line to continue the conversation.
Tags: High Availability, IT Management, Process
Posted in Clustering, Infrastructure, Software Development