
Pick Your Scale, any Scale.

Written by Kendall Miller on July 6, 2008 – 11:51 pm

Let’s say you’re starting a project to create a new software system. How big does it need to scale? Realistically, either:

  1. This new system fits into an existing business, possibly replacing a prior application, so you can predict with some accuracy the different aspects of scalability that apply to it.
  2. It doesn’t, and you can’t.

The second scenario is the most interesting one. First off, let’s face it - your new system isn’t going to be the next Facebook, MySpace, or eBay. In short, you don’t need to design your system front to back to be super-scalable. This is good, because the options at that level are time-consuming and resource-intensive.

The key question to answer when laying out a new software system is: to what degree does it need to scale without being rewritten? This scale is unlikely to be your “best case” business size, because scalability has an opportunity cost. This scale should be defined as specifically as is reasonable, and clearly understood and validated by both business and technical staff. This ensures that if your business grows beyond expectations, it won’t come as a surprise if you need to make even major changes to your system.

Creating facts from Air

Let’s say you’re starting to develop an application that fits into the second category above. You still need to work out what your scalability target is.

To make any decision that is better than random, you have to work out some aspects of the expected scaling of the application. In the absence of real facts to extrapolate scalability from, you need to cooperate with the business side to establish presumed facts about the scalability requirements. This may sound a lot like assumptions, but they really go beyond that because these will become facts as you develop the system. As a starting point, make it clear to all involved that:

  1. If the targets are low, it should be assumed you’ll have to turn away business because the system can’t scale above them.
  2. If the targets are high, the system will cost more and take longer to create.

In most businesses, the second outcome is worse than the first. Why? Because the second is a price you pay up front, before the system goes into service. The first is based on an assumption: you might have to turn away business. You also might be able to see it coming in time and address the issue. From a business standpoint, this is a better trade-off. Finally, there are the non-technical aspects:

  1. The sooner you have a working system, the sooner the business can validate the market and start getting real data on uptake to adjust your scalability goals.
  2. Unless the product is a failure, you expect demand to eventually exceed the capacity of the system; it’s just a matter of when. When it does, you should be able to afford rewriting all or part of the system. In other words, the funds to solve the problem should be available if you have the problem.

From this comes an axiom of scalability:

The system needs to be based on the lowest scale that will provide enough time and money to replace it with a new system.

Put another way, a system that is faster or more scalable than it needs to be for the business was more expensive and took longer to develop than necessary. Think of it like a race car: The ideal Indy Car would fall apart just after the judges validated it won without breaking the rules. Any stronger and that strength could have been put into something else. The time you spent making it more scalable than necessary could have added more features, fixed more defects, or gotten it out the door sooner.

Establish a Growth Curve

The growth curve needs to be sufficient to inform the developers of what decisions to make at each point. To get there, start by describing the scale from the business standpoint. During design of the actual system you can keep translating this into the specific requirements for speed, storage, and capacity based on the behavior of the actual system. This will prevent you from achieving technical goals that don’t satisfy the business goals.

For most systems, you want to establish the business goals for:

  1. Number of Possible Users: How many accounts will there be on the system? This is an upper bound of the number of people that could access the system if they wanted to.
  2. Number of Simultaneous Users: Number of accounts that will be accessing the system at the same time. For most applications, at the same time is likely best thought of as in the same 15-30 minutes.
  3. Number of Customers: For most applications delivered to businesses, some parts of the system (such as configuration and data storage) will scale based on the number of customers (e.g. businesses), not the number of accounts those customers have.
  4. Data In and Out: If the system is going to have any imports and exports that aren’t user-driven (such as EDI feeds or a public API) then the number of partners (other entities that will exchange information with you) and the frequency of exchange need to be determined.

Things to not bother with:

  1. Response Time: For customer interactive products, response time is dictated by what end users will tolerate and is not really going to be a business decision (aside from deciding if you’re going to produce something your customers are willing to use). For non-interactive products or back-end this may need more discussion with the business, but again - the business is going to expect you to be able to figure out what will make it a success.
  2. Data Retention: Assume it all has to be kept, and kept indefinitely. In the end, storage is cheap, and this design decision rarely costs a lot if made up front but is expensive to reverse. Data also has the amazing power to make heroes out of IT when the business starts posing questions later and you can answer them. Generate as many facts as you can now to help you out later.

These items are past the point of diminishing returns with the business. You should work them out within the development team and document them, but you shouldn’t believe that any business sign off you might get is binding or useful.

Build to the Scale

Once you’ve established your growth curves, pick your candidate architecture and translate the growth curves into system performance requirements.

Hypothetical Example: If you need to support 1000 simultaneous users for a web application, determine the dynamic web hits per second by determining how often an average user will request a dynamic page (say every 5 seconds, which is very fast for most dynamic applications). These two numbers give you a dynamic hits per second of (1000/5) = 200. Then factor in how long each page will take to calculate (make a goal of, say, 250ms) to get how many requests you need to be able to process at the same time: (200 * 0.250) = 50. This is the key scale point for your web application: When deployed, it must support 50 requests being processed in parallel. You’ll need to get to this point by either making it really scalable on a single server, or splitting the load over multiple servers.

One thing that should jump out of the math behind this is that anything you can do to make the calculation time of a single page drop pays big dividends: If you drop the average calculation time by half (125ms) then the number of requests in parallel drops by half (200*0.125) = 25. This in turn may well cut the number of servers you need in half, easing your maintenance and deployment cost. If you can’t do this, reduce the number of dynamic pages requested per second by either making more static pages (such as pre-rendering pages that change but don’t change frequently) or caching dynamic pages that have some predictable consistency (which really makes them static pages). This is often much trickier to do and test, so your best first option is to reduce the time for each page.
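The arithmetic in the example above can be sketched in a few lines of Python. The function and parameter names here are illustrative only, and the figures are the hypothetical ones from the example, not measurements.

```python
def dynamic_hits_per_second(simultaneous_users: int, seconds_between_requests: float) -> float:
    """Average number of dynamic page requests arriving each second."""
    return simultaneous_users / seconds_between_requests


def parallel_requests(hits_per_second: float, seconds_per_page: float) -> float:
    """Average number of requests being processed at the same moment."""
    return hits_per_second * seconds_per_page


hits = dynamic_hits_per_second(1000, 5)      # 1000 users, one request every 5 seconds
baseline = parallel_requests(hits, 0.250)    # 250ms per page
halved = parallel_requests(hits, 0.125)      # 125ms per page

print(hits, baseline, halved)  # 200.0 50.0 25.0
```

These are averages. Real traffic comes in bursts, so treat the result as a floor for capacity planning rather than a ceiling.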

Side Point: This also highlights an easy way to accommodate guessing low on a system that’s been in service for a year or more: If you’re processor bound you can replace that hardware with current units and often pick up 30% per year it’s been since you purchased the original hardware. This won’t save you from network problems, disk storage problems, or some memory problems, but it is surprisingly handy.

As you look at each candidate architecture, look at each component and determine the critical “how much, how fast, how often” factors based on the business inputs. If you change your architecture or external interface design (the user interface or import/export capabilities) you need to re-evaluate if you’ve moved the targets as well because your design goals no longer reflect the business growth curves.

Really, to the Scale

Within your development team you will typically have two types of developers you need to watch: Those that never consider scale and those that obsessively consider scale. The former will build it however and then wait to see if there is a performance problem. The latter will try to make every system the next Amazon. Neither situation is good. Identify people’s tendencies early and work to manage them to the center. Remember that the system is only as scalable as its slowest part, and there is always a slowest part.

You can get good results by having the people that are most concerned about scalability move around on the project to different subsystems. This will tend to keep them too busy to earn the keeper of the nanosecond award on any one system (which they will do if you let them stay put and just work on one system) and will make it unlikely that more cavalier developers can hide a problem. It will also help the team learn from each other: It often isn’t worth making a specific feature as fast as possible, and it is always worth thinking about what will make a feature fast before coding it.

Finally, budget time in the development team to fix scalability issues. Regardless of how much work you put into it, once the real system is built and tested you’ll find places that are slower and less scalable than you expected. If nothing else, you need to develop an accurate model of how the system should perform in production so you can check the real world against it later. As your business grows, you need to be able to get ahead of it and understand when it is time to make the code faster, add hardware, or do something else to stay one step ahead.

Disk is Your Friend, but Beware the Network

If you’ve gone over the system from nose to tail and you’re disk bound, you’ve probably optimized that design as well as you can. Disk has gotten faster at a much slower pace than memory or processor, and being disk bound means you’re getting all the requests where they need to go in a timely manner and are able to process the inputs and outputs, so now it’s in the hands of the hardware. Unfortunately at that point there generally isn’t much more you can do: The difference in performance between server drives and the fastest drives money can buy isn’t very much.

If you’re finding that you aren’t disk bound and you aren’t processor bound then be worried. You’re either network throughput bound or you’re network latency bound. If you’re network throughput bound, you can probably fix it cost effectively with some basic engineering either in how you select what to send across the network or what you cache so you don’t need to send it across. You should try to give yourself some headroom here for growth, but faster networks can be purchased and you can generally tweak the software to mitigate this in minor updates.

Being network latency bound is a more serious issue because it often means that you are at the practical scalability limit of your application. The difference in network latency between relatively cheap hardware and the best hardware isn’t very much, and has been essentially constant for the last 10 years. You can’t buy your way out of this problem. It also is typically caused by a badly designed interface between components of the system which will need to be substantially or entirely rethought and rebuilt to address, which isn’t easy to do with a running system. If you find yourself in this situation and you aren’t sure you have met your business goals you should rethink your approach immediately. Because no amount of money on hardware can get you out of this problem, caution is the word of the day.


Posted in Management, Software Development | No Comments »

So Why are You Still Hosting?

Written by Kendall Miller on June 13, 2008 – 1:18 am

Right now, the power is out at my home. That doesn’t happen often - in fact, it’s been almost two years since we lost power long enough for my UPS to shut down my home network. Normally this would be a small inconvenience, but I still host a few things for my wife out of my house which are now down. The largest of these is a fairly popular forum for an author she likes, but there are other sites as well.

Why am I still hosting these at home? Really there’s no reason - I’ve shifted hosting for my personal services out to other providers, and our company services are also hosted by hosting companies. I just haven’t moved her stuff out of my house.

We talk with a lot of small and medium sized businesses that are still hosting all of their own services internally for pretty much the same reasons - they originally had them in house when they were much smaller and the market was different, and haven’t considered what it would mean to have those computers live somewhere else. It’s time for a change.

Why It’s time to Use the Cloud

You should look at all of your important business services - things that your business can’t operate without - and work out a plan to no longer host those items in your facility. As a first step, just consider what it means to provide the same applications and services, but have the computers not live within your company. The main goals for moving these services out are:

  1. Business Agility: When you use a hosting company it’s easier to change capacity as your needs change, even to bring services up temporarily as a trial run and then shut them down if they don’t pan out. This makes it easy to experiment with new software technology without the traditional problems of hosting getting in the way.
  2. Low Cost Reliability: If you want those services available, the cost to outfit a room to provide redundant cooling and power for a single rack of equipment is easily $50,000. To host one rack of equipment in a basic Tier-2 data center can cost around $1,500 to $3,000 a month, which includes power and Internet. At that rate, how quickly will you get an ROI on your facility investment?
  3. Improved Focus: Getting this equipment out of your shop improves your focus on the things you really need to be spending time on: Projects for the business and end-user support. The rest of it is overhead.
  4. Access from Anywhere: When you set up your services so they can live in the cloud and be used from your office, it’s easy to make those same services available to employees from home and from laptops. Not as second class citizens but with all of the ranks and privileges of being in the office. This helps you leverage employee talent wherever it is. It’s also easier to set up rock-solid extranet access for customers and suppliers.
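The ROI question in item 2 can be worked as rough arithmetic. This sketch uses only the figures quoted above and assumes, for simplicity, that running your own room has no ongoing costs at all (no power, cooling, or maintenance), which flatters the in-house option. The function name is illustrative.

```python
def months_until_hosting_matches_build(build_cost: float, monthly_hosting_fee: float) -> float:
    """Months of hosting fees that add up to the one-time room build-out cost."""
    return build_cost / monthly_hosting_fee


ROOM_BUILD_COST = 50_000  # redundant cooling and power for a single rack

for monthly_fee in (1_500, 3_000):
    months = months_until_hosting_matches_build(ROOM_BUILD_COST, monthly_fee)
    print(f"${monthly_fee:,} a month: {months:.1f} months")
```

Even on these generous terms, the room takes roughly 17 to 33 months of avoided hosting fees to pay for itself, and longer once its own running costs are counted.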

When you start looking at each thing you provide as a service, you might also find that some of them - like Microsoft Exchange - really aren’t worth hosting yourself at all, even in a data center, and that it’d ultimately be in your best interest to outsource them entirely to a hosted provider. There are a number of hosted Exchange providers that can do this very effectively. While the cost may seem high based on what it cost you to purchase your initial Exchange licenses, when you look at the real cash costs for Exchange over two to three years they are very cost effective.

Once you’ve taken the step of taking an existing service and outsourced it entirely, you might even consider a Software as a Service offering for some of your core services (such as a hosted CRM). This is the most aggressive mode of outsourcing and does create a set of unique risks and opportunities.

But I can’t See It

We hear two common objections from IT administrators about moving services out of their shop, even if it’s just relocating servers into a data center. The first is that it will make it hard for them to get upgrades when necessary because the business won’t be able to see and feel the new equipment. Out of sight, out of mind, as the saying goes. The second is that the IT administrators want to be able to do a laying of hands on the equipment to maintain it. There’s a comfort factor in knowing you can walk into a room and flip the power switch or move a drive or just bask in the warm glow of blinking lights.

Here’s the good news: Both of these reasons are not only suspect in their own right, but are preventing your shop from getting to the next level in IT’s relationship with the business.

First, even though vendors do a good job of making server hardware look serious and fun, in the end it’s just a business appliance: It either is good enough to deliver for the business or it isn’t. With rare exception, there is no extra business value for it to look good, new, or cool. If you find that you need to show the business physical servers to explain your costs, you’re missing out on the critical opportunity to establish a real partnership between business and IT. You need to be sure you’re spending when it’s time to spend and saving when it’s time to save, and have discussions in the language the business would use for any other service it would acquire.

Second, if your IT administration patterns and practices require routinely touching your physical infrastructure then you need to re-examine them. It generally means you either have equipment that is no longer up to the task or that you’re not doing enough automation of IT tasks. If you have trouble-prone hardware, it’s time to either fix the fundamental issue or ditch the hardware. Ironically, this type of problem is often easier in a hosted environment because it generally isn’t your problem: it’s the hosting company’s.

Automation is essential because humans are the most error-prone part of any standard process. Your routine IT administration time shouldn’t be going to consistent tasks - they should be automated, leaving your time for user support and other business value-add services. That’s right - even in your shop with your existing staff you can find more time to spend on projects instead of support events by automating recurring tasks.

Some Things Still Stay

There are some things that should be on site for performance reasons. Regardless of how big your Internet connection is, you’re going to want basic file and printer sharing services to be local. Depending on the size of your site, you’ll probably also want a directory server for whatever your directory system is (e.g. Microsoft Active Directory). Even here the central services help: If you have a reasonable Internet connection, you can have your local file server back itself up to the data center by using one of a few distributed backup systems (such as Microsoft’s Data Protection Manager or a third-party option like NSI Software’s Double-Take). This eliminates the time and attention that local disk backups require.

Perhaps not Now, but Soon - and For the Rest of Your Life

It may not be appropriate to move a number of your services outside yet: if you have only one business site, light access by employees externally, and aren’t expecting that to change, then you can host most things yourself. A number of the considerations still apply - but you might just use an external facility for your public web presence and for backing up your essential data for business continuity.

Even if you don’t do much now, you should find some opportunity to put a service outside so you and your company can gain experience working with external hosting providers. You’ll stay current on the capabilities and costs, so that as new business requirements evolve you’re ready to take care of them. You’ll be in a better position to advise your company on when to move things out of the shop, and as you do you’ll discover that instead of focusing your time and talent inward on the routine operations of infrastructure, you’ll have time for those projects that really make a difference to your business.

How Has the Cloud Delivered For You?

Have a story about what has and hasn’t worked with hosting? Drop me a line or post a comment to share it.


Posted in Infrastructure, Management | 2 Comments »

Why you should use Microsoft Cluster Service (MSCS)

Written by Kendall Miller on February 18, 2008 – 2:15 am

If you go through the web and do as much research as you can, you’ll find very polarized opinions about MSCS. I’ve been using it since 2002 and have found it to be outstanding, but I can see some pitfalls that could create a bad rap for it.

Why are you clustering?

First, I think Microsoft does it a disservice in how they market it. Instinctively, most people focus on using MSCS in case a given computer’s hardware or operating system spontaneously fails. In operating a number of clusters over six years, I’d say this was a very rare event for us. In fact, it only happened when we had some brand new hardware fail within its burn-in period. Instead, we’ve found that its great value is in reducing downtime due to maintenance activities.

Example Server Update

Consider the scenario of needing to install the latest patches from Windows Update on your database server. Below are the steps you could go through without clustering:

  1. Wait until your maintenance window (let’s assume it’s 1:00 AM on Sunday morning, the low time of your load profile).
  2. Take the applications that use your database server offline (to be nice to your users and ensure everything closes).
  3. Install the patches on your database server
  4. Reboot your database server
  5. Verify that the server works (that the patches haven’t introduced a problem)
  6. Bring all applications back online

What’s noteworthy in the list above is which items have a variable duration (they may take a different amount of time each time you do maintenance and may not be particularly predictable) vs. a fixed amount of time. In particular, #3 and #5 are variable (and #4 may be).

Now let’s play that again if you have MSCS installed:

  1. Install patches on the offline database server node.
  2. Reboot the offline server.
  3. Wait until your maintenance window
  4. Take the applications that use your database server offline (to be nice to your users and ensure everything closes)
  5. Failover to the offline server
  6. Verify that the server works (that the patches haven’t introduced a problem)
  7. Bring all applications back online.
  8. Wait a reasonable period of time (like a few days) and install patches on the server that’s now offline
  9. Reboot the offline server.

It is more steps (because there are two servers involved), but what we’ve done is move the things that take variable time outside of the critical window when the system is in maintenance mode. Everything that happens during maintenance mode (steps 4-7) is predictable. Additionally, I consider any server reboot to be risky. Problems tend to show up during a reboot that show up at no other time: hardware problems, for one, and even in a reasonably tight environment it’s possible that a configuration change was made that hasn’t taken effect yet and will cause a problem on reboot. With an MSCS cluster, this risky event happens while the server is offline and won’t affect the production use of your application. You’ve also verified the basic integrity of the patches (after all - the server booted and you can monitor its event log to know it’s basically healthy) before even scheduling your maintenance period.
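The difference between the two procedures can be made concrete by tagging each step with whether its duration is variable and whether it falls inside the maintenance window. This is a minimal sketch: the `Step` type and function name are illustrative, the "wait for the window" step is omitted, and treating the reboot as variable is an assumption (the post only says it may be).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    variable_duration: bool  # True if the time it takes is hard to predict
    in_window: bool          # True if applications are offline while it runs


# Classification follows the post; the reboot being variable is an assumption.
WITHOUT_CLUSTER = [
    Step("Take applications offline", False, True),
    Step("Install patches", True, True),
    Step("Reboot server", True, True),
    Step("Verify server", True, True),
    Step("Bring applications online", False, True),
]

WITH_CLUSTER = [
    Step("Install patches on offline node", True, False),
    Step("Reboot offline node", True, False),
    Step("Take applications offline", False, True),
    Step("Fail over to patched node", False, True),
    Step("Verify server", False, True),
    Step("Bring applications online", False, True),
    Step("Patch the node that is now offline", True, False),
    Step("Reboot that node", True, False),
]


def unpredictable_work_in_window(steps):
    """Names of steps that are both variable in duration and inside the window."""
    return [s.name for s in steps if s.variable_duration and s.in_window]


print(unpredictable_work_in_window(WITHOUT_CLUSTER))
print(unpredictable_work_in_window(WITH_CLUSTER))
```

Without a cluster, the patch install, reboot, and verification all land inside the window; with one, the second list comes back empty.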

The comparison gets even better when you consider what happens in the first scenario above if you need to roll back a patch. With a cluster, you just fail back to the original node and you’re good to go. Without a cluster, you have to uninstall the patch, reboot, and re-certify.

Benefits Summary

  • Clustering makes system maintenance predictable and short.
  • Clustering lets you do risky things during main business hours instead of the middle of the night.
  • Clustering lets you roll back a change very quickly and easily.

If you’re clustering for these reasons, you’ll get great value out of it.

How are you clustering?

Shared Storage - The Traditional Approach

To their credit, Microsoft has worked to make MSCS work with a pretty broad range of hardware. Traditionally, MSCS depends on being able to expose disks to more than one server at the same time. This can be done with traditional server direct attach storage (DAS) technology - SCSI (and now SAS) - however, it relies on a set of very intricate hardware: RAID controllers in each server, special cutover terminators in the storage enclosure, etc. There is a lot that can go wrong, and when it does you may lose all of your data. For example, the configuration in the RAID controllers has to agree on what the virtual disks are. The shared storage is used at least for a special drive (called the Quorum drive) that stores central cluster configuration data and defines which node is the current active node of the cluster. Additionally, any clustered service (like Microsoft SQL Server or Exchange) would typically have its disks also shared between the nodes in the cluster. If you don’t need to split your clustered nodes into different data centers (to create a geodiverse or “stretch” cluster) then shared storage is a solid and straightforward way to go.

What I recommend is that you use a storage technology that encapsulates all of the RAID technology separate from the servers and is based on a technology that is fundamentally oriented towards sharing disks with multiple servers. This way you minimize the configuration on each server and the probability that a difference between servers will lose data. The traditional way of doing that is with a Storage Area Network (SAN). If you consider the two primary SAN technologies (Fibre Channel and iSCSI) both are fundamentally about sharing storage with multiple servers.

If you are only installing a shared storage array for one cluster, you can technically do without the hardware that makes a SAN a SAN - you can have a shared array directly attached to two servers. Most storage arrays support this, and it’s a very cost effective way to get started with separate storage arrays and be able to build later on this foundation to make a full size SAN down the road to optimize your operating costs. You’ll realize another benefit which is that these arrays are almost universally much faster and more scalable than direct attach storage is, for a range of reasons. You’ll be amazed at how much scalability it adds to your database server.

Shared Nothing Approach

Possible in Windows Server 2003 R2 Enterprise, and significantly improved in Windows Server 2008, is the ability to set up a cluster that doesn’t rely on the quorum drive being a single physical resource. Instead, it either employs a third server (called the Witness server, which can’t actually host the clustered processes) that each node in the cluster can talk to across the network, or relies on voting between the servers when there are three or more nodes in the cluster itself. Eliminating the requirement that the quorum be physically accessible to every node in the cluster means that services that don’t rely on shared storage (such as a simple Windows service) can be easily implemented. This can even extend to Microsoft SQL Server and Microsoft Exchange in their latest versions because they are capable of replicating their own content through log shipping. The sheer number of options here can be a lot to sift through the first time, but the results are worth it.
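The voting idea can be illustrated with a simple majority rule. This is a sketch of majority voting in general, not MSCS’s actual algorithm or API; the function name and the one-vote-per-participant model are assumptions made for illustration.

```python
def has_quorum(votes_present: int, total_votes: int) -> bool:
    """True when strictly more than half of all configured votes are reachable."""
    return 2 * votes_present > total_votes


# Two nodes plus a witness: three votes in total.
print(has_quorum(votes_present=2, total_votes=3))  # a surviving node that can reach the witness
print(has_quorum(votes_present=1, total_votes=3))  # a node cut off from everything else
```

Requiring a strict majority is what keeps two halves of a split cluster from both deciding they are the active side: at most one side can ever hold more than half the votes.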

My Personal Experience

I’ve always used a SAN from a major vendor that certified the SAN for use with MSCS, and never experienced problems with MSCS. Use a certified SAN, or don’t use MSCS based on shared storage.

The most important factor to being successful with failover clustering is to use high quality hardware for the server and storage system. Look for vendors that have certified their systems for use as part of an MSCS cluster to ensure they got all of the little details right.

Where should you use MSCS?

MSCS is a failover cluster system. Use it when you can’t use a load-balanced clustering option. In general, this is when there’s a natural requirement to have just one of something at a time, most commonly databases (because to be performant they need exclusive access to their files). If you have a load-balanced clustering option, it’s probably going to be less expensive to set up and maintain than MSCS.

If your organization is a solid user of Microsoft SQL Server, I highly recommend investing in at least one MSCS cluster to host your SQL database servers. You can use a single physical cluster to host multiple SQL database servers, an option that makes it particularly cost effective. You can set server affinity so that two instances of SQL Server prefer to run on different physical servers within the cluster, giving you the best utilization of hardware while preserving redundancy. It is somewhat more complicated to set up because you have to use logical servers from the start with SQL Server, which you don’t have to do if there is just one; however, the cost savings can help justify clustering. You might, for example, have both a certification and a production SQL Server on one pair of physical servers in an MSCS cluster. This makes it somewhat easier to ensure that your certification and production environments are absolutely identical, and lets you generally keep certification and production from interfering with each other without having to purchase two separate clusters.

Advanced clustering scenarios

Remember that while most articles and documentation talk about the basic clustering case of two servers & a SAN or other shared storage, as of Windows Server 2003 you can have more than two nodes and can have them use separate shared storage, provided that you have a means to synchronize it. This can be used in a few great scenarios:

  1. Geodiversity: You can have two separate facilities, each with one or more servers and fail over between the facilities.
  2. Upgrades and Maintenance: You can use the ability to have additional nodes and separate storage to allow you to take the shared storage system entirely offline in the event of disruptive maintenance or upgrades. I’ve actually used this method to incrementally upgrade and replace cluster systems before where taking the risk of a complete switchover was considered too high.

Moving from basic clustering with a single shared storage array to separate storage arrays is a significant jump in complexity and typically cost because you have to have a highly reliable means to keep the arrays in sync. High end storage vendors typically have this capability for their arrays, and there are third party options that can work with anyone’s SAN. Remember that you will need significant network capacity between your sites. Suffice it to say that if you’re going to go down this road, you’ll want help from someone that’s done it before. I recommend engaging storage professionals because this tends to be the most difficult part of the process.

What’s your experience?

Have you used MSCS? How has it worked out for you? Post your comments or drop me a line to continue the conversation.


Posted in Clustering | No Comments »