Posts Tagged ‘performance’
Pick Your Scale, any Scale.
Written by Kendall Miller on July 6, 2008 – 11:51 pm
Let’s say you’re starting a project to create a new software system. How big does it need to scale? Realistically, either:
- This new system fits into an existing business, possibly replacing a prior application, so you can predict with some accuracy the different aspects of scalability that apply to it.
- It doesn’t, and you can’t.
The second scenario is the most interesting one. First off, let’s face it – your new system isn’t going to be the next Facebook, MySpace, or eBay. In short, you don’t need to worry about having your system needing to be designed front to back as a super-scalable system. This is good because the options at that level are time consuming and resource intensive.
The key question to answer when laying out a new software system is: to what degree does it need to scale without being rewritten? This scale is unlikely to be your “best case” business size, because scalability has opportunity cost. This scale should be defined as specifically as reasonable, and clearly understood and validated by both business and technical staff. This ensures that if your business grows beyond expectations, it won’t come as a surprise if you need to make even major changes to your system.
Creating facts from Air
Let’s say you’re starting to develop an application that fits into the second category above. You still need to work out what your scalability target is.
To make any decision that is better than random, you have to work out some aspects of the expected scaling of the application. In the absence of real facts to extrapolate scalability from, you need to cooperate with the business side to establish presumed facts about the scalability requirements. This may sound a lot like assumptions, but they really go beyond that because these will become facts as you develop the system. As a starting point, make it clear to all involved that:
- If the targets are low, it should be assumed you’ll have to turn away business because the system can’t scale above them.
- If the targets are high, the system will cost more and take longer to create.
In most businesses, the second outcome is worse than the first. Why? Because the second is a price you pay up front, before the system goes into service. The first is based on an assumption: you might have to turn away business. You also might be able to realize it in time and address the issue. From a business standpoint, this is a better trade-off. Finally, there are the non-technical aspects:
- The sooner you have a working system, the sooner the business can validate the market and start getting real data on uptake to adjust your scalability goals.
- Unless the product is a failure, you expect demand to eventually exceed the capacity of the system; it’s just a matter of when. If it does, then you should be able to afford rewriting all or part of the system. In other words, the funds to solve the problem should be available if you have the problem.
From this comes an axiom of scalability:
The system needs to be based on the lowest scale that will provide enough time and money to replace it with a new system.
Put another way, a system that is faster or more scalable than it needs to be for the business was more expensive and took longer to develop than necessary. Think of it like a race car: The ideal Indy Car would fall apart just after the judges validated it won without breaking the rules. Any stronger and that strength could have been put into something else. The time you spent making it more scalable than necessary could have added more features, fixed more defects, or gotten it out the door sooner.
Establish a Growth Curve
The growth curve needs to be sufficient to inform the developers of what decisions to make at each point. To get there, start with describing the scale from the business standpoint. During design of the actual system you can keep translating this into the specific requirements for speed, storage, and capacity based on the behavior of the actual system. This will prevent you from achieving technical goals that don’t satisfy the business goals.
For most systems, you want to establish the business goals for:
- Number of Possible Users: How many accounts will there be on the system? This is an upper bound of the number of people that could access the system if they wanted to.
- Number of Simultaneous Users: Number of accounts that will be accessing the system at the same time. For most applications, at the same time is likely best thought of as in the same 15-30 minutes.
- Number of Customers: For most applications delivered to businesses, some parts of the system (such as configuration and data storage) will scale based on the number of customers (i.e. businesses), not the number of accounts those customers have.
- Data In and Out: If the system is going to have any imports and exports that aren’t user-driven (such as EDI feeds or a public API) then the number of partners (other entities that will exchange information with you) and the frequency of exchange need to be determined.
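One way to keep these business goals in front of the development team is to record each point on the growth curve as a small, explicit structure. The sketch below is only an illustration: the class name, field names, milestone labels, and every number in it are made up for the example, not taken from a real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleTarget:
    """Business-level scale agreed for one point on the growth curve."""
    label: str                          # hypothetical milestone name
    possible_users: int                 # accounts that could access the system
    simultaneous_users: int             # accounts active in the same 15-30 minutes
    customers: int                      # businesses; drives configuration and storage
    partners: int                       # entities exchanging non-user-driven data
    exchanges_per_partner_per_day: int  # frequency of imports and exports

    def daily_exchanges(self) -> int:
        """Non-user-driven imports and exports per day across all partners."""
        return self.partners * self.exchanges_per_partner_per_day


# Made-up numbers, purely for illustration.
growth_curve = [
    ScaleTarget("launch", 2_000, 100, 20, 2, 24),
    ScaleTarget("year 1", 20_000, 1_000, 200, 10, 24),
]

for target in growth_curve:
    print(target.label, target.simultaneous_users, target.daily_exchanges())
```

Writing the targets down this way makes it obvious when a design change has quietly moved one of them.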
Things to not bother with:
- Response Time: For customer interactive products, response time is dictated by what end users will tolerate and is not really going to be a business decision (aside from deciding if you’re going to produce something your customers are willing to use). For non-interactive products or back-end this may need more discussion with the business, but again – the business is going to expect you to be able to figure out what will make it a success.
- Data Retention: Assume it all, and more, has to be kept indefinitely. In the end, storage is cheap and this design decision rarely costs a lot if made up front but is expensive to reverse. Data also has the amazing power to make heroes out of IT when the business starts posing questions later and you can answer them. Generate as many facts as you can now to help you out later.
These items are past the point of diminishing returns with the business. You should work them out within the development team and document them, but you shouldn’t believe that any business sign off you might get is binding or useful.
Build to the Scale
Once you’ve established your growth curves, pick your candidate architecture and translate the growth curves into system performance requirements.
Hypothetical Example: If you need to support 1000 simultaneous users for a web application, determine the dynamic web hits per second by determining how often an average user will request a dynamic page (say every 5 seconds, which is very fast for most dynamic applications). These two numbers would give you a dynamic hits per second of (1000/5) = 200. Then multiply by how long each page will take to calculate (make a goal of say 250ms) to get how many requests you need to be able to process at the same time: (200 * 0.250) = 50. This is the key scale point for your web application: When deployed, it must support 50 requests being processed in parallel. You’ll need to get to this point by either making it really scalable on a single server, or splitting the load over multiple servers.
One thing that should jump out of the math behind this is that anything you can do to make the calculation time of a single page drop pays big dividends: If you drop the average calculation time by half (125ms) then the number of requests in parallel drops by half (200*0.125) = 25. This in turn may well cut the number of servers you need in half, easing your maintenance and deployment cost. If you can’t do this, reduce the number of dynamic pages requested per second by either making more static pages (such as pre-rendering pages that change but don’t change frequently) or caching dynamic pages that have some predictable consistency (which really makes them static pages). This is often much trickier to do and test, so your best first option is to reduce the time for each page.
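The arithmetic in the hypothetical example can be written down as a couple of small functions so it is easy to rerun as the business inputs change. This is a minimal sketch using the same numbers as above; the function names are my own, and the relationship it uses (arrival rate multiplied by time per request) is the one commonly known as Little’s Law.

```python
def hits_per_second(simultaneous_users: int, seconds_between_requests: float) -> float:
    """Dynamic page requests arriving per second."""
    return simultaneous_users / seconds_between_requests


def parallel_requests(hits_per_sec: float, seconds_per_page: float) -> float:
    """Average number of requests being processed at the same time."""
    return hits_per_sec * seconds_per_page


hits = hits_per_second(1000, 5)
print(hits)                            # 200.0
print(parallel_requests(hits, 0.250))  # 50.0
print(parallel_requests(hits, 0.125))  # 25.0
```

These are averages. Real traffic arrives in bursts, so treat the result as a floor for the parallelism you need rather than a ceiling.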
Side Point: This also highlights an easy way to accommodate guessing low on a system that’s been in service for a year or more: If you’re processor bound you can replace that hardware with current units and often pick up 30% per year it’s been since you purchased the original hardware. This won’t save you from network problems, disk storage problems, or some memory problems, but it is surprisingly handy.
As you look at each candidate architecture, look at each component and determine the critical “how much, how fast, how often” factors based on the business inputs. If you change your architecture or external interface design (the user interface or import/export capabilities) you need to re-evaluate if you’ve moved the targets as well because your design goals no longer reflect the business growth curves.
Really, to the Scale
Within your development team you will typically have two types of developers you need to watch: Those that never consider scale and those that obsessively consider scale. The former will build it however and then wait to see if there is a performance problem. The latter will try to make every system the next Amazon. Neither situation is good. Identify people’s tendencies early and work to manage them to the center. Remember that the system is only as scalable as its slowest part, and there is always a slowest part.
You can get good results by having the people that are most concerned about scalability move around on the project to different subsystems. This will tend to keep them too busy to earn the keeper of the nanosecond award on any one system (which they will do if you let them stay put and just work on one system) and will make it unlikely that more cavalier developers can hide a problem. It will also help the team learn from each other: It often isn’t worth making a specific feature as fast as possible, and it is always worth thinking about what will make a feature fast before coding it.
Finally, budget time in the development team to fix scalability issues. Regardless of how much work you put into it, once the real system is built and tested you’ll find places that are slower and less scalable than you expected. If nothing else, you need to develop an accurate model of how the system should perform in production so you can check the real world against it later. As your business grows, you need to be able to get ahead of it and understand when it is time to make the code faster, add hardware, or do something else to stay one step ahead.
Disk is Your Friend, but Beware the Network
If you’ve gone over the system from nose to tail and you’re disk bound, you’ve probably optimized that design as well as you can. Disk has gotten faster at a much slower pace than memory or processor, and being disk bound means you’re getting all the requests where they need to go in a timely manner and are able to process the inputs and outputs, so now it’s in the hands of the hardware. Unfortunately at that point there generally isn’t much more you can do: The difference in performance between server drives and the fastest drives money can buy isn’t very much.
If you’re finding that you aren’t disk bound and you aren’t processor bound then be worried. You’re either network throughput bound or you’re network latency bound. If you’re network throughput bound, you can probably fix it cost effectively with some basic engineering either in how you select what to send across the network or what you cache so you don’t need to send it across. You should try to give yourself some headroom here for growth, but faster networks can be purchased and you can generally tweak the software to mitigate this in minor updates.
Being network latency bound is a more serious issue because it often means that you are at the practical scalability limit of your application. The difference in network latency between relatively cheap hardware and the best hardware isn’t very much, and has been essentially constant for the last 10 years. You can’t buy your way out of this problem. It also is typically caused by a badly designed interface between components of the system which will need to be substantially or entirely rethought and rebuilt to address, which isn’t easy to do with a running system. If you find yourself in this situation and you aren’t sure you have met your business goals you should rethink your approach immediately. Because no amount of money on hardware can get you out of this problem, caution is the word of the day.
Tags: Infrastructure, IT Management, performance, Project Management, Scalability, Technology Selection
Posted in Management, Software Development
The Wire Never Lies
Written by Kendall Miller on February 25, 2008 – 12:59 am
You need to find and resolve a problem with your web or multi-tier application, and you need to do it quickly. It may be happening in production or in a place where you can’t easily set up a test environment or get a traditional debugger involved. Here’s an approach that will help you narrow down and in many cases resolve the issue. The best part is that in most cases it won’t require specialized knowledge of the language the application is written in.
Don’t be afraid to pick up a packet sniffer and look at the actual Ethernet packets running back and forth between the parts of your system. You’ll probably find the issue much more quickly than you think, and you can do this with an application in production without the original source code, at least enough to know what your options are. The wire never lies – it tells you exactly what your application is really doing over the network.
For the purposes of this article, consider a basic web application. It most likely has a set of code that runs on the web server (which could be in any language) and then talks with a back-end database, probably located on a different system if this is a large web application. Now take two common categories of problems: A performance issue and an occasional web site error.
Our basic approach is consistently:
- Find the layer of the architecture where the problem is being introduced by tracing the network
- Dissect what is happening in that layer down to the process that is introducing the problem.
- Review the implementation of just the affected commands in the suspect process to resolve the issue.
Our first goal is to narrow down what layer of the system is the most likely culprit – the web application or the database. When doing this, I’ve found that it pays to quickly pull out a tool that will tell me what’s going on across the network. This is where the wire never lies comes from: If you use a packet sniffer or some other tool to see what’s happening “on the wire”, you will know exactly what is going on between your network layers. Not what you think should happen or want to happen – what is actually happening. This is so important because we develop in a world with many layers of abstraction between what we write and the physical I/O commands that ultimately carry out our wishes.
Let’s start with an example of a performance problem: a user viewing a detail page in your web application finds that it takes several seconds to display, and they believe it is getting slower over time.
Find the problem layer
In our example, we have several possibilities: The user establishes a connection from their web browser to the web server which in turn makes database calls to the database server. If clustering is involved it is somewhat more complicated because with a cluster it likely goes web browser to load balancing appliance to web server to database. Regardless, our first goal is to narrow down which layer of the architecture is introducing the problem.
In the case of a performance problem, the layer that introduces the problem is the first layer that is taking up the majority of the time and not waiting on another layer. The quickest way to resolve this is to do some strategic network sniffing at key points in your infrastructure to watch the request be processed. This may not seem quick, but with practice it becomes very natural.
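That rule can be expressed in a few lines once you have rough timings for each layer. The sketch below is only an illustration under stated assumptions: the `LayerTiming` structure, the idea of subtracting downstream wait from total time, and the sample numbers are all mine, and you would fill in the timings by reading them off a trace.

```python
from typing import List, NamedTuple, Optional


class LayerTiming(NamedTuple):
    name: str
    total_ms: float       # request arriving at this layer until its response leaves
    downstream_ms: float  # part of total_ms spent waiting on the next layer


def problem_layer(layers: List[LayerTiming]) -> Optional[str]:
    """Return the first layer whose own time is the majority of the request.

    `layers` is ordered from the layer closest to the user inward.
    """
    if not layers:
        return None
    end_to_end = layers[0].total_ms
    for layer in layers:
        own_ms = layer.total_ms - layer.downstream_ms
        if own_ms > end_to_end / 2:
            return layer.name
    return None  # no single layer dominates


# Made-up timings: the web server spends most of its time waiting on the database.
timings = [
    LayerTiming("web server", total_ms=4000, downstream_ms=3600),
    LayerTiming("database", total_ms=3600, downstream_ms=0),
]
print(problem_layer(timings))  # database
```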
A good place to start is on the web server. In many cases sniffing the traffic at the web server alone is sufficient to find the entire problem because it sees the traffic to & from the web browser and upstream to the database server. You can use a variety of tools to do this, but I like Wireshark. It’s free, fast, and very capable. Microsoft also ships a basic network monitor, but it doesn’t have some of the neat-o features Wireshark has that make analysis quick. Until recently, Wireshark was called “Ethereal” but that name had to be changed due to trademark problems.
What we’re looking for is to compare the traffic to & from the web browser and what’s traveling off of the web server. We want to compare timings and volumes to understand what happens between when the web request starts and when it completes. Do a complete packet capture of one problem web request, then get ready to spend some time understanding it.
The first thing you’ll likely notice is that there is a great deal more information here than you likely expected. Even a simple HTTP GET request results in a lot more network traffic than you might expect. If your site uses SSL, you’ll also discover that in fact the traffic to and from your web browser is encrypted – remember, we’re looking at what’s going on at Layer 2 of the network, so this is a good thing. If you’re using encryption within your own data center from the web server to the database server this is going to really get in your way (and you should ask yourself why you’re doing that as a general practice, but that’s another article). If your web site uses content compression the response will also look encrypted.
When analyzing a trace, do the following:
- Eliminate spurious client traffic: Filter out requests that aren’t from your test client. If they are part of the problem, it will generally still show up in calls the web server is making to the database or other systems, and you don’t need the volume.
- Narrow down the time window: You probably started the trace a few seconds before your hit, and ended it a few seconds after. Look for the first packet from your client’s IP address and eliminate everything before it. Likewise, look for the last packet to your client’s IP address and eliminate everything after.
- Look at timing: You want to survey the sequence of events to get a feel for what happened exactly in order. Your primary concern is going to be traffic you know could be related (such as to your SQL server) but don’t ignore authentication traffic, it can be a secret performance killer (time spent negotiating security between your web application and another server). Time spent on other servers will show as a quiet spot in the sequence – where a request has been sent off but the response hasn’t come back yet. Note that you need to be reading the timestamps to get a good feel for this; a lot of packets isn’t necessarily a bad thing – networks are very fast in general. If all the packets are happening in the space of 20 milliseconds, it isn’t your performance problem.
- Look at the volume: A quick way to get a feel for this is to use the ability for the packet sniffer to reassemble packets into a stream. This shows you the true conversation that is going on between the layers, and will show you how many bytes were moved to get it done. This is very helpful if you discover, for example, that you’re passing back very large recordsets you didn’t expect. Alternately, it could be that the data is simply inefficiently stored or packaged. For example, if a column in the database is configured for Unicode and the caller requests it in that format, it will take twice as much data across the wire to move it. XML data in the database can also get you in trouble by causing unexpectedly high volume.
- Look at the detail: If the problem isn’t apparent yet, look at the specific requests being made. For example, you may notice repeated requests that may indicate an error/retry cycle in the application.
It’s worth pointing out that a network volume problem like you would find in step four above will not generally show up if you’re looking at the network interface statistics in your monitoring system because it will only last a few seconds. However, it can still be the culprit.
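If you export the packet summary from your sniffer, the time window, timing, and volume steps above are easy to automate. The following is a minimal sketch, not a Wireshark feature: the `Packet` record, the function names, the IP addresses, and the sample data are all hypothetical, and it assumes you have already reduced each packet to a timestamp, source, destination, and length.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Packet(NamedTuple):
    time: float  # seconds since the capture started
    src: str
    dst: str
    length: int  # bytes on the wire


def trim_to_client(packets: List[Packet], client_ip: str) -> List[Packet]:
    """Keep the first packet from the client through the last packet to it."""
    from_client = [i for i, p in enumerate(packets) if p.src == client_ip]
    to_client = [i for i, p in enumerate(packets) if p.dst == client_ip]
    if not from_client or not to_client:
        return []
    return packets[from_client[0]:to_client[-1] + 1]


def quiet_spots(packets: List[Packet], min_gap: float = 0.1):
    """Pairs of consecutive packets separated by at least min_gap seconds."""
    return [(a, b, b.time - a.time)
            for a, b in zip(packets, packets[1:])
            if b.time - a.time >= min_gap]


def bytes_per_conversation(packets: List[Packet]) -> Dict[Tuple[str, str], int]:
    """Total bytes moved between each pair of hosts, in either direction."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for p in packets:
        totals[tuple(sorted((p.src, p.dst)))] += p.length
    return dict(totals)


CLIENT, WEB, DB = "10.0.0.5", "10.0.0.10", "10.0.0.20"  # made-up addresses
capture = [
    Packet(0.000, "10.0.0.99", WEB, 60),   # unrelated traffic before the hit
    Packet(0.010, CLIENT, WEB, 600),       # browser request
    Packet(0.012, WEB, DB, 300),           # query sent to the database
    Packet(2.512, DB, WEB, 48000),         # result comes back 2.5 seconds later
    Packet(2.530, WEB, CLIENT, 9000),      # page returned to the browser
    Packet(3.000, WEB, "10.0.0.99", 60),   # unrelated traffic after the hit
]

window = trim_to_client(capture, CLIENT)
for before, after, gap in quiet_spots(window):
    print(f"{gap:.3f}s of silence waiting on {before.dst}")
print(bytes_per_conversation(window))
```

In this made-up capture the quiet spot sits between the query going out and the result coming back, and the database conversation carries far more bytes than the page itself, which points at the database call rather than the web tier.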
Alternatives to Packet Sniffing
Experience with a packet sniffer is handy because it always works, regardless of the application’s specific technology. Unfortunately, like any generic tool that also means it can’t take advantage of a lot of domain knowledge. If you have a good reason to suspect you know what layer the problem is in or don’t feel comfortable jumping down to the wire right away, you can take advantage of a few other tools in specific circumstances.
Web Server Traffic Logs
If you have the optional extra information being captured into the web server logs beyond the original NCSA spec it should include the time it takes to transfer the data to the client and the number of bytes transferred. This should be enough to either validate or exclude the link between the web browser and the server. You’re looking in the logs for just a few things:
- Repeated Request Patterns: Web browsers try really hard to not fail. They will automatically and quietly respond to redirect requests and some other HTTP status codes and attempt to authenticate before throwing in the towel. This will show up in the logs as a pattern of hits in rapid succession from the same client IP address. You may have a situation where a client is being sent through several redirects, or is getting a retryable HTTP error on the first hit.
- Response Time: Look at the number of bytes transferred and the time to transfer to the client as well as the total request time. Compare the time to transfer with the total request time to exclude the client to web server link as being the performance problem.
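Both checks are simple to script against a log file. The sketch below assumes a simplified, hypothetical log layout of six whitespace-separated fields (epoch seconds, client IP, status, bytes sent, time taken in milliseconds, path). Real web server log formats differ and are configurable, so the parsing would need to be adapted to whatever your server actually writes.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    time: float
    client: str
    status: int
    bytes_sent: int
    time_taken_ms: int
    path: str


def parse_line(line: str) -> Hit:
    """Parse one line of the assumed six-field format."""
    when, client, status, size, taken, path = line.split()
    return Hit(float(when), client, int(status), int(size), int(taken), path)


def rapid_bursts(hits: List[Hit], window_s: float = 1.0,
                 min_hits: int = 3) -> Dict[str, List[Hit]]:
    """Clients that made at least min_hits requests within window_s seconds."""
    by_client: Dict[str, List[Hit]] = defaultdict(list)
    for hit in hits:
        by_client[hit.client].append(hit)
    bursts = {}
    for client, client_hits in by_client.items():
        client_hits.sort(key=lambda h: h.time)
        for i in range(len(client_hits) - min_hits + 1):
            run = client_hits[i:i + min_hits]
            if run[-1].time - run[0].time <= window_s:
                bursts[client] = run  # report the first burst found
                break
    return bursts


# Made-up log lines: two redirects followed by a slow page.
sample_log = """\
1000.00 10.0.0.5 302 0 12 /detail
1000.05 10.0.0.5 302 0 10 /login
1000.11 10.0.0.5 200 9000 3400 /detail?id=7
1004.00 10.0.0.9 200 500 20 /home
"""

hits = [parse_line(line) for line in sample_log.splitlines()]
bursts = rapid_bursts(hits)
for client, run in bursts.items():
    print(client, [h.status for h in run])
slowest = max(hits, key=lambda h: h.time_taken_ms)
print(slowest.path, slowest.time_taken_ms)
```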
SQL Server Profiler
An alternate approach if your application is a heavy SQL application is to use the SQL Profiler to get a nicer view of what is happening at the SQL level. This is worth it if you have high confidence that the problem is going to be in evidence by inspecting the SQL commands executed by your software. If you aren’t sure, start with a network trace anyway because you can establish some degree of confidence quickly whether or not it’s a lower-level problem.
Side note for Developers
If you’re writing code that makes calls to the database, it’s worth it to run through your main use cases and use SQL Profiler to verify what is happening at the database level. I guarantee it’ll be an eye opener. In particular, watch for events that don’t necessarily cause your code to break but are signs things aren’t entirely right. For example:
- Excessive database connect/disconnect: You’d be surprised how expensive this can be. From a pure performance standpoint, you ideally want to see it reuse a pooled connection, make all of its calls for that request, and then be done. If you see a lot of pooled connection resets or, even worse, real database connect and disconnect events, this should be investigated.
- Database deadlocks: Many developers automatically retry database exceptions to handle the wide range of use cases where a temporary issue (such as a missing or unusable database connection) occurs. This can also generally recover from deadlocks, but deadlocks are a performance killer. You should investigate them every time.
- Unexpected calls: You should have a mental picture of what database calls are going to be made and how many rows should be returned from the queries (at least approximately). If you can optimize your code to reduce the number of calls, it’s most likely worth it. Each call will add linear time to your application which will tend to create performance issues. You can’t beat the speed of light.
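To see why the number of calls matters so much, it helps to put rough numbers on it. This is a back-of-the-envelope sketch: the function name, the half-millisecond round trip, and the 200-row page are assumptions chosen for illustration, not measurements.

```python
def time_in_calls_ms(calls: int, round_trip_ms: float,
                     server_ms_per_call: float = 0.0) -> float:
    """Time spent on sequential database calls, one after another."""
    return calls * (round_trip_ms + server_ms_per_call)


# A detail page that loads a header row, then 200 child rows one query at a time.
chatty = time_in_calls_ms(calls=201, round_trip_ms=0.5)
# The same page loading the header and all child rows in two queries.
batched = time_in_calls_ms(calls=2, round_trip_ms=0.5)

print(chatty)   # 100.5
print(batched)  # 1.0
```

The server work is left out here on purpose. Even with an infinitely fast database, the chatty version spends a tenth of a second doing nothing but waiting on the wire.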
Side note for DB Administrators
If configured correctly, the profiler has a very low load on the SQL Server. In the past, my IT team has gotten value out of periodically setting up a trace to run for a day in production looking for particularly problematic events like database deadlocks.
You have monitoring, right?
You can often eliminate a lot of possibilities if you have routine system metrics being captured. Check out our list of recommended metrics to capture. If you’ve done this, you can out of the box eliminate problems such as network interface saturation, processor, or memory utilization on the different servers and appliances in the system. If you’re having a performance problem and have one of these issues, you’ll want to start there. In the case of a network problem you will want to either do a network capture to see what is using the bandwidth or, if you have it, take advantage of sFlow or NetFlow to get to the same results without having to look at the packets.
Installing Wireshark
Wireshark works by installing a packet capture service and using it to intercept all of the traffic as it’s coming off the wire and going on. It uses a separate service on Windows to do this – WinPcap. Because it installs as a system driver, you may legitimately be uncomfortable installing it on a production system that’s live. In my past experience, we’ve often left Ethereal installed locally on one node in each web cluster (and indeed in several cases just integrated it into our standard server installation) with no ill effects.
If you don’t want to run Wireshark & WinPcap on your production server, there is another option but it’s somewhat tricky to set up: Most commercial network switches support port mirroring that will allow you to configure one switch port to get all of the traffic received & sent by another port. You can use this then to set up an interface on your test system that is mirroring the network port used by the server you want to monitor. There are several downsides with this in my eyes: First, you’re changing the configuration of your production switches and possibly moving cables around (there are often restrictions on using port mirroring across switches, so depending on your physical hardware you will usually have to plug into a free port on the same switch as the server you want to monitor) which in many ways invites more human error potential than leaving another system driver installed. Second, if you forget to change the configuration back when you’re done and someone else plugs into that switch port they’re in for a surprise.
Credit where it’s due
I have to give Ingo Hammer credit for introducing the phrase the wire never lies in a presentation he gave at Tech Ed 2005. His presentation led me to have my team reorder our troubleshooting process to take what had been a late step and move it way up in the process – using a packet sniffer to see what was going on at the physical network level when troubleshooting system problems. Much of this article is an elaboration on what he talked about with my team’s experience added in. I can’t find a copy of his original presentation online, but if someone knows where it is posted (legally) let me know.
Tags: Ethereal, Packet sniffer, performance, SQL Server Profiler, wire never lies, Wireshark
Posted in Monitoring