Aviate, Navigate, Communicate

Written by Kendall Miller on March 27, 2008 – 12:29 am

If you’re involved in IT operations or even in business long enough, you’re going to experience some emergencies. During these emergencies, you’re going to have to balance several conflicting things that will demand your attention simultaneously:

  1. Cause of the problem: What is really happening? What device is at the root of the problem? (The network switch died because an admin configured a loop in the fabric and misconfigured the port.)
  2. Scope of the problem: Just how bad is it? Problems usually show up in one place (users can’t access Exchange) but those symptoms often represent a larger problem (network switch died)
  3. Communicate with users: First, people will be coming in the door to report the problem (do you know that Exchange is down?) and will be expecting updates on what’s going on and when it’ll be resolved (I really need to tell my friend about a party tonight, when will email be back up?)

Even in a shop with healthy staffing, this can be a lot to handle at once, particularly because your impulse is going to be to bounce between the root cause and communication. The first because it’s the real high-value item: fix the problem. The last because whenever someone walks in, you’ll want to tell them what’s going on. The higher up the chain of command they are, the better you’ll want it to sound.

Whenever I’m wondering how to look at an IT Operations problem from a different perspective to gain insight, aviation is the first place I go. Think about the modern air transport system in the United States not from your usual perspective (a passenger on a plane) but from the standpoint of the people that live within it and operate it. For example, the life of a flight deck crew isn’t that different than system support in the sense that you have long periods of routine punctuated by periods of high stress activity. A classic rule taught to pilots when they’re first being trained is Aviate, Navigate, and Communicate – in that order.

  1. First, fly the plane. (Be in the middle of the air, not the bottom)
  2. Figure out where you are. (Over the White House)
  3. Then communicate. (Sorry Tower, would you like us to land?)

To make things easier on commercial planes, the pilot and co-pilot divide these responsibilities by clearly designating one as the Pilot Flying and the other (called the Pilot Not Flying or Pilot Monitoring) as responsible for navigation and communication. This is practiced carefully during training, with different parts of each emergency checklist assigned to either the Pilot Flying or the Pilot Monitoring.

Now apply this back to a system problem:

  1. Create Clear Roles: Have your team know who is going to take on the role of Admin Flying and who is Admin Monitoring. This shouldn’t always be the same person; it may be based simply on rotation (who is “up”), on who gets the trouble ticket, or on whatever works within your shop. Each person should declare their role at the start of an incident so everyone knows who is doing what.
  2. Perform in Order: If you have an Admin Monitoring, it’s their role to intercept external communication while the Admin Flying is working on the problem.
  3. Make a Checklist: An emergency isn’t the time to be winging it. During quiet moments, talk as a team about what you would do in a hypothetical situation and work to distill out a basic checklist of things you’re going to run through. Focus on having it be the shortest list that verifies the largest set of items. When a problem shows up, use the checklist.

Problem Checklists

There are a few great advantages to using a checklist for problems:

  • Reduce Solution Focus: When diagnosing a problem, the general process is to propose a theory and then test it to either prove or disprove it. This creates cycles where you come up with a theory you have to believe in, and then your job is to prove yourself wrong. It turns out that people naturally bias towards information that proves them right and away from information that’s inconsistent with their diagnosis. Checklists for diagnostics can ensure that a significant breadth of information is available at the start of this process, so the best theories can be created quickly.
  • Creates a Pace: It’s easy to get caught up in an emergency and start working at a pace that really isn’t necessary, but degrades your accuracy and effectiveness. Checklists stop the emotional cycle that reinforces the early stages of emergencies and instead create a steadily paced environment of gathering and verifying facts.
  • Establish a Baseline for Improvement: One of the most important parts of any emergency, and the least frequently used effectively, is an after action review. After you’re back up and everyone has calmed down, you want to learn as much as you can from what happened. The existence of a checklist creates a baseline for systematic (as opposed to random or by chance) improvement to your team’s ability to handle future problems. This is true even if the checklist wasn’t used; the fact that it wasn’t used is itself an indictment of either the checklist or the team’s training.

While initially it may feel corny or even overly dramatic or bureaucratic to create checklists, there is real evidence to back up using them in environments where the downside cost (crash and death) is very steep. And if pressed to admit it, most engineers will confess they have a mental checklist they use for standard problems.

Plans are Useless, Planning is Priceless.

Just by creating the checklists (even if they were never used) your team can get a lot of value:

  • Cooperative learning: This is a great tool for the team to learn from each other. Each admin will share their best tips and tricks from their mental checklist and be surprised that they don’t line up. Where they don’t, the discussion on which approach is better and why is gold. It’s hard to get the same result with a contrived exercise, so use this opportunity to build the checklist and maintain it as a team.
  • Clarifies Automation: Creating the checklist will naturally precipitate ideas for how to automatically verify, and possibly resolve, steps in the checklist itself. For example, if a step in the checklist is to verify Internet connectivity, how are you going to accomplish that? Instead of relying on an ad-hoc mechanism, can an automated mechanism be put in place so that you can quickly check that data point without variation?
  • Encourages Collaboration: If the team collaborates to create the checklist, when a problem occurs they will be more likely to collaborate on resolving the problem because they already have had the experience of working together as a team. This will tend to replace individual ego with group esprit de corps.
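
As a minimal sketch of what automating a single checklist step might look like, here is the Internet connectivity example written in Python. Everything named here (can_connect, Step, run_checklist, and the choice of a plain TCP connection as the test) is an assumption made for illustration; your shop may define "connectivity" quite differently.

```python
import socket
from dataclasses import dataclass
from typing import Callable, List, Tuple


def can_connect(host: str, port: int, timeout: float = 3.0,
                connect: Callable = socket.create_connection) -> bool:
    """Return True if a TCP connection to (host, port) opens within timeout.

    The connect argument exists only so the check can be exercised
    without touching a real network (hypothetical design choice).
    """
    try:
        conn = connect((host, port), timeout)
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False
    conn.close()
    return True


@dataclass
class Step:
    """One checklist item: a human-readable name plus an automated check."""
    name: str
    check: Callable[[], bool]


def run_checklist(steps: List[Step]) -> List[Tuple[str, bool]]:
    """Run every step, even after a failure, so all the facts are gathered."""
    return [(step.name, step.check()) for step in steps]
```

A step could then be declared as Step("Internet reachable", lambda: can_connect("example.com", 443)), where the host and port are placeholders for whatever your team agrees is a fair test. Running every step rather than stopping at the first failure is deliberate: the point of the checklist is to have a broad set of facts in hand before anyone commits to a theory.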

An Exercise Left to the Interested Student

A friend of mine also pointed out the principle that if you have a checklist that always ends in the same action, why not automate the action in response to the checklist? In other words, if you can automate the detection steps that lead up to the action, then find a way to automate the resolution. You will often find you get here in inches: You progressively improve your monitoring so that you can find problems faster. Once this is reliable, you start just hooking up alarms to the monitoring so you don’t wait for a call from a real user or a higher level system. Once that’s working well enough, you get tired of performing the resolution manually so you write a script that takes a few arguments to perform the resolution. Now, just connect them together.
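
The last step, connecting them together, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: respond, the symptom checks, and the resolution callable are hypothetical names, and a real setup would want logging and some limit on how often the resolution is allowed to fire.

```python
from typing import Callable, Sequence


def respond(symptoms: Sequence[Callable[[], bool]],
            resolution: Callable[[], None]) -> bool:
    """Run the resolution only when every symptom check reports the problem.

    Each symptom is an automated detection step that returns True when it
    sees the problem. Returns True if the resolution was run.
    """
    # An empty symptom list never triggers the resolution.
    if symptoms and all(symptom() for symptom in symptoms):
        resolution()
        return True
    return False
```

Requiring every symptom to agree before acting mirrors the manual checklist: the action is only automatic when the same evidence that would convince an admin is present.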

Move Forward One Step Today

The best part about this is that you can get there in small steps that even the busiest team can fit into their schedule with a confidence that they will pay back in time saved in the future. With practice, it will become second nature and make it easier for your team to accommodate new processes and service requirements with ease. In the end, isn’t that what you need to ensure your team is viewed as a vital part of your organization?

Posted in Management, Monitoring