First, Fly the Plane
Written by Kendall Miller on March 16, 2008 – 8:45 pm
I used to work with a former Navy A-6 pilot and instructor. One of his standard techniques for helping pilots deal with emergencies was to train them to take an immediate action when they noticed a problem: an action that had no consequences but satisfied the urge to do something. He trained them to reset the built-in timer clock as soon as they noticed the problem. Ostensibly, this let them know later how long the problem had been going on, but its true purpose was to give them a single, standard action that met the human need to act, so they could then take time to reflect on the problem.
Step two on the checklist was: fly the plane. There have been several CFIT (controlled flight into terrain) accidents where pilots were too busy troubleshooting a problem to avoid the ground. They forgot their first responsibility: flying the plane comes before any other activity.
When doing IT Operations, there’s a lot you can learn from aviation. I’ve seen several situations where technicians have caused much larger problems while troubleshooting small ones. This comes from the same mindset that caused air crashes: you become so focused on the immediate problem that you are no longer aware of your environment. The longer you work at a problem, the more likely this will happen.
A few team techniques you can use to help avoid this:
- The Two Person Rule: Have two technicians involved in the problem with one taking the immediate actions and the other taking a longer view.
- Separate Diagnostics from Remediation: Complete your non-invasive diagnostic activities before attempting any remediation. This gives you a discrete point, before you start putting things at risk, to recheck your assumptions about dependencies and risks to other systems.
- Peer Review: Before approaching a problem, discuss your approach with two other people on your team (at the same time). If that approach isn’t successful or you need to deviate from it, reconvene the group to discuss again.
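As a rough illustration of how the techniques above could be enforced in tooling rather than by memory alone, here is a minimal Python sketch. Everything in it (the `Incident` class, the `ReviewRequired` exception, the method names) is hypothetical and invented for this example; it is not taken from any real incident-management product. It assumes diagnostics are read-only callables and that a remediation should be refused until at least one finding and two distinct reviewers have been recorded.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class ReviewRequired(Exception):
    """Raised when remediation is attempted before the checkpoint is cleared."""


@dataclass
class Incident:
    description: str
    findings: List[str] = field(default_factory=list)
    reviewers: List[str] = field(default_factory=list)

    def diagnose(self, check: Callable[[], str]) -> None:
        # Diagnostics are assumed to be read-only; we only record what they report.
        self.findings.append(check())

    def approve(self, *reviewers: str) -> None:
        # Record the peers who discussed the plan together.
        self.reviewers = list(reviewers)

    def remediate(self, action: Callable[[], str], min_reviewers: int = 2) -> str:
        # The discrete checkpoint: no invasive action without findings and review.
        if not self.findings:
            raise ReviewRequired("run diagnostics before remediation")
        distinct = len(set(self.reviewers))
        if distinct < min_reviewers:
            raise ReviewRequired(f"need {min_reviewers} reviewers, have {distinct}")
        return action()
```

A technician would call `diagnose` as often as needed, then `approve("alice", "bob")` once the plan has been discussed, and only then `remediate`. If the plan changes, calling `approve` again with the reconvened group mirrors the advice in the Peer Review item.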
In many ways this is an extension of Don’t Taunt the Bear. When working on a problem during business hours (or, if you like, non-maintenance hours), before taking anything offline, even for a moment, ask yourself: Do I need to take this action right now? How sure am I that it won’t have any unexpected consequences? Is the risk that I’m wrong worth the benefit of doing this right now?
All of this may sound like it’s going to add time to problem resolution, and it might. Remember, however, that your first responsibility is to keep services flowing to your users. Most users will be unsympathetic if they lose access to their home directories because you were troubleshooting a problem with the printer in accounting and took down the services that also share their files.
Tags: CFIT, IT Operations, Troubleshooting, two person rule
Posted in Infrastructure