Posts Tagged ‘two person rule’
First, Fly the Plane
Written by Kendall Miller on March 16, 2008 – 8:45 pm
I used to work with a former Navy A-6 pilot and instructor. One of his standard techniques for helping pilots deal with emergencies was to train them to take an immediate action when they noticed the problem - an action that had no consequence but would satisfy the need to do something. What he trained them to do was reset the built-in timer clock as soon as they noticed the problem. Ostensibly, this was to help them know later how long the problem had been going on, but its true purpose was to give them a single, standard action that satisfied the human need to do something; then they could take time to reflect on the problem.
Step two on the checklist was fly the plane. There have been several CFIT (controlled flight into terrain) accidents where pilots were too busy troubleshooting a problem to avoid the ground. The pilots forgot their first responsibility: put flying the plane ahead of any other activity.
When doing IT Operations, there’s a lot you can learn from aviation. I’ve seen several situations where technicians have caused much larger problems while troubleshooting small ones. This comes from the same mindset that has caused air crashes: you become so focused on the immediate problem that you are no longer aware of your environment. The longer you work at a problem, the more likely this is to happen.
A few team techniques you can use to help avoid this:
- The Two Person Rule: Have two technicians involved in the problem with one taking the immediate actions and the other taking a longer view.
- Separate Diagnostics from Remediation: Complete your non-invasive diagnostic activities before you attempt any remediation. This gives you a discrete point, before you start putting things at risk, to recheck your assumptions about dependencies and risks to other systems.
- Peer Review: Before approaching a problem, discuss your approach with two other people on your team (at the same time). If that approach isn’t successful or you need to deviate from it, reconvene the group to discuss again.
In many ways this is an extension of Don’t Taunt the Bear. When working on a problem during business hours (or, if you like, non-maintenance hours), before taking anything offline, even for a moment, ask yourself: Do I need to take this action right now? How sure am I that it won’t have any unexpected consequences? Is the risk that I’m wrong worth the benefit of doing this right now?
All of this may sound like it’s going to add time to problem resolution, and it might. Remember, however, that your first responsibility is to keep services flowing to your users. Most users will be unsympathetic if they lose access to their home directories because you were troubleshooting a problem with the printer in accounting and took down the services that also share their files.
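The "Separate Diagnostics from Remediation" technique above can be sketched in a few lines of Python. This is only an illustration of the idea, not a real tool: the function name `troubleshoot`, the `review` callback, and the example check names are all hypothetical. Read-only checks run first, the findings are reviewed at a discrete checkpoint, and invasive steps run only if that review approves them.

```python
from typing import Callable, Dict, List

Check = Callable[[], str]   # read-only diagnostic; returns a finding (hypothetical type)
Fix = Callable[[], None]    # invasive remediation step (hypothetical type)


def troubleshoot(
    diagnostics: Dict[str, Check],
    remediation: List[Fix],
    review: Callable[[Dict[str, str]], bool],
) -> Dict[str, str]:
    """Run every diagnostic, pause for review, then remediate only if approved."""
    # Phase 1: non-invasive checks only; nothing is put at risk here.
    findings = {name: check() for name, check in diagnostics.items()}

    # Checkpoint: recheck assumptions about dependencies and risks
    # before anything invasive happens.
    if review(findings):
        # Phase 2: remediation runs only after the review approves it.
        for fix in remediation:
            fix()
    return findings
```

In practice the `review` step would be a conversation with the second technician or the peer group rather than a function, but keeping it as an explicit gate makes the checkpoint hard to skip.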
Tags: CFIT, IT Operations, Troubleshooting, two person rule
Posted in Infrastructure
Two Person Rule
Written by Kendall Miller on March 3, 2008 – 10:46 pm
Whenever working on the components of a high reliability system, remember that the biggest single cause of availability problems is people - generally through clicking the wrong thing, typing the wrong instruction, or not seeing the consequences of an action. A good procedure to minimize the risk of unintended harm while working on an important system (whether it’s clustered or not) is to have two people involved in the physical work. It’s the IT Operations equivalent of pair programming. For example, if you are taking a cluster node offline, you want to be sure you take the right one offline. Even in aviation, where there are good procedures to avoid mistakes like this, it still happens and can cost lives. Your situation isn’t as dire, but the principle remains the same: when performing operations that can directly impair your availability, use an obvious two person structure to make sure you do the right thing:
- Say what you’re going to do.
- Have the second person confirm that it’s the right thing and you’re on the right one.
- Perform the action.
It may feel pedantic, but it will keep you focused on what you’re doing and ensure you don’t have to explain why you deactivated the perfectly good node of the cluster. The principle works whenever you’re doing something that has the potential to impact your availability. It also provides good cross-training experience, with the less-experienced person driving and the more-experienced person looking ahead to the larger tasks. Unlike pair programming, it really isn’t necessary to switch roles through the process. Instead, consider it more like pilot and navigator, with the navigator referencing checklists and procedures and verifying selections, and the pilot performing each action.
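If some of this work is scripted, the say, confirm, perform sequence above can be built into the script itself. The following is a minimal sketch under that assumption; `PlannedAction`, `perform_with_confirmation`, and the node name in the usage note are hypothetical names made up for illustration, and the `confirm` callback stands in for the second person.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PlannedAction:
    description: str          # what the operator says they are about to do
    target: str               # the system it applies to, e.g. a cluster node name
    run: Callable[[], None]   # the action itself, executed only after confirmation


def perform_with_confirmation(
    action: PlannedAction,
    confirm: Callable[[str], bool],
    log: List[str],
) -> bool:
    """Announce the action, let the second person confirm it, then perform it."""
    # Step 1: say what you're going to do.
    statement = f"{action.description} on {action.target}"
    log.append(f"ANNOUNCE: {statement}")

    # Step 2: the second person confirms it's the right thing on the right target.
    if not confirm(statement):
        log.append(f"REJECTED: {statement}")
        return False
    log.append(f"CONFIRMED: {statement}")

    # Step 3: perform the action.
    action.run()
    log.append(f"DONE: {statement}")
    return True
```

A caller might build `PlannedAction("Take node offline", "node-b", take_offline)` and pass a `confirm` function that prompts the navigator at the console. The log doubles as a record of who agreed to what, which helps when you do have to explain an outage.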
Tags: two person rule
Posted in Infrastructure