Two Person Rule

Written by Kendall Miller on March 3, 2008 – 10:46 pm

Whenever you work on the components of a high-reliability system, remember that the biggest single cause of availability problems is people - generally through clicking the wrong thing, typing the wrong instruction, or not seeing the consequences of an action. A good procedure to minimize the risk of unintended harm while working on an important system (whether it’s clustered or not) is to have two people involved in the physical work. It’s the IT Operations equivalent of pair programming. For example, if you are taking a cluster node offline, you want to be sure you take the right one offline. Even in aviation, where there are good procedures to avoid mistakes like this, they still happen and can cost lives. Your situation isn’t as dire, but the principle remains the same: when performing operations that can directly impair your availability, use an obvious two-person structure to make sure you do the right thing:

  1. Say what you’re going to do.
  2. Have the second person confirm that it’s the right thing and that you’re on the right system.
  3. Perform the action.
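
For scripted operations, the three steps above could be sketched as a small guard around the risky call. This is only an illustration, not a prescribed tool: the function name `run_with_second_person` and its parameters are hypothetical, and it assumes the second person confirms by typing the target's name themselves rather than just answering "yes".

```python
from typing import Callable


def run_with_second_person(
    action: str,
    target: str,
    confirm: Callable[[str], str],
    perform: Callable[[str], None],
) -> bool:
    """Run perform(target) only if the second person retypes the target name.

    All names here are hypothetical; `confirm` is any callable that shows a
    prompt to the second person and returns what they typed.
    """
    # Step 1: say what you're going to do.
    announcement = f"About to {action} on '{target}'."

    # Step 2: the second person confirms by typing the target name themselves.
    typed = confirm(announcement + " Second person, type the target name to confirm: ")
    if typed.strip() != target:
        print(f"Not confirmed (got '{typed.strip()}'); nothing was changed.")
        return False

    # Step 3: perform the action.
    perform(target)
    return True
```

Passing the built-in `input` as `confirm` makes the check interactive at a terminal, and `perform` would wrap whatever actually takes the node offline. Retyping the name is a deliberate choice in this sketch: it forces the second person to read the target instead of reflexively approving.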

It may feel pedantic, but it will keep you focused on what you’re doing and ensure you don’t have to explain why you deactivated the perfectly good node of the cluster. The principle works whenever you’re doing something that has the potential to impact your availability. It also provides good cross-training experience, with the less-experienced person driving and the more-experienced person looking ahead to the larger tasks. Unlike pair programming, it really isn’t necessary to switch roles through the process. Instead, consider it more like pilot and navigator: the navigator references checklists and procedures and verifies selections, while the pilot performs each action.


Posted in Infrastructure