Eliminating Human Error in the Data Center
February 21, 2019 - Critical Facilities
By Andrew Salcido
If you have ever "liked" something on Facebook, updated your work experience on LinkedIn, or bought a book on Amazon, you've interacted with an application hosted in a data center. When you consider all the activities that you perform from your keyboard on a daily basis, it's obvious that data centers' ability to keep all things internet accessible at any time has set an expectation of operational availability similar to that of a light switch. We don't appreciate our level of dependence on their continuous operation until they stop operating. As a result, a whole lot of people have a vested interest in making sure that a data center never goes down.
Data center operators have found that the greatest threat to continuous operation is the same people responsible for ensuring that nothing breaks. In more than 70 percent of cases, the cause of a service-disrupting incident within a data center is human error. By working with our customers and analyzing service data, we were able to determine that there are two primary reasons that a technician hits the wrong button or neglects to perform a routine maintenance activity. The first is missing or insufficient training of on-site personnel. The second is the need to rely on instructional documentation that is not conveniently available or easy to follow.
In a typical data center environment, operator training is a hodgepodge of one-time instruction delivered by the vendors of individual site components, which gives technicians only a cursory understanding of the equipment's operations and maintenance requirements. The superficial nature of this vendor-driven mode of training results in a support staff with unequal levels of knowledge regarding the site's operational systems as a whole, and it offers no method for raising the level of mastery across all of the site's support engineers. To fill this void in effective technician training, we created the Compass Learning Management System (CLMS).
Our goal for the CLMS was to create an end-user-driven, web-based tool that would enable personnel of varying levels of experience to understand the support and maintenance operations at both the holistic systems level and the individual component level. The CLMS curriculum uses a hierarchical structure of modules that provides a base level of information for all "students" and enables even the most inexperienced technician to understand every aspect of the facility. Since the system is designed to allow each individual to proceed at his or her own pace, technicians are free to acquire increasing levels of mastery by completing subsequent modules that address more complex operational issues.
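One way to picture a hierarchical, self-paced curriculum is as a set of modules that each list the modules that must be finished first. The sketch below is a minimal illustration under that assumption and is not a description of how the CLMS is actually built; the module names, the `Module` class, and the `available_modules` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Module:
    """A training module; `prerequisites` names modules that must come first."""
    name: str
    prerequisites: Tuple[str, ...] = ()


def demo_catalog() -> Dict[str, Module]:
    # Hypothetical module names, invented purely for illustration.
    modules = [
        Module("facility-overview"),
        Module("electrical-systems", ("facility-overview",)),
        Module("cooling-systems", ("facility-overview",)),
        Module("generator-maintenance", ("electrical-systems",)),
    ]
    return {m.name: m for m in modules}


def available_modules(catalog: Dict[str, Module], completed: Set[str]) -> List[str]:
    """Return, in alphabetical order, the modules a technician can start now:
    not yet completed, with every prerequisite already completed."""
    return sorted(
        name
        for name, module in catalog.items()
        if name not in completed
        and all(p in completed for p in module.prerequisites)
    )
```

In this sketch a new technician is offered only the base module, and more specialized modules open up as earlier ones are completed, which is one plausible reading of a base level for everyone followed by increasing levels of mastery.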
The Japanese concept of Poka Yoke describes a philosophy under which the goal of documenting any operation is to ensure that it can be followed quickly and replicated reliably. A more caustic summary of the concept would be to say that all operational instructions should be "idiot proof." However it is explained, data center documentation typically isn't a product of Poka Yoke. In many instances, a technician attempting to troubleshoot a problem is forced to use one or more 3-inch-thick ring binders to diagnose and correct the issue. Since trying to restart a generator while holding an instructional manual the size of a phone book increases, rather than reduces, the prospect of human error, we worked with Icarus Ops, a company experienced in the creation of digital checklists, to create what we refer to as our Error Elimination System (EES).
Within the EES, every operation that a technician would be required to perform, from routine maintenance to problem identification and correction, has been converted into a step-by-step digital checklist. The operator must confirm that a step has been completed before moving forward to the next action. These lists also provide visual notification of potentially hazardous tasks and the precautions required to complete them safely. Every checklist is accessible via an Android device or handheld tablet, and also offers access to videos or other supporting information to aid in performing a specific operation.
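The confirm-before-advancing behavior described above can be sketched in a few lines. This is a minimal illustration, not the EES or any Icarus Ops product: the `Step` and `Checklist` classes are hypothetical, and the sample steps are placeholders rather than a real operating procedure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    instruction: str
    hazard: Optional[str] = None  # precaution text shown with a hazardous step
    confirmed: bool = False


@dataclass
class Checklist:
    title: str
    steps: List[Step] = field(default_factory=list)

    def current_index(self) -> Optional[int]:
        """Index of the first unconfirmed step, or None when all are confirmed."""
        for i, step in enumerate(self.steps):
            if not step.confirmed:
                return i
        return None

    def prompt(self) -> str:
        """Text for the current step, with any hazard warning placed first."""
        i = self.current_index()
        if i is None:
            return "Checklist complete."
        step = self.steps[i]
        warning = f"WARNING: {step.hazard}\n" if step.hazard else ""
        return f"{warning}Step {i + 1}: {step.instruction}"

    def confirm(self, index: int) -> None:
        """Mark a step done; only the current step may be confirmed."""
        if index != self.current_index():
            raise ValueError("Steps must be confirmed in order.")
        self.steps[index].confirmed = True


def demo_checklist() -> Checklist:
    # Placeholder steps for illustration only; not a real procedure.
    return Checklist("Generator restart (illustrative)", [
        Step("Verify the generator breaker is open.",
             hazard="Energized equipment; follow site safety precautions."),
        Step("Set the controller to manual start."),
        Step("Start the generator and confirm stable output."),
    ])
```

Because `confirm` rejects any index other than the current one, an operator using this sketch cannot skip ahead or confirm a step twice, and the hazard text always appears before the instruction it applies to.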
The critical nature of data centers in supporting our rapidly growing electronic economy has made their ability to operate without interruption more imperative than ever. Since the weakest link in a facility's operational chain is its support personnel, eliminating the potential for human-caused error is essential. In our experience, achieving this goal requires an integrated approach that combines user-driven education with easy access to step-by-step operating instructions. These educational and instructional elements are mutually reinforcing. Looking ahead, it is probably not hyperbolic to expect this marriage of instruction and wearable technology to be used across the building operations spectrum.
Andrew Salcido is vice president of operations for Compass Datacenters. Prior to joining Compass, he served as vice president of operations at T5 Data Centers, and director of operations for Vantage Data Centers and Digital Realty Trust.