4 FM quick reads on critical facilities
1. Pay Close Attention To Staffing Challenges In Critical Facilities
One key consideration for reliable operations in critical facilities is staffing. Although the data center industry has been successful at "hardening" facilities and physical infrastructure, it has not done as well with the associated operating staff and facility management aspects.
It is widely recognized that the vast majority of critical facility problems can now be attributed to human error (some sources claim as high as 70 to 80 percent of problems). There has been a direct correlation between the increase in infrastructure complexity and the increase in human error by the operating staff.
The problem here is not one of availability. Most critical facilities have staff on-site continuously (100 percent availability). The problem is in staff reliability (and in some cases validity). Unlike computers, people get tired, distracted, sick, confused, etc., which can all lead to unreliable performance.
The answer is to have processes that produce reliable results. Detailed, step-by-step procedures are a good example, but unless they are followed correctly each time there is no guarantee of a reliable outcome. On the other hand, if operating staff is required to initial a checklist as each step is completed, and to have the action witnessed by a separate participant, the reliability of the process improves greatly. If the procedure also describes the expected outcome or result associated with each step, such as the expected pressure and flow when starting a pump, or the expected indicating lights and annunciations when closing a breaker, then the validity of the process is ensured. Operating staff who blindly follow a procedure without equal attention to the results will inevitably produce unintended outcomes.
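As a rough illustration of the checklist idea above, the following Python sketch models a procedure in which each step must be initialed by an operator, witnessed by a separate participant, and confirmed against the expected outcome written into the procedure. The class names, the sample pump steps, and the simple text comparison of observed versus expected results are all assumptions made for illustration; a real facility would define its own steps and acceptance criteria.

```python
from dataclasses import dataclass
from typing import List, Optional


class OutcomeMismatch(Exception):
    """Raised when the observed result differs from the expected one."""


@dataclass
class Step:
    action: str                      # what the operator does
    expected: str                    # expected outcome written into the procedure
    operator: Optional[str] = None   # operator initials, set at sign-off
    witness: Optional[str] = None    # witness initials, set at sign-off

    @property
    def signed(self) -> bool:
        return self.operator is not None and self.witness is not None


@dataclass
class Procedure:
    title: str
    steps: List[Step]

    def next_step(self) -> Optional[Step]:
        # Steps must be completed in order: return the first unsigned one.
        for step in self.steps:
            if not step.signed:
                return step
        return None

    def sign_off(self, operator: str, witness: str, observed: str) -> Step:
        step = self.next_step()
        if step is None:
            raise ValueError("procedure is already complete")
        if operator == witness:
            raise ValueError("witness must be a separate participant")
        # Illustrative check only: exact text match against the expected outcome.
        if observed != step.expected:
            raise OutcomeMismatch(
                f"{step.action}: expected '{step.expected}', observed '{observed}'"
            )
        step.operator, step.witness = operator, witness
        return step

    @property
    def complete(self) -> bool:
        return self.next_step() is None


# Hypothetical procedure and initials, for illustration only.
pump_start = Procedure(
    "Start pump (hypothetical example)",
    [
        Step("Open suction valve", "valve indicates open"),
        Step("Start pump", "pressure and flow in expected range"),
    ],
)
pump_start.sign_off("AB", "CD", "valve indicates open")
```

In this sketch a step cannot be signed off unless the observed result matches the expected one, so a mismatch stops the procedure at that step instead of letting the operator continue blindly.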
Performance-based training is also a process. Training a new computer or controller is simple and quick. Download the programming, connect to the network, and the new computer is 100 percent as capable as the failed computer or controller that it replaced. Training new staff isn't quite as easy. People are individuals. Each of us is one of a kind.
2. How To Limit Human Error In Critical Facilities
Over the last 15 years, most building operators have come to recognize that people account for the majority of interruptions to critical operations. Human error is identified as the root cause in 60 percent to 80 percent of data center downtime events, year after year.
Those who maintain the critical facility's infrastructure systems require written procedures to consistently carry out riskier activities such as system transfers, when system redundancy is reduced as equipment is brought off-line for maintenance or repair. Just as important are procedures for resolving emergency scenarios. A critical facility may require 150 to 200 documents to cover both of these categories, due to the number of infrastructure systems involved. This number seems high when compared to a non-critical facility's needs. However, compared with another critical endeavor, it is modest: roughly one-fifth the number of procedures required for operating a nuclear submarine.
In all cases, procedures need to be site-specific, as each facility's configuration is unique. One individual on the facilities staff must be assigned the role of procedures owner and be provided dedicated time each month to make continual progress with the program. Typically, the procedures owner is provided a contracted resource to get the program started.
Written processes are even more important in areas to which personnel from multiple departments have access. In a data center facility, the computer room is the most critical of these. Tasks performed there present the greatest risk of error, because multiple departments are involved and a higher frequency of human activity occurs within the room.
To reduce the high potential for error when multiple groups work together in one space, it is necessary to develop written mutual expectations between the departments involved. Some organizations refer to these as internal service level agreements. The documents can be as simple as one page, but must be endorsed by each department head and be consistently enforced.