Continuous On-line Condition Monitoring Is Best Way To Manage Data Center Maintenance

By Terry L. Rodgers  
OTHER PARTS OF THIS ARTICLE
Pt. 1: Data Center Availability, Reliability Hinge On Numerous Factors
Pt. 2: Data Center "Black Boxes" Can Complicate Operations and Maintenance
Pt. 3: This Page

The gold-plated best practice is continuous on-line condition monitoring. For filters, install a differential pressure transducer, monitor it with the building management system (BMS), and assign an appropriate alarm threshold. For vibration, install permanent accelerometers and multiplex them to a dedicated server capable of vibration analysis, archiving, and trending, with the ability to provide remote alarms. Continuous on-line vibration analysis would be applied selectively and judiciously, to only the most vital equipment and processes. The obvious compromise is to determine the optimal frequency at which on-line condition monitoring is performed.
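As a rough illustration of the filter example, the sketch below checks a series of differential pressure readings against an alarm threshold. In practice this logic would normally be configured in the BMS itself; the class name, the units (inches of water column), the setpoint, and the rule of requiring several consecutive high readings are all assumptions made for the example, not values from any particular system.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class FilterAlarm:
    """Hypothetical dirty-filter alarm on differential pressure readings."""
    threshold_in_wc: float   # assumed alarm setpoint, inches of water column
    consecutive: int = 3     # assumed number of high readings in a row before alarming

    def first_alarm_index(self, readings: Sequence[float]) -> Optional[int]:
        """Return the index of the reading at which the alarm trips, or None."""
        run = 0
        for i, dp in enumerate(readings):
            # Count readings above the setpoint; any normal reading resets the count.
            run = run + 1 if dp > self.threshold_in_wc else 0
            if run >= self.consecutive:
                return i
        return None


# Example with made-up readings: a brief spike is ignored, a sustained rise alarms.
alarm = FilterAlarm(threshold_in_wc=1.0, consecutive=3)
tripped_at = alarm.first_alarm_index([0.5, 1.1, 1.2, 0.9, 1.1, 1.2, 1.3])
```

Requiring several consecutive high readings is one simple way to avoid nuisance alarms from momentary spikes; the appropriate threshold and delay depend on the filter and air handler in question.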

Operating Staff

Although the data center industry has been successful at "hardening" facilities and physical infrastructure, it has not done as well with the associated operating staff and facility management aspects. It is widely recognized that the vast majority of critical facility problems can now be attributed to human error (some sources claim as high as 70 to 80 percent of problems). There has been a direct correlation between the increase in infrastructure complexity and the increase in human error by the operating staff.

The problem here is not one of availability. Most critical facilities have staff on-site continuously (100 percent availability). The problem is in staff reliability (and in some cases validity). Unlike computers, people get tired, distracted, sick, confused, etc., which can all lead to unreliable performance.

The answer is to have processes that produce reliable results. Detailed, step-by-step procedures are a good example, but unless they are followed correctly each time, there is no guarantee of a reliable outcome. On the other hand, if operating staff are required to initial a checklist as each step is completed, and to have the action witnessed by a separate participant, the reliability of the process improves greatly. If the procedure also describes the expected outcome or result associated with each step, such as the expected pressure and flow when starting a pump, or the expected indicating lights and annunciations when closing a breaker, then the validity of the process is ensured. Operating staff who blindly follow a procedure without equal attention to the results will inevitably produce unintended outcomes.
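The checklist idea described above can be sketched as a small data structure: each step records who performed it, who witnessed it, and whether the observed result matched the expected one. This is only an illustrative sketch; the class, field names, and sample values are hypothetical and not drawn from any specific procedure or software.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    """One procedure step with its expected result (all names are illustrative)."""
    action: str
    expected: str
    performed_by: Optional[str] = None
    witnessed_by: Optional[str] = None
    observed: Optional[str] = None

    def sign_off(self, performer: str, witness: str, observed: str) -> None:
        # The witness must be a separate participant from the performer.
        if performer == witness:
            raise ValueError("witness must be a separate participant")
        self.performed_by = performer
        self.witnessed_by = witness
        self.observed = observed

    @property
    def ok(self) -> bool:
        # A step passes only if it was signed off and the result matched.
        return self.performed_by is not None and self.observed == self.expected


def first_problem(steps: List[Step]) -> Optional[int]:
    """Return the index of the first unsigned or mismatched step, else None."""
    for i, step in enumerate(steps):
        if not step.ok:
            return i
    return None


# Example with made-up values: stop at the first step that is not confirmed.
procedure = [
    Step("Start pump", expected="pressure and flow in range"),
    Step("Close breaker", expected="closed indicator lit"),
]
procedure[0].sign_off("TR", "JS", observed="pressure and flow in range")
next_open_step = first_problem(procedure)
```

Comparing the observed result with the expected one at every step is what catches the case where a procedure is followed to the letter but the equipment does not respond as intended.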

Performance-based training is also a process. Training a new computer or controller is simple and quick. Download the programming, connect to the network, and the new computer is 100 percent as capable as the failed computer or controller that it replaced. Training new staff isn't quite as easy. People are individuals. Each of us is one of a kind.

Operating staff who attend a sequential set of classes that teach a base level of skills and knowledge will perform more reliably than untrained staff. Training needs to be validated as well through quizzes, tests, and qualification exams that require students to demonstrate comprehension and ability to perform. The best practice is to have training culminate in staff being tested and certified before allowing them to perform their duties and responsibilities unsupervised.

Terry L. Rodgers, CPE, CPMP, is vice president, sustainable operations services, Primary Integration Solutions, Inc.


Personnel, Planning and Preparedness

Keeping the machines running is important for data center reliability, but the people running the machines matter, too. (For more on the human element, see page 40.)

  • Staff performance improves with site-specific training.
  • Proactive planning and preparedness are more reliable than reacting when caught by surprise.
  • Highly motivated staff perform better than "just get me through the day so I can go home" staff.
  • Safety training and safety audits reduce accidents and injuries (which can also result in outages and damaged equipment).
  • Over-communicating is better than under-communicating, but accurate communication is critical either way.
  • Random spot-checks, surprise inspections, and unscheduled audits reveal more than planned inspections and audits.

— Terry L. Rodgers


  posted on 8/14/2013