Data Center "Black Boxes" Can Complicate Operations and Maintenance

By Terry L. Rodgers  
Other parts of this article: Pt. 1: Data Center Availability, Reliability Hinge On Numerous Factors | Pt. 2: This Page | Pt. 3: Continuous On-line Condition Monitoring Is Best Way To Manage Data Center Maintenance

The bad news is that computers, programmable logic controllers, device-specific controllers, etc., are essentially "black boxes," which can complicate data center operations and maintenance. They typically give no advance notice of impending failure, and when they do fail, the operating staff cannot make repairs or replacements. Instead, staff must call for vendor support and take manual control of the affected infrastructure. When that infrastructure is a single component or piece of equipment, manual intervention is probably fairly straightforward, but if the controller manages a central utility such as a central chilled water plant, the operating staff will probably have their hands full. Some critical facilities now have "mirrored-redundant" controllers with looped communication paths that address these control-related single points of failure.

The same combination of complexity and embedded computers is found in much of today's equipment. Just open the panels and cabinets of uninterruptible power supply (UPS) units and paralleling control cabinets, chiller control panels, paralleling switchgear, etc., and look inside. To most operating staff, this equipment has essentially become a set of black boxes as well. As the infrastructure has outpaced the staff's ability to troubleshoot and repair it, reliance on good maintenance practices has become even more crucial.

The basic purpose of maintenance is to increase the availability of the equipment (and systems) being maintained. At the bottom of the ladder is "corrective maintenance," or simply put, "fix it when it breaks." It takes the least effort from a management perspective, but results in the lowest availability and in most cases ends up costing the most in both total cost of ownership (TCO) and impact to operations.

The next rung up is preventive maintenance, where you (hopefully) follow the manufacturer's recommendations to inspect and care for the equipment to extend its life and optimize its performance. In this case, you accept some planned unavailability (shutdowns) in exchange for the opportunity to care for the equipment (check belts, change filters, torque connections, etc.). The result is increased lifespan, more reliable performance, and lower failure rates.

The best practice is to supplement a preventive maintenance program with predictive maintenance using on-line condition-monitoring technologies. The most common and valuable on-line condition-monitoring technologies are thermography (infrared scanning) and vibration analysis. These monitoring techniques not only provide incredible insight regarding the health of the equipment, but actually require the equipment to be in operation, so the need for outages is reduced. By trending the results over time, a facility manager can see the health of the equipment start the inevitable decline towards predefined thresholds and "predict" when the equipment condition or performance will be adversely affected. This allows for planned adjustments and repairs and significantly reduces unanticipated outages and impacts.
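The trending idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `predict_threshold_crossing`, the straight-line (least-squares) trend model, the sample vibration readings, and the 7.0 mm/s alarm threshold are all hypothetical and do not come from the article or from any particular monitoring product. Real degradation is often non-linear, so a production tool would likely use a more careful model.

```python
from typing import Optional, Sequence


def predict_threshold_crossing(
    times: Sequence[float],
    readings: Sequence[float],
    threshold: float,
) -> Optional[float]:
    """Fit a least-squares line to (time, reading) pairs and return the
    time at which that line reaches `threshold`.

    Returns None when the fitted trend is flat or falling, since the line
    then never rises to the threshold. A result earlier than the latest
    sample means the fitted trend has already crossed the threshold.
    """
    n = len(times)
    if n < 2 or n != len(readings):
        raise ValueError("need at least two samples, with matching lengths")

    mean_t = sum(times) / n
    mean_r = sum(readings) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("all samples share the same timestamp")
    sxy = sum((t - mean_t) * (r - mean_r) for t, r in zip(times, readings))

    slope = sxy / sxx
    intercept = mean_r - slope * mean_t
    if slope <= 0:
        return None
    return (threshold - intercept) / slope


# Hypothetical example: vibration velocity (mm/s) sampled every 30 days.
days = [0, 30, 60, 90]
vibration = [2.0, 2.5, 3.0, 3.5]
crossing_day = predict_threshold_crossing(days, vibration, threshold=7.0)
print(round(crossing_day))  # 300
```

With these made-up readings, the fitted trend reaches the assumed threshold around day 300, which is the kind of lead time that lets repairs be scheduled rather than forced.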

In some situations, predictive maintenance can also reduce the required level of preventive maintenance. A simple example is changing filters. Preventive maintenance would require filter changes based on time regardless of how dirty a filter actually is. This can lead to unnecessary filter changes when the filter still has useful life left. Worse, it can allow dirty filters to remain on-line and affect the performance and possibly life of the equipment, systems, or processes they are intended to protect. Predictive maintenance would monitor the loading of a filter by measuring the pressure drop and then initiate replacement based on actual loading.
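The filter example above can be made concrete with a short Python sketch that contrasts the two decision rules. Everything specific here is an assumption for illustration: the `FilterStatus` type, the 90-day change interval, and the 250 Pa pressure-drop limit are hypothetical values, not manufacturer recommendations.

```python
from dataclasses import dataclass


@dataclass
class FilterStatus:
    pressure_drop_pa: float  # measured pressure drop across the filter
    days_in_service: int     # days since the filter was installed


def due_by_schedule(status: FilterStatus, interval_days: int = 90) -> bool:
    """Preventive rule: replace on a fixed interval, regardless of loading."""
    return status.days_in_service >= interval_days


def due_by_condition(status: FilterStatus, max_drop_pa: float = 250.0) -> bool:
    """Predictive rule: replace once the measured pressure drop shows the
    filter is actually loaded."""
    return status.pressure_drop_pa >= max_drop_pa


# A lightly loaded filter that has reached its scheduled change date.
clean_but_old = FilterStatus(pressure_drop_pa=120.0, days_in_service=90)
# A heavily loaded filter that is well short of its scheduled change date.
dirty_but_new = FilterStatus(pressure_drop_pa=300.0, days_in_service=30)

for label, status in [("clean but old", clean_but_old),
                      ("dirty but new", dirty_but_new)]:
    print(label, due_by_schedule(status), due_by_condition(status))
```

The first filter shows the unnecessary change that a time-based schedule triggers, and the second shows the dirty filter that the same schedule would leave on-line.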

Another valuable aspect of on-line condition monitoring is the ability to quickly identify sudden changes in operating conditions. Not all failure mechanisms are gradual or linear. Sometimes normally stable conditions degrade quickly and unexpectedly. For example, an anchor bolt shears off and a smoothly running pump or fan suddenly loses alignment. Or a sudden increase in outdoor contaminants (leaves, dust, plastic bags, etc.) enters a ventilation fan and blocks the associated filter. Or the load on a breaker increases, and internal components suddenly start operating at damaging thermal levels. The next time on-line condition monitoring is applied, it will identify these issues.
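One simple way to flag the kind of abrupt shift described above is to compare each new reading against a baseline of recent stable readings. The Python sketch below is a minimal illustration, not a description of how any specific monitoring system works: the function name `is_sudden_change`, the three-standard-deviation rule, and the sample vibration values are all assumptions.

```python
from statistics import mean, stdev
from typing import Sequence


def is_sudden_change(
    baseline: Sequence[float],
    new_reading: float,
    k: float = 3.0,
    min_sigma: float = 1e-9,
) -> bool:
    """Return True when `new_reading` lies more than `k` sample standard
    deviations from the mean of the baseline readings.

    `min_sigma` keeps a perfectly constant baseline from producing a zero
    tolerance band.
    """
    if len(baseline) < 2:
        raise ValueError("need at least two baseline readings")
    mu = mean(baseline)
    sigma = max(stdev(baseline), min_sigma)
    return abs(new_reading - mu) > k * sigma


# Hypothetical pump vibration readings (mm/s) during stable operation.
stable = [2.0, 2.1, 1.9, 2.0, 2.1, 1.9]

print(is_sudden_change(stable, 2.05))  # False: within normal scatter
print(is_sudden_change(stable, 4.5))   # True: e.g. after a sheared anchor bolt
```

A check like this complements trending: the trend catches slow decline, while the baseline comparison catches a step change between one reading and the next.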


Maintenance and Commissioning

To keep your data center reliable and available, pay close attention to maintenance and commissioning.

  • Preventive maintenance is better than corrective maintenance, but preventive maintenance coupled with predictive maintenance based on on-line, condition-monitoring technologies is better still.
  • Commissioning and acceptance testing identify and resolve latent failures before the failures result in operational impacts to the mission.
  • Re-commissioning, continuous commissioning, and periodic retesting of equipment and systems improve operational reliability.
  • Deferred maintenance and delayed repairs allow combinations of discrepancies and deficiencies to align and result in the "perfect storms" that overcome the best redundancy schemes.

— Terry L. Rodgers


Posted on 8/14/2013
