8 Ways To Bring Down Data Centers
To keep data centers operational, don't let operations themselves be the Achilles' heel. And sometimes, less obvious risks are the hardest to anticipate.
Facility managers responsible for critical facility operations are constantly concerned with electrical power reliability. Preventing unplanned downtime is paramount. And with 24/7 availability increasingly a requirement, reducing and eliminating planned downtime is almost as important.
Most critical electrical systems operate reliably even with many details not right. Ironically, the more redundancy built into a system, the more likely that latent defects can lurk unnoticed. The equipment and systems are often robust enough to soldier on with maladies that may or may not be apparent. Even with everything right and functioning correctly, sooner or later something will break. But when enough things are wrong, and find themselves lined up to compound each other, the result is what’s often described as a “comedy of errors” that brings the system down. Sometimes it’s difficult to determine exactly which straw broke the camel’s back.
Root causes of power reliability vulnerabilities range from the obvious to the hidden. Here are some of the more common critical power failures:
1. A standby/emergency generator fails to start when utility power fails for more than a few seconds, or generator transfer switchgear fails to transfer, or the generator fails to run until utility power is restored. Causes are numerous: allowing equipment to remain in alarm, or not in auto; allowing the generator circuit breaker to remain open; allowing a battery charger failure or charger power off/failure to persist; aging engine-cranking batteries; diesel fuel contamination — the list goes on.
2. A UPS system randomly fails even though utility power is good, and then fails to successfully transfer to bypass. Causes include aging capacitors, incorrectly adjusted bypass breakers, lack of maintenance, factory-directed field modifications that have not been performed, etc.
3. A UPS system or batteries fail when utility power fails, and the generator (if available) has not yet started and run up to speed, or during transfer between generator and utility sources. Causes include allowing the UPS system to remain in bypass or in alarm, or with low- or failed-battery warnings, aging batteries, etc.
Ways to reduce risk from more common failure scenarios include designing in effective redundancy; purchasing, installing, and commissioning quality equipment; providing periodic maintenance; and regularly performing load, transfer, and failover testing. All of these elements leading to reliable power require attention to detail and efforts over and above those typically provided in the competitive marketplace.
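Many of the causes listed in items 1 through 3 are conditions that were simply allowed to persist. As a rough illustration only, a routine walk-through or monitoring poll could be reduced to a checklist like the Python sketch below. The class, the field names, and the messages are hypothetical; a real facility would map them to whatever points its own monitoring system exposes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StandbyStatus:
    """Snapshot of standby-power status flags (all field names are hypothetical)."""
    generator_in_auto: bool
    generator_breaker_closed: bool
    generator_alarm_active: bool
    battery_charger_ok: bool
    ups_in_bypass: bool
    ups_alarm_active: bool
    ups_battery_warning: bool


def readiness_issues(status: StandbyStatus) -> List[str]:
    """Return the conditions that should not be allowed to persist."""
    # Each pair is (condition is a problem, message to report).
    checks = [
        (not status.generator_in_auto, "generator not in auto"),
        (not status.generator_breaker_closed, "generator circuit breaker open"),
        (status.generator_alarm_active, "generator in alarm"),
        (not status.battery_charger_ok, "battery charger failed or unpowered"),
        (status.ups_in_bypass, "UPS left in bypass"),
        (status.ups_alarm_active, "UPS in alarm"),
        (status.ups_battery_warning, "UPS low- or failed-battery warning"),
    ]
    return [message for is_problem, message in checks if is_problem]
```

An empty list means none of these particular conditions is present; it says nothing about aging batteries, fuel quality, or other causes that only maintenance and testing will reveal.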
Less common power issues don’t get due respect. Not only can these cause critical power failures, but after a failure occurs their root causes can remain undetected, waiting to strike again.
4. Circuit breaker nuisance tripping. Even thoroughly tested, properly coordinated and adjusted breakers can trip at current much lower than their rating. Sometimes a breaker with adjustable trip settings is replaced by an electrician and the settings are not properly adjusted, or not adjusted at all. Often, everything works fine in a new, lightly loaded system; problems don’t occur until load is added.
A breaker that trips for no apparent reason, then is reset without investigation, is waiting to nuisance-trip again. Redundant circuit breakers, which often require redundant switchboards, eliminate reliance on a single breaker.
5. Circuit breakers installed in 24/7 live switchgear that have not been recently cycled (opened and closed) or energized (design voltage applied) or tested can be problems waiting to surface at the worst time; this applies (maybe even more so) to small, new breakers out of the box. Breakers that won’t close or reclose, breakers that won’t open, or breakers with internal open or shorted poles can present life-safety and downtime risks. A breaker of unknown service condition in a critical application, whether new or old, loose or installed, off or on, loaded or unloaded, should always be operated carefully, and tested as much as feasible prior to installation and energization.
6. Neutral and grounding issues. Almost every building operates with many incorrect wiring and bonding details. The neutral can be bonded to ground only where power enters the building or where there are newly created sources (transformers, generators, UPSs, etc.) per safety codes. But illegal neutral-to-ground bonds at switchgear, outlets, and equipment are common.
Electrical switchboards often ship from factories with a neutral-to-ground bond jumper installed; the installing electrician has to remove this jumper if it is not part of the design. This detail is too often overlooked.
Neutral and ground conductors that get switched, as well as missing, loose, or corroded neutrals and grounds, are other common issues.
Sometimes the problem is tricky, such as a 4-pole ATS that does not switch the neutral in the correct timing sequence. In many of these cases, equipment will remain operational without indicating any alarm or concern. But life-safety and performance risks are present. High ground current, ground-fault interrupter trips, electrical shocks, or sparking should be investigated.
7. Bypass and transfer mechanisms that have not recently been operated, or operated under load, or operated at all (even many years after installation) should be operated carefully and tested as much as feasible prior to operation. Check for proper voltages, phase rotation, and expected open/closed status on all three phases, as much as can be done safely. Also check for current sharing through both power paths on all three phases (this often can be observed with installed metering) during transfer, at the step where the backup source has been connected but the primary source in parallel has not yet been disconnected.
8. Emergency power off (EPO) circuitry in a 24/7 live facility that has not been recently tested, or for which validated (trusted) wiring diagrams are not available, should be treated carefully. “Defanging” the circuit (removing control wiring) at each connection point (UPS units, cooling units, circuit breaker shunt-trip points, etc.) prior to testing or modification will reduce or eliminate the risk of inadvertent shutdown of critical equipment. Newer safety codes have relaxed the need for many EPO systems. Codes typically don’t require external shutdown signals, such as those from fire control systems; however, these are often installed anyway by well-meaning technicians. If an EPO cannot be eliminated, the risk of erroneous EPO shutdown can be reduced by installing bypass circuits, alarmed covers, strobes, time delays, or dual/redundant EPO systems.
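The checks described in item 7 (voltages, phase rotation, and current sharing on all three phases) can be written down as a simple comparison of meter readings. The Python sketch below is one way to do that, under stated assumptions: the function names, the 5 percent voltage tolerance, and the 1-amp threshold for "carrying current" are all hypothetical and would need to be set by the engineer responsible for the equipment.

```python
PHASES = ("A", "B", "C")


def pre_transfer_issues(primary_volts, backup_volts,
                        primary_rotation, backup_rotation,
                        tolerance=0.05):
    """List problems found before connecting the backup source.

    Voltages are dicts keyed by phase; rotations are strings such as "ABC".
    The 5% tolerance is an assumed value, not a code requirement.
    """
    issues = []
    if primary_rotation != backup_rotation:
        issues.append("phase rotation mismatch")
    for phase in PHASES:
        expected = primary_volts[phase]
        if abs(backup_volts[phase] - expected) > tolerance * expected:
            issues.append(f"phase {phase} voltage outside tolerance")
    return issues


def phases_not_sharing(primary_amps, backup_amps, min_amps=1.0):
    """During the overlap step, list phases where either path carries no current.

    min_amps is an assumed threshold for "carrying current".
    """
    return [
        phase for phase in PHASES
        if primary_amps[phase] < min_amps or backup_amps[phase] < min_amps
    ]
```

A phase that shows current on one path but not the other while both sources are connected suggests an open pole or a contact that did not close, which is worth knowing before the primary source is disconnected.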
Incident tracking, regular incident review, and lessons learned sessions should include a specific emphasis on recognizing and addressing near misses. Unplanned downtime that was averted through quick thinking on the part of senior staff, heroic effort, or simply good fortune should be documented and analyzed. Once the cause or causes for the near miss are understood, it’s time to make changes. With improved procedures or documentation, the next similar event may be dealt with on a routine basis; the event may be successfully addressed by more junior personnel, and with significantly reduced risk of catastrophe.
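For teams that track incidents in a spreadsheet or a small script, the emphasis on near misses can be built into the record itself. The following Python sketch shows one possible shape for such a record; the class, its fields, and the helper function are hypothetical and are not taken from any particular incident-management product.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Incident:
    """One entry in an incident log (field names are hypothetical)."""
    occurred_on: date
    summary: str
    caused_downtime: bool
    root_causes: List[str] = field(default_factory=list)
    corrective_actions: List[str] = field(default_factory=list)

    @property
    def is_near_miss(self) -> bool:
        # A near miss is an event where downtime was averted.
        return not self.caused_downtime


def open_near_misses(incidents: List[Incident]) -> List[Incident]:
    """Near misses that still have no corrective action recorded."""
    return [
        incident for incident in incidents
        if incident.is_near_miss and not incident.corrective_actions
    ]
```

Reviewing the output of `open_near_misses` at each incident review keeps averted events on the agenda until a procedure or documentation change has been recorded against them.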
Inadequate operating procedures can be a significant Achilles’ heel for critical facilities, causing unplanned downtime or near misses. Complex facilities require thorough operating documentation that is vetted for accuracy, completeness, correctness, and understandability.
Normal conditions, normal responses to failures, emergency responses to more common failures, and transfer procedures should be documented. Too often, key procedures and processes reside only within the minds of senior operators, or worse, only with outsourced service technicians not based at or dedicated to the critical facility.
One simple approach to begin documenting operating procedures is to have technicians use their cell phones to take photos of equipment in normal operating modes, including lights, switches, breaker positions, auto/manual positions, load levels, etc. Then print the photos and tape them to the equipment, adding titles such as “normal switch positions” or “typical voltages, normal load levels as of such and such date.” Any time equipment operating status is manually or automatically changed, especially during maintenance, photograph and make a note about the situation.
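The same idea carries over once the photos have been transcribed: record the normal positions once, then compare what is observed against that baseline. Here is a minimal Python sketch, assuming the positions are kept as simple label-to-value pairs; the function name and any equipment labels used with it are hypothetical.

```python
def deviations_from_normal(normal, observed):
    """Compare observed equipment states against the recorded normal states.

    Both arguments are dicts mapping a label (for example a switch or breaker
    name) to its position. Returns {label: (normal_value, observed_value)} for
    every label that differs; a label missing from `observed` shows up with
    an observed value of None.
    """
    return {
        label: (expected, observed.get(label))
        for label, expected in normal.items()
        if observed.get(label) != expected
    }
```

Run after maintenance, a non-empty result is a prompt either to restore the normal position or to note why the equipment was left in a different state.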
Maintaining accurate as-built construction drawings, equipment labeling, and operating sequences is critical to reliable operation. Think of driving from home to your workplace. On such a familiar route, street signs could be removed or even changed and you might not notice. The same is basically true of drawings and labeling when known errors are not corrected or changes are not indicated in a timely manner. Experienced operators don’t even need to look at drawings or signage for routine tasks. But pitfalls await the new hire, temporary replacement, or new outsourced service technician.
With effort, and not always at significant cost, most critical power failures causing unplanned downtime can be prevented, or if downtime occurs, remedies can prevent recurrence.
Michael Fluegeman (firstname.lastname@example.org), PE, is principal and manager of data center support systems for PlanNet, an independent professional services firm that provides advisory, design, project management, and construction services supporting critical IT infrastructure.
SIDEBAR: The Downtime Blame Game
Unplanned downtime sometimes results from more of a conspiracy of issues than from a single problem. The response often requires a systematic assessment to correct enough issues to significantly reduce vulnerability. Even when a “smoking gun” is found, it may be prudent to continue searching for potentially contributing issues. Then it’s a good idea to test the changes made, in operating scenarios that are as realistic as feasible.
But conducting that sort of objective analysis can become very political. Careers are on the line. So are vendors’ future sales. Finger-pointing or a cover-up of some sort ensues. Blaming individuals who have moved on or equipment that simply failed and needs repair or replacement provides cover. Jobs and vendor status get some protection but facts don’t get reported. Changes to prevent the same or similar issue are not made.
The politics of “lessons learned” from a significant failure sometimes leads to overkill. Managers and executives know how hard it is to explain why a problem was allowed to happen a second time. For example, a UPS battery failure causing unplanned downtime can make replacing batteries with flywheels appealing, even at much higher cost. Even if the flywheels fail, at least it was not another battery failure. However, a more measured response, once failed batteries are replaced, might include beefing up battery maintenance and performance testing, and enhancing monitoring.
A data center group in the United Kingdom aims to educate the industry about the true causes of data center crashes. The new Data Center Incident Reporting Network (DCiRN) aims to gather confidential information about the causes of data center failures, covering both facility infrastructure and IT systems, according to the group’s website.