Crash: Data Center Horror Stories
By Daren Shumate - October 2006 - Design & Construction
Every facility executive responsible for data centers can tell at least one horror story. Some are from direct personal experience; others are data center legends. Who hasn’t heard the story of the hapless IT professional who leaned a box against an emergency-power-off switch — powering down the data center — or the curious data center newcomer who wanted to know what the little red button on the front of the power-distribution unit did? Stories like these show how hard it is to prevent data centers from failing. Every data center is unique — there is no other like it. Every design is a custom solution based on the experience of the engineer and the facility executive.
Data center failures can be rooted in several sources — design, construction, maintenance, quality of material, quality of equipment, commissioning and direct human intervention. For the most part, data centers, even ones that fail, have the benefits of good design practice and intention, professional construction oversight, and high-quality craftsmanship. They are maintained according to data center quality guidelines. But a single overlooked mistake can quickly escalate into more significant issues — power and air conditioning failure — that can bring down a data center.
A good example comes from the colocation business, which is made up of real estate companies that offer tenants space, not in office buildings, but in data centers. The occupants are servers, not people. The data center real estate company brands its services based upon a promise to deliver non-stop climate control and power reliability. One moment without cooling or power harms not only the tenant, which stands to lose revenue as a result of down time and recovery time, but also the colocation company’s business model. Because customers who use colocation space are not necessarily part of the design, construction, commissioning and maintenance process, data center facility executives take on an extra responsibility of ensuring that the buildings run correctly.
One data center real estate company that maintains more than a million square feet of colocation space nationwide recently lost a data center as a result of a construction error that exposed a design miscalculation and a commissioning flaw. Cabling between the generators and the paralleling gear had been damaged during construction. As the cable was pulled through the conduits, its insulation was nicked and scraped. The damage was not enough to be detected by normal meggering (a test of insulation resistance) but was enough to create a weak link in the mission-critical power chain. Eventually, the cable insulation failed.
If everything works as designed, the loss of a single cable should not bring down a site. The design engineer had foreseen the potential for generator system failure and had designed paralleling gear with a programmable logic controller (PLC) that was supposed to handle this kind of fault. Instead, when the fault occurred, the PLC began shutting down the entire generator bank. Once the failure began to cascade, the PLC was unable to intervene.
When the shutdown was complete and the paralleling switchgear was cold, the entire site transferred to battery. Within the design time of 15 minutes, the batteries were depleted and every customer’s computers lost power. The data center had failed and the colocation company’s branding promise had been seriously compromised.
Why did this happen? Was it a construction error? A commissioning oversight? Could this be pinned to the owner’s design manager, the one who devised the paralleling scheme from the beginning? How about the engineering design team?
There were multiple causes for the failure. In this instance, a construction craftsmanship issue revealed a design shortfall.
Diagnosing the Problem
In hindsight, it is clear that more rigorous testing before commissioning was needed. The failure also showed that the PLC had not been programmed correctly to clear this fault condition and that commissioning had never exercised this fault scenario. The sequence should also have been part of the preventive maintenance program, a change that was made following the disaster.
The design/commissioning team had not anticipated the exact failure sequence. This project would have benefited from more involvement during the design phase from a commissioning agent with specific experience in PLC programming. Additionally, a third-party reviewer with topical design and operating experience would have added value if brought into the design process.
Every data center is one of a kind. The better the commissioning team can simulate real-life scenarios, the more reliable the data center will be.
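The kind of fault-scenario testing described here can be sketched in a few lines. The example below is a minimal illustration, not the logic of the actual paralleling gear: the generator names, the 1,000 kW ratings, the 2,500 kW load and both response functions are assumptions made up for the sketch. It injects a feeder fault on each generator in turn and reports which scenarios would drop the load under two candidate PLC responses.

```python
from dataclasses import dataclass


@dataclass
class Generator:
    name: str
    capacity_kw: float
    breaker_closed: bool = True
    feeder_fault: bool = False


def isolate_faulted(bank):
    """Hypothetical 'good' response: open only the breaker on a faulted feeder."""
    for gen in bank:
        if gen.feeder_fault:
            gen.breaker_closed = False


def shut_down_bank(bank):
    """Hypothetical 'bad' response: any feeder fault takes the whole bank offline."""
    if any(gen.feeder_fault for gen in bank):
        for gen in bank:
            gen.breaker_closed = False


def bus_capacity_kw(bank):
    """Capacity still connected to the paralleling bus."""
    return sum(gen.capacity_kw for gen in bank if gen.breaker_closed)


def run_fault_scenarios(make_bank, respond, load_kw):
    """Fault each feeder in turn; return the scenarios that cannot carry the load."""
    dropped = []
    for index in range(len(make_bank())):
        bank = make_bank()              # fresh, healthy bank for every scenario
        bank[index].feeder_fault = True
        respond(bank)
        if bus_capacity_kw(bank) < load_kw:
            dropped.append(bank[index].name)
    return dropped


def make_bank():
    # Assumed plant: four 1,000 kW generators (illustrative values only).
    return [Generator(f"GEN-{n}", 1000.0) for n in range(1, 5)]


print(run_fault_scenarios(make_bank, isolate_faulted, load_kw=2500.0))
print(run_fault_scenarios(make_bank, shut_down_bank, load_kw=2500.0))
```

With these assumed numbers, the isolating response survives every single-feeder fault, while the shut-down-everything response drops the load in all four scenarios. A real commissioning script would exercise the installed controls rather than a model, but the principle is the same: every credible fault gets its own test case.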
If the data center just described went down with hardly a whimper, another data center crashed with a literal bang.
In a multistory, high-profile government data center, a busduct-panelboard connection exploded, effectively shutting off power to approximately 15,000 square feet of the most critical computing in the facility.
In this incident, the design relied on an isolated redundant uninterruptible power supply (UPS) back-up. When a UPS system failed, a static automatic transfer switch was to shift to the already-operating isolated redundant UPS and transfer the load within a quarter cycle. The system worked well and the client was satisfied with the transfer scheme and the rotary concept.
Source of the Problem
Where this system failed was downstream from the automatic transfer switch. Each switch fed one busduct riser, which terminated directly in a main distribution panel located on each floor of the facility: one busduct per panel. A single fault on any busduct or main distribution panel compromised the critical load.
As it happened, the electrical connection between the busduct and the distribution panelboard failed and the load was lost. A single point of failure succeeded in bringing down the floor. Not until the facility’s electricians ran jumper cables from one of the intact risers and back-fed the main distribution panel did the floor have power.
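A single point of failure like this one can be found mechanically once the power path is written down as a graph. The sketch below is an illustration under stated assumptions: the component names (UPS-A, STS-1, BUSDUCT-1 and so on) are hypothetical labels, the topology is simplified to one floor, and a dual-path layout is included only for contrast. A component counts as a single point of failure if taking it out of service leaves the load unreachable from every source.

```python
def reachable(graph, sources, target, removed=None):
    """Depth-first search: can any source still feed the target
    when the component `removed` is out of service?"""
    stack = [s for s in sources if s != removed]
    seen = set(stack)
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for nxt in graph.get(node, []):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def single_points_of_failure(graph, sources, target):
    """Every component whose loss alone disconnects the target."""
    nodes = set(graph) | {n for outs in graph.values() for n in outs}
    candidates = nodes - {target}
    return sorted(
        n for n in candidates
        if not reachable(graph, sources, target, removed=n)
    )


# Simplified single-riser layout: two UPS sources, one switch, one riser, one panel.
single_riser = {
    "UPS-A": ["STS-1"],
    "UPS-R": ["STS-1"],          # isolated redundant UPS
    "STS-1": ["BUSDUCT-1"],
    "BUSDUCT-1": ["MDP-1"],
    "MDP-1": ["LOAD"],
}

# Hypothetical dual-path layout: two independent risers and panels reach the load.
dual_path = {
    "UPS-A": ["RISER-A"],
    "UPS-B": ["RISER-B"],
    "RISER-A": ["MDP-A"],
    "RISER-B": ["MDP-B"],
    "MDP-A": ["LOAD"],
    "MDP-B": ["LOAD"],
}

print(single_points_of_failure(single_riser, ["UPS-A", "UPS-R"], "LOAD"))
print(single_points_of_failure(dual_path, ["UPS-A", "UPS-B"], "LOAD"))
```

In the single-riser model the busduct and the main distribution panel both appear in the result. So does the transfer switch, simply because everything in this simplified graph funnels through it. The dual-path model returns an empty list, since no single component carries the whole load path.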
Why did this failure occur? The building had been designed in tight coordination between the government representative and the designer; the entire system had been commissioned and had been running with tight oversight for more than two years. What happened?
The cause of the problem was the failure of a manufactured busduct connector, one of hundreds in the building. The connector joined lengths of feeder busduct via a sliding piece — designed to slide approximately one-quarter of an inch to make installation easier — and a break-away torque bolt designed to ensure that the installer did not over-torque the bolt.
Although the investigation team was not asked to explain exactly why the joint exploded, it determined that the quarter-inch of play designed into the connector had actually allowed a portion of the copper busduct to be exposed to the atmosphere without insulation. The team surmised that the perfect combination of airborne dust, humidity and possibly other contaminants led to an arc that became a fault and exploded.
During the analysis, the investigation team isolated each busduct riser from the static automatic transfer switch at the source and from the main distribution panel at the termination. During the megger test, the electrical forensic team discovered two additional joints that didn’t pass, clearly more candidates for potential failure. Not only did the joints not pass the megger test, two of them visibly and audibly arced while the voltage was ramped up during the testing. The joints had shown themselves to be the weak link in the system. The installed busduct technology was vulnerable to catastrophic failure.
This emphasized the importance of several lessons that might seem like common knowledge, but nevertheless slipped past all parties in the complex design and construction process of the data center.
The first is to eliminate single points of failure. Had there been dual paths to the critical load and either static switch power-distribution units or rack-mounted static switches, there would have been no data center failure.
The second lesson is to use conduit and wire in lieu of busduct. Every electrical connection is a potential failure. The feeder busway system installed had mechanical connectors every 12 feet. Conduit and wire have connectors only at the source and at the load.
Lesson number three is to use only data-center-grade equipment in data centers. The installed busway proved inherently unreliable: its connections were vulnerable to human error, which led to one failed connection in service and two more uncovered during testing.
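The arithmetic behind the second lesson is easy to check. The sketch below counts connections for a feeder run built from 12-foot busway sections and compares it with conduit and wire, which has a connection only at each end. The 120-foot run length and the 0.999 per-connection reliability are assumed figures chosen purely for illustration, and the calculation treats connections as independent, which is a simplification.

```python
import math


def busway_connection_count(run_length_ft, section_length_ft=12.0):
    """Joints between busway sections plus the two end terminations."""
    sections = math.ceil(run_length_ft / section_length_ft)
    return (sections - 1) + 2


def series_reliability(per_connection_reliability, connections):
    """Probability that every connection in a series path is healthy,
    assuming connections fail independently."""
    return per_connection_reliability ** connections


run_ft = 120.0                      # assumed riser length
busway_connections = busway_connection_count(run_ft)
wire_connections = 2                # source and load only

print(busway_connections, wire_connections)
print(series_reliability(0.999, busway_connections))
print(series_reliability(0.999, wire_connections))
```

For the assumed 120-foot run, the busway has 11 connections against 2 for conduit and wire. Whatever reliability figure is plugged in, the path with more connections in series always comes out lower.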
Unfortunately, data center professionals do not necessarily have the chance to test drive a facility before it’s completely operational. At the end of the day, every data center is unique, and professionals must take all of the right steps to make sure they anticipate future mishaps and learn the lessons of previous experiences.
Five Elements of a Reliable Data Center
Building and designing a data center is a complicated process. The complexity is compounded not only by the building type, but by the fact that each data center is unique, built and designed to meet specific criteria. A successful project depends upon five things:
Daren Shumate
Daren Shumate, PE, serves as a principal and director of the MEP engineering studio in the Washington, D.C. office of RTKL, an international architectural and engineering firm. He has extensive experience in electrical engineering and related specialties, including power distribution, lighting design, life-safety systems, security systems, telecommunications, emergency standby power, power conditioning, controls and instrumentation. His responsibilities have included the management of a wide variety of projects from conceptual design through commissioning of new systems.
Comments
howard wrote re: Crash: Data Center Horror Stories
on 3/16/2011 8:42:51 PM
Last year the spate of hot weather also knocked out a London data centre that was experiencing a cooling failure, bringing down servers belonging to Last.fm for nearly 5 hours!
Regards,
Howard




