Data Center Reliability: Looking for Trouble
To ensure the safe, dependable operation of data centers, managers need to address potential hazards related to fire safety and power interruption.
By Christopher Dempsey, Contributing Writer
The great data center buildout has arrived, and the competition among tech giants is fierce. Could this monumental initiative create risks and headaches for institutional and commercial building owners and operators? Possibly.
AI will require many massive data centers and a huge increase in power generation – and fast. But it will also demand careful design, construction and operation. Failure to build and operate reliable, resilient data centers increases the risk of costly disruptions and the reputational harm that comes with them.
More than one-half of data center operators experienced outages in the past three years, most often from power failures and cooling issues, according to a 2024 survey by the Uptime Institute. Downtime is expensive: Business interruptions can cost as much as $9,000 per minute and up to $5 million an hour in critical sectors.
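To put those two figures on the same footing, the short sketch below converts each to the other's unit. It assumes cost scales linearly with outage length, which is a simplification rather than anything the survey states, and the function and variable names are illustrative only.

```python
# Figures quoted in the article; the linear cost model is an assumption.
COST_PER_MINUTE_USD = 9_000
CRITICAL_COST_PER_HOUR_USD = 5_000_000


def outage_cost(minutes: float, cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Estimate outage cost, assuming cost grows linearly with duration."""
    return minutes * cost_per_minute


# One hour at the per-minute figure.
hourly_equivalent = outage_cost(60)

# The critical-sector hourly figure expressed per minute.
critical_per_minute = CRITICAL_COST_PER_HOUR_USD / 60

print(f"$9,000 per minute is ${hourly_equivalent:,.0f} per hour")
print(f"$5 million per hour is about ${critical_per_minute:,.0f} per minute")
```

On these assumptions, the critical-sector figure works out to roughly nine times the general per-minute figure.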
While owners and managers might not be able to influence the design of these behemoths, they need to understand the myriad risks, mitigate them where possible and speak up when they can influence design and equipment decisions. If unacceptable risks remain after the data center is built, the next step is to look for cost-effective retrofits that reduce them.
Spotlight: Lithium-ion batteries
One common hazard in an AI data center involves lithium-ion batteries. They provide backup power for uninterruptible power supply (UPS) systems in data processing equipment rooms. They can also store renewable energy on site for use during cloudy or windless periods.
Though essential assets, lithium-ion batteries pose the risk of fire and explosion via a process called thermal runaway, which is a chain reaction in damaged or defective batteries. Once thermal runaway starts, it cannot be stopped. However, new technology can sense the telltale signs of imminent thermal runaway early enough to head it off. These off-gas detectors sense the electrolyte vapors that a lithium-ion battery gives off in the minutes before it goes into thermal runaway, giving managers and front-line technicians a better chance to intervene before the reaction begins.
Batteries become prone to thermal runaway due to electrical abuse, including improper charging or discharging; thermal abuse, where a cell operates outside normal conditions because of electrical abuse or exposure to high temperatures; and mechanical abuse, involving cell damage sustained by physical impact.
To date, testing has shown that active fire protection cannot stop the thermal-runaway process when the cells are enclosed, as lithium-ion batteries typically are. But automatic sprinklers are still helpful. They can provide cooling to structures, combustibles and even adjacent modules and packs to help limit the spread of fire.
Water-mist systems also provide viable fire protection for data centers if all concealed spaces are adequately protected. These systems typically require less water than sprinkler systems to control or extinguish a fire, and they potentially cause less water damage to sensitive computing equipment. The systems are already commonly used in Europe, where access to public water can be limited.
Beyond batteries
Lithium-ion batteries are just one hazard that threatens the safe, reliable operation of data centers. Other hazards include:
Electrical failures. Surges and arc flashes can ignite fires.
Networking gear. Wires and fiber optic cables, once contained under suspended data center floors, are more likely to run in combustible plastic trays above servers. The space has more oxygen to fuel fires and less separation from computing equipment.
Partitions. New data center designs employ alternating hot and cold aisles to optimize cooling. Plastic partitions between the aisles can be combustible.
Green facades and mass timber construction. Although these features might offset companies’ carbon footprints, they can be combustible. Solar panels also can introduce new ignition points.
Ignitable fluids. Glycol and refrigerants used in liquid direct-to-chip cooling or mineral oil for immersion cooling are flammable at high concentrations and temperatures. On-site power generation involves ignitable fuels, including diesel or natural gas.
Human error. As in any facility, hot work such as cutting or welding can lead to accidents.
For the utmost protection of data processing equipment, operators can add halocarbon or inert gas — clean agent — fire extinguishing systems with very early warning fire-detection. This approach supplements the automatic sprinkler or water-mist system protection.
By detecting smoldering or off-gassing from overheating or low-energy fires, very early warning fire-detection systems can detect incipient fires in critical areas before flames or even noticeable smoke develops. For information on operational considerations related to these systems, see the FM Property Loss Prevention Data Sheet 5-32.
Power challenges
Although fire can be catastrophic, facilities managers in AI data centers will face other major concerns, such as ensuring electricity is always available to serve power-hungry computing processes. Tellingly, data centers are categorized not by square footage or computing capacity but by the electricity they consume, such as a gigawatt-scale data center.
Specific power requirements depend in large part on the kind of data center a manager is operating. Training data centers are where large language models are schooled on enormous volumes of data so they can later respond to user prompts. They often use the most powerful chips — GPUs and other accelerators — and house large, concentrated clusters of 24/7 power demand. Since consumers never interact with models during training, there is little benefit in distributing these centers closer to users.
Inference data centers are where chatbots and AI tools generate real-time answers, often using the parameters developed when the models were trained. As with conventional cloud data centers, inference data centers need to be dispersed geographically to improve the response time for end users. Training requires a massive upfront surge of electricity, while inference sustains a continuous but lower-level draw.
Traditional off-site power generation and long-distance transmission to the point of use will clearly be strained during the data center buildout. For this reason, co-locating power generation facilities makes sense. The facilities can be hydrocarbon-fueled, renewables with battery storage or even small modular nuclear reactors in the future.
In any configuration, power can be a major responsibility for facilities managers at data centers. The task is enormous, and managers should focus on avoiding power interruption. They need to ensure they have adequate on-site backup systems, alternate providers, separate lines connected to the nearest substation and an up-to-date recovery plan, and they need to practice executing that plan.
Looking for trouble
Beyond issues related to fire and power interruption, other potential sources of disruption to the reliable operation of data centers include:
Physical hazards. Wildfires, floods and severe weather can damage infrastructure. Managers can use hazard maps to understand which events pose a risk at vital locations and take measures to reduce the consequences. Ideally, a facility has low-flammability construction, an ample water supply for fire protection, flood protection such as gates or inflatable dams, and thoughtful stormwater management. Managers overseeing facilities using solar panels should consider hail-resistant and tiltable versions.
Network limitations. AI response requires instant bandwidth. Managers need to ensure they have high-speed, redundant connections.
Cybersecurity. AI is only as secure as its weakest link, so managers need to schedule thorough physical and digital security audits and harden defenses accordingly.
Human error. Even with automation, people remain in the loop. Operators need rigorous training, clear priorities and regular drills to ensure continuous operations.
Planning and inspection
A great deal is at stake for facilities managers overseeing AI data centers, and much can go wrong. It is critical that managers prepare for the worst and avoid preventable loss by taking these steps:
Establish an emergency response team. Formalize procedures for interrupting electrical power at the source, isolating leaking liquids at control valves, notifying the fire department and operating extinguishing equipment.
Plan for power isolation, equipment failure, service interruption, disaster recovery and business continuity in the event of a serious disruption. Have a flood response plan if the facility is in a 500-year flood zone. Give the local fire department a tour of the facility, describe the equipment in use, and share the power-isolation plan.
Perform regular housekeeping. Ensure potential ignition sources — smoking, hot work, temporary heaters and cooking equipment — are controlled. Scan for the accumulation of combustible materials, and make sure spare parts and manuals are stored in closed metal cabinets. To document such best practices for the industry, FM has published research-based guidelines for data center construction, equipment, power, protection, inspection, testing, maintenance and emergency planning.
Although a manager might not work in an AI data center today, chances are the organization uses a data center in some way, and many of these principles apply to data centers of all sizes inside every type of business. By applying hard-earned skills and best practices, managers can help ensure that data centers continue to operate reliably and safely and support the core mission of the facility and the organization.
Christopher Dempsey is senior vice president with FM.