To the public, the collapse of the Northeast power grid last August was a surprise. But to those in the know, the blackout had been 30 years in the making. For more than a decade, specialists in mission-critical facilities had been preaching the importance of training personnel and testing backup systems under long-term blackout conditions.
“Aug. 14 had a major impact in our world,” says Peter Gross, chief executive officer of EYP Mission Critical Facilities Inc. “One thing we learned is that the power grid is fundamentally vulnerable, considering the kind of power the digital economy needs. It was another validation of what we all knew, but was painful to go through.”
The vast majority of backup systems performed well because they were well designed and the facilities adequately prepared. Those that weren’t suffered the consequences.
“The blackout provided some surprises in cases where building operators failed to test or copy the conditions of a blackout,” says Rob Friedel, senior vice president of the Syska Hennessy Group consulting firm.
For those whose systems worked initially, the duration of the blackout became a problem. As the blackout dragged on and more equipment was added to backup systems, it became more difficult for those systems to remain operational.
If that weren’t enough, problems outside the building and outside the reach of facilities executives created even more trouble. For example, fuel distribution plans from vendors went awry, and systems beyond the building’s own uninterruptible power system failed.
Experts in mission-critical performance say many of these issues could have been avoided with better crisis management and strategic planning.
While the blackout was a painful event, it was also a learning opportunity.
“When blackouts occurred during the 1960s and 1970s, a facilities executive was in a different position,” Friedel says. “Back then, a blackout was somebody else’s problem. That was before the concept of mission-critical facilities and the importance of today’s technology. Now, the cause may be somebody else’s problem, but the reaction belongs to the facilities executive.”
The length and scope of this blackout underscored that point.
“What’s special about this blackout is the size and duration, and that it affected major urban centers, where critical facilities are located,” says Guy Despatis, director of engineering for HOK’s San Francisco operations, who helped facilities executives navigate California’s rolling blackouts two years ago. “Keeping the equipment going is one thing, but many businesses needed to get people back to work.”
To help keep businesses functioning during a lengthy crisis, Friedel recommends that those responsible for facility uptime review disaster recovery plans and internal steps for maintaining functionality.
Facilities executives also need to place more emphasis on training. Organizations should clearly state who must report to work at the first word of a disaster and who is responsible for which functions during a crisis.
“Everyone on the team must know what to do,” Friedel says. “Unplanned things are going to occur. You don’t want people freelancing during a crisis.”
Despatis adds: “Businesses that operate 24x7 need clear procedures when systems fail. There is not much time for staff to think and react when these things happen. They have to be prepared.”
He points to a bank that tested periodically. “Every time they tested, they found something wasn’t working quite right,” he says. “This testing allowed them to keep operational uptime at a maximum.”
Human error and poor management choices usually are the underlying factors that cause a small problem to become a big problem, says Kenneth G. Brill, executive director of Uptime Institute, a brain trust for data centers seeking to improve their 24x7 availability.
“Failures are never the result of one thing,” Brill says. “There are typically five to 10 interacting malfunctions or decisions. It is a domino effect.”
Human activity is responsible for 70 percent of the failures that cause data center outages, Brill says. Of that number, 21 percent are directly related to human error. The rest are really management problems: the result of someone permitting conditions known to cause problems.
For example, if an employee is told to push a specific button and the button is not clearly labeled, that is a management problem. If the button is clearly labeled, but the employee still pushes the wrong button, that is human error.
“You can’t expect your crew to do the right thing if you don’t give them the right training, the right procedures and the right tools,” says Brill.
In addition to training and testing, facilities executives must review their vendors’ support plans.
“It’s not enough for a vendor to say, ‘We’ll take care of you.’ Facilities executives need to know exactly what vendors are going to do to keep their commitments,” Friedel says. “Facilities executives must take a broad look at power loss.”
Brill recommends that facilities executives hold vendors accountable for their advice and promises.
Selecting a generator rating and testing its functionality are crucial steps in protecting a building against disaster. Most installed generators are standby rated, meaning they are designed to run for only two hours out of every 24. During the blackout, many standby generators were intentionally overloaded because management assumed they wouldn’t have to run for very long. Brill says that eight of his organization’s 51 member companies were affected by the blackout. Of those eight, six ran their engine plants for as long as 48 hours.
With this case study in mind, Brill says facilities executives need to review the way they test their backup systems. Generators fail at three points: immediately, within the first hour and within the first 24 hours. Typical problems include batteries with insufficient capacity, low fuel supplies, overheating and failing pumps.
Currently, most tests are really exercises, meaning the generator gets fired up once a week and allowed to run for a few hours without load. This proves only that the engine runs. A more valid test is to run the engine with load for a few hours and, at least once a year, allow it to run with load for 24 to 48 hours. Brill says that Uptime members who conducted the more grueling tests reported no failures.
“We need to rethink the way people test and maintain facilities,” Gross says. “There needs to be more system level testing, not one generator or battery at a time.”
As the need for more mission-critical facilities increases, next-generation designs should include more measures to handle long-term blackouts. Events such as Sept. 11, the Northeast grid blackout and the California rolling blackouts are changing the way backup systems are being engineered.
“There’s a change in design philosophy,” Gross says. “Designing facilities to handle a much longer power interruption is a wise thing to do.”
Friedel suggests that designs for new facilities as well as expansion plans for existing space include strategies for larger emergency power distribution systems.
Despatis recommends that plants be designed to cover four or five days of operation. “When you talk about 7x24, you have to keep in mind critical equipment design to be assured that you’re always operational.”
Fuel storage is a specific issue that must be addressed. Most plants have only two to three days of fuel on hand. A major blackout could delay delivery of additional fuel; therefore, additional storage must be considered.
Cogeneration is another possibility for maintaining long-term functionality. With the threat of terrorism and sabotage and the risk of more blackouts, on-site power generation is gaining attention.
Some facilities in the San Francisco Bay area have gone that route by providing cogeneration for part of the building load. Planning for a full building load may not be necessary if only part of the building is mission critical. “It’s rare that the whole building needs 7x24 protection,” Despatis says.
Still, cogeneration is a major investment that must make economic sense. Generating both heat and power could improve plant efficiency and make the concept more attractive to users. But cogeneration brings a new set of issues into play, such as fuel costs and storage, Gross says. It also requires skills that current personnel may not have, particularly in data centers, Brill points out.
No one knows what the impact on the grid would be if thousands of on-site generating plants were to come online and offline, Gross says. “There are major issues with large generation plants and their connection to the existing grid creating a level of instability.”
Facilities executives have many options for keeping their 7x24 facilities online. The bottom line is to understand operational requirements and to clearly determine what is required to sustain a mission-critical facility through a longer outage.
Leveraging the experiences of its more than 200 member companies, 7x24 Exchange is the leading information exchange organization for those who design, build, use and maintain mission-critical facilities. The organization aims to improve the uptime of all 7x24 facilities by providing a forum for exchanging knowledge and experience.
“7x24 Exchange tries to keep its membership focused on business objectives, and facilities and technology working as a team, instead of facilities executives focusing only on facilities and IT executives focusing only on IT,” says Robert J. Cassiliano, 7x24 Exchange chairman.
Its conferences bring together experts in the fields of information technology and facilities design, construction and management to share ideas on how to improve the end-to-end reliability of critical infrastructures.
For example, following the destruction of New York’s World Trade Center, the 7x24 Exchange brought together organizations with facilities affected by the tragedy so that its membership could learn from the experience. Currently, the group is helping its members improve performance based on the lessons learned from the Northeast power grid outage.
In addition, the Exchange works with vendors to address member concerns, says David Sjogren, the organization’s president. The association was instrumental in driving vendor acceptance and implementation of dual-cord technology for mission-critical facilities. The Exchange has 256 member organizations, representing more than 5,000 individuals. It is open to anyone who has a role in maintaining the uptime of a mission-critical facility.
“We bridge the gap between facilities people and IT people,” Sjogren says. “We provide an opportunity to gain knowledge that may not be written down anywhere.”