In Mission-Critical Space, Operations Are Critical

Training and procedures are keys to preventing human error

By Rita Tatum  

Any business with a mission-critical facility infrastructure suffers losses when the network goes down because of a power outage or another facility-related problem, however brief the disruption. But when the business relies on data, downtime can add up to millions of dollars very quickly. One national online brokerage reported that it lost more than $100,000 per minute when its trading system went down. According to McGladrey and Pullen data, one of every 500 data centers has a severe disaster annually. After such failures, 43 percent of companies close their doors, and 29 percent more close within two years.

Network downtime caused by the facility environment and support equipment cost companies from $350,000 to $11 million, with an average annual loss of $5 million, Tom Poulter reported in the Disaster Recovery Journal.

What is the root cause of system outages in mission-critical buildings? One of the biggest problems is single points of failure, says David Sjogren, president of 7x24 Exchange and principal with Strategic Facilities Inc. The single points of failure that bring critical facilities down come from several sources. These include:

  • Problems with initial design and construction or modifications after original design and construction.
  • Misuse and abuse of systems.
  • Lack of standard operating procedures or improper maintenance procedures.
  • Poor business risk assessment.

As with any data operation, a single point of failure often seems insignificant until it fails.

David Troup, vice president and director of mechanical engineering at Hellmuth, Obata + Kassabaum (HOK), recalls one data center that had good redundancy on all major systems. The point of failure was getting the makeup water to the cooling tower in the high-rise building. A booster pump served the building and the cooling tower. A failure at the pump controller took all pumps offline, including the one that sent makeup water to the chilling plant. The company now has a dedicated pump and pump riser to feed the cooling tower. “The tendency is to look at the big equipment and miss the little things like cooling makeup water,” says Troup.

Philip D. Sayers, president of Mid-Atlantic Companies for Consolidated Engineering Services, recalls one company with two highly critical servers that had redundant uninterruptible power supply (UPS) systems, two generator sets and two power sets. Unfortunately, the two power sets were both wired on the same power distribution unit (PDU). “When a capacitor in that PDU failed, both the prime and backup system were out of business,” says Sayers. The PDU was repaired quickly, but by the time servers were reconfigured and readdressed, minutes had turned into hours.
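The kind of check that would have caught this wiring mistake can be sketched in a few lines of Python. This is a minimal illustration, not a tool anyone quoted here uses: the component names and the `single_points_of_failure` helper are hypothetical, and the model assumes each power path fails if any one of its components fails.

```python
# Hypothetical model of the setup Sayers describes: each entry lists every
# component that one power path depends on. All names are illustrative.
power_paths = {
    "primary": {"ups_a", "generator_a", "power_set_a", "pdu_1"},
    "backup": {"ups_b", "generator_b", "power_set_b", "pdu_1"},
}


def single_points_of_failure(paths):
    """Return the components that every path depends on.

    Under the assumption that a path fails when any of its components
    fails, losing one of these components takes down all paths at once.
    """
    component_sets = list(paths.values())
    if not component_sets:
        return set()
    return set.intersection(*component_sets)


# The shared PDU is the only component common to both paths.
print(sorted(single_points_of_failure(power_paths)))
```

Moving the backup power set to its own PDU in this model leaves the two paths with nothing in common, and the function returns an empty set.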

Making sure systems work properly can pay off in a big way. Florida’s Peak 10 Technology Gateways learned that in May 2002 when a failed lightning arrestor triggered a series of equipment problems that took down the Jacksonville Electric Authority. Monitoring equipment at Peak 10’s facility noted the loss of commercial power and switched to battery power. Meanwhile, the building’s diesel generator powered up to keep the data center running. The company ran off generator power for about an hour and a half without missing a byte.

Misuse and Abuse

To avoid design flaws, Sjogren recommends facilities executives carefully select architects and mechanical and electrical engineers who are specialists in 7x24 facilities. But even in a well-designed data center or carrier hotel, there are no guarantees that problems won’t occur; the most robust design can be undermined by human error. “Companies overload the power supply or add new servers without adjusting for the additional cooling requirements,” says Sjogren. “They don’t maintain their batteries or mechanical components, thinking they haven’t had an outage yet, so they don’t need to do anything.”

Having a UPS system and backup generator system isn’t a guarantee that nothing will go wrong. Engineer John Cavallaro of Carolina Power & Light Co. describes one potential problem: “The UPS takes over for the loss of power, but the generator does not start. It tries to start, but after a few seconds, it shuts down.”

An outage is not the time to learn that the emergency power system is not performing properly. Cavallaro recommends having a person on site who is knowledgeable about the system’s technical aspects. This person is responsible for periodic testing throughout the year, as well as periodic cleaning and yearly adjustments.

When equipment fails, human error may well be to blame. “While performing maintenance on one UPS, a technician missed a step before returning it to operation,” says Dennis E. Mulgrew, director of mission-critical services for Consolidated Engineering Services. “As a result, the UPS did not recognize its regular power source and shut itself down, even though utility power was available.”

Standard Operating Procedures

“When it comes to maintenance and daily operations, many people treat data centers and other mission-critical facilities the same as they treat office buildings,” says Sjogren. “But critical facilities are specialty assets that need people who are highly trained to run the operation. They need well-written and exercised standard operating procedures with good maintenance programs and 24-hour emergency response teams that can react in case there is a problem.”

Even seemingly mundane tasks require strict adherence to protocol. Sayers tells of a printer operation where changing the toner in the print machines required turning off the fire alarm breaker first. When the breaker wasn’t turned off, the dust from the toner set off the fire alarm and emptied the facility.

“Many outages are caused by the performance of routine maintenance,” says Mulgrew. “Having a detailed process in place and making subcontractors follow those procedures will minimize problems.”

In service planning, day-to-day operational requirements often are not well covered. This can be corrected by getting input from operational personnel during the building’s design and engineering phases, says Robert J. Cassiliano, president and CEO of Business Information Services, Inc. (BIZ) and 7x24 Exchange chairman. Even for existing data centers, this integration needs to be a continuous process because new applications are often being implemented.

Weak operations make it hard to respond to problems in time. Mission-critical locations require 7x24 response. Not all facilities, of course, can afford to have highly trained personnel on staff all the time to correct problems before they cause facility failure. “Even for small data centers, there is the capability today to run continuously by using a building management system and some remote support capability,” says Cassiliano.

All personnel in mission-critical facilities need to be knowledgeable about procedures and receive formal and on-the-job training on new systems and equipment, says Cassiliano.

It’s also important to be aware of what’s underneath the raised floor. Data centers often have cabling rerouted as servers are moved and new server farms are installed. If the wiring under the floor is ignored, however, the discarded cabling may block airflow or prevent sensors from detecting water problems before they become major issues.

“Often when we learn there are hot spots in the data centers, we pop the floor tiles and find 50 miles of communication cabling going nowhere,” says Mulgrew. “The old lines are left until everyone is sure the new cables are up and running. And then, once the new lines are operating, the company doesn’t want to spend money reopening the floor and removing the old wires.”

Knowing the Business

Understanding the details of critical equipment is important, but equally important is the ability to grasp the ins and outs of the company’s business. The facilities executive has to be able to explain how the mission-critical facility affects the performance of the entire organization. One key is to take the terminology associated with the performance of mission-critical facilities and translate it into the dollars and cents of corporate America.

“Instead of saying the facility needs to go from three nines to four nines, we should say, ‘This facility needs to reduce its downtime from 10 hours annually to one hour per year, and that will save this amount of dollars,’” says Cassiliano.
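The translation from “nines” to hours and dollars is simple arithmetic, sketched below in Python. The function names are illustrative, and the hourly cost is an assumption borrowed from the brokerage figure cited earlier ($100,000 per minute) purely as an example; each facility would substitute its own number. Three nines works out to about 8.76 hours of downtime a year and four nines to a little under an hour, in line with the rounded figures Cassiliano uses.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(nines):
    """Expected yearly downtime for an availability of `nines` nines.

    Three nines means 99.9 percent availability, or 0.1 percent downtime.
    """
    unavailability = 10 ** -nines
    return HOURS_PER_YEAR * unavailability


def annual_savings(nines_before, nines_after, cost_per_hour):
    """Dollars saved per year by moving from one availability level to another."""
    hours_avoided = annual_downtime_hours(nines_before) - annual_downtime_hours(nines_after)
    return hours_avoided * cost_per_hour


# Assumed cost of downtime: $100,000 per minute, expressed per hour.
cost_per_hour = 100_000 * 60

print(f"Three nines: {annual_downtime_hours(3):.2f} hours of downtime per year")
print(f"Four nines:  {annual_downtime_hours(4):.2f} hours of downtime per year")
print(f"Annual savings: ${annual_savings(3, 4, cost_per_hour):,.0f}")
```

Presenting the last line of output, rather than the number of nines, is the kind of statement Cassiliano recommends bringing to senior management.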

A top executive who doesn’t understand the economic value of a mission-critical facility is more susceptible to penny-wise, dollar-foolish decision-making. Consider commissioning. At this stage of the building process, the temptation is to shave dollars. But Cassiliano says that is a major mistake. “You have just spent $40 million or $50 million to build a fault-tolerant facility that keeps running. Now you need to invest $100,000 to $200,000 for integration testing to make sure it performs as designed.”

Industry Knowledge

To be effective, facilities executives responsible for mission-critical space have to keep up with the industry at large. “There is no true curriculum at the university level where you can learn what is needed,” says John Oyhagaray, project manager of Western Union and treasurer of 7x24 Exchange.

“Facilities executives working with professional people from different backgrounds need to gather a real-world understanding of their systems,” he says. For example, the documentation of standard operating procedures from world-class facilities can be a valuable resource for other facilities executives.

“They also need to use external educational forums where they can share their knowledge and learn from their peers,” Oyhagaray says. “Today, you really need to know what other data centers are doing. Not only may you learn how to fix a potential problem in your facility, but you also become more valuable for your organization.”

Contributing editor Rita Tatum has covered facility management and technology issues for more than 25 years.


Posted on 1/1/2003
