Steps to prevent human error are essential to maintaining data center reliability
Because so much information is stored and shared digitally, data center downtime — no matter how minute — can be a significant problem. There are highly robust and efficient technologies to minimize downtime. But a plane is only as good as its pilot, and data centers are no different. Industry experts indicate that between 65 and 80 percent of data center downtime is due to human error.
“The greatest challenge is not the reliability of the hardware, or the network design, or the infrastructure,” says Bob McFarlane, principal at Shen Milsom Wilke. “Those can’t be neglected, but it’s possible today to engineer all of them with an extremely high level of reliability if the funds are available. It takes only one human error, however, to undo all the investment and all the attention to design.”
Mission-critical facility staffs must be able to handle constant change in a timely fashion — in a technologically complex infrastructure — without interrupting critical business services. That means staff skills are crucial. “No matter how advanced the equipment installed in the critical facility is, unless it is operated correctly and maintained properly, it will eventually fail with possibly catastrophic consequences for the owner or operator,” says James Szel, a senior vice president with Syska Hennessy Group.
Mike Fil, vice president for data center operations at the Philadelphia Stock Exchange, puts it bluntly. “Humans can, do and will fail.”
Simple and Clean
Good design can play an important role in preventing errors. The key is ensuring that the design makes sense to human workers — whether in-house staff or contracted vendors — who have to work with installed equipment. For example, no data center should be built without ensuring that the layout is logical and the various components within the data center are clearly identified.
“There is absolutely no reason to have illogical infrastructure in a data center,” says McFarlane.
For example, when two circuits in a cabinet are fed from two different sources, as they always should be to run dual-corded equipment, they should have the same panel and circuit numbers in both power distribution units (PDUs), McFarlane says.
This can help avoid the mistake of reading the circuit number on one outlet, and then tripping the wrong breaker in the “sister” PDU.
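A facility that keeps its circuit schedule in electronic form could audit it for the mirrored numbering McFarlane describes. The following is only a minimal sketch under that assumption; the Feed record, its field names and the sample inventory are hypothetical and are not taken from any real tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feed:
    """One power feed into a cabinet (all field names are hypothetical)."""
    pdu: str      # e.g. "PDU-A"
    panel: int    # panel number inside the PDU
    circuit: int  # circuit number on that panel


def mismatched_cabinets(feeds_by_cabinet):
    """Return cabinets whose two feeds do not mirror each other.

    A cabinet passes only if it has exactly two feeds, from two different
    PDUs, with identical panel and circuit numbers.
    """
    problems = []
    for cabinet, feeds in sorted(feeds_by_cabinet.items()):
        if len(feeds) != 2:
            problems.append(cabinet)
            continue
        a, b = feeds
        mirrored = (a.pdu != b.pdu
                    and a.panel == b.panel
                    and a.circuit == b.circuit)
        if not mirrored:
            problems.append(cabinet)
    return problems


# Hypothetical sample data for illustration only.
inventory = {
    "CAB-01": [Feed("PDU-A", 2, 14), Feed("PDU-B", 2, 14)],  # mirrored
    "CAB-02": [Feed("PDU-A", 2, 16), Feed("PDU-B", 3, 7)],   # numbers differ
}
print(mismatched_cabinets(inventory))  # ['CAB-02']
```

Any cabinet the check reports is one where a worker reading a circuit number off an outlet could reach for the wrong breaker in the other PDU.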
Care should also be taken to minimize confusion over which panel or PDU controls various outlets in the facility. According to McFarlane, color-coded labels are the simplest and most effective way of accomplishing this. Receptacle labels should be the same color as the associated PDU labels. It is a simple fix that’s possible within existing facilities, and it makes everything much easier to understand at a glance.
Whether it’s cabinet numbers, circuit numbers, PDU numbers, patch panels, cables or other items, the identifications should be as large, bold and readable as possible. Cabinets and PDUs should bear placards that can be read from across the room. Receptacles also should be labeled with easily read, permanent graphics, according to McFarlane.
Cable labels should follow a uniform and logical scheme that immediately identifies the purpose of the cable without requiring a worker to resort to books or schedules, and labels should be placed so they can be read easily without unbundling cable packs, McFarlane says. Patch panels, which are difficult to label because of their high densities and small label space, should carry a “master label” on each panel so that each jack identification needs only a few characters. Finally, labels should make it intuitively clear where cables originate and terminate.
Permanent rack and cabinet cable inter-tie infrastructure can minimize the amount of ad hoc cabling needed, which in turn minimizes cable-tracing errors, cable damage, and accidental disconnection of the wrong cable, provided everything is labeled clearly and consistently. Again, color coding can make cable purposes, patch panel segmentation and patching more obvious and consistent.
Emergency power off (EPO) buttons should be especially well-labeled. Consider the anecdote offered by Chris Wade, senior manager, ISD facilities operations for Wal-Mart, to illustrate the need for logical and well-placed information about every component in a data center.
A custodian was cleaning the data center raised floor at a major banking institution. When he finished working in the room, he walked toward the exit door and saw a red button on the wall next to the door. He reached up and pushed the button, thinking it was the door operator used to exit the area. “Of course, as soon as he hit the button, the room became eerily quiet,” Wade says. “There was dead silence.” The button that the custodian pushed was actually the EPO switch.
Play By the Rules
Another way to minimize human error is by having the right policies in place — and enforcing them. Deviation from set policy is frequently cited as one of the leading reasons for mission-critical facility failure.
McFarlane advocates use of policies for every scenario. “All equipment adds, moves and changes should be documented ahead of time,” he says. That documentation should include the cabinet, rack unit location, and circuit and cable numbers.
He also promotes reliance upon a “buddy system” when making changes. In other words, no equipment change should be made without two people present to maximize quality control.
“Every change should be checked and signed off by a third party before being activated,” McFarlane says. “If that sounds like time you can’t afford, just consider the time required to restore a critical service, as well as to explain what caused the problem. It’s probably more efficient to do it the right way.”
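The three rules described here (document the change ahead of time, have two people present, and get an independent sign-off) could be enforced by a simple gate in a change-tracking system. The sketch below is one possible shape for such a gate, not a description of any system used by the people quoted; the ChangeRequest record and every field name in it are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """A planned add, move or change (field names are hypothetical)."""
    cabinet: str = ""
    rack_unit: str = ""
    circuit: str = ""
    cables: list = field(default_factory=list)
    performers: list = field(default_factory=list)  # people doing the work
    reviewer: str = ""                              # independent sign-off


def blocking_issues(change):
    """List the reasons a change may not be activated yet."""
    issues = []
    # Rule 1: location and circuit details are documented ahead of time.
    for name in ("cabinet", "rack_unit", "circuit"):
        if not getattr(change, name):
            issues.append(f"{name} not documented")
    if not change.cables:
        issues.append("cable numbers not documented")
    # Rule 2: the buddy system requires two distinct people.
    if len(set(change.performers)) < 2:
        issues.append("buddy system: two people required")
    # Rule 3: sign-off must come from someone outside the working pair.
    if not change.reviewer:
        issues.append("no third-party sign-off")
    elif change.reviewer in change.performers:
        issues.append("reviewer must not be one of the performers")
    return issues


# Hypothetical request that breaks the second and third rules.
draft = ChangeRequest(cabinet="CAB-07", rack_unit="U12", circuit="2-14",
                      cables=["C-0412"], performers=["alice"],
                      reviewer="alice")
for issue in blocking_issues(draft):
    print(issue)
```

An empty list means the change is clear to activate; anything else tells the requester exactly which rule still has to be satisfied.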
Policy can be extended to another weak point in some data center operations: physical security. Very secure data centers have access requirements so tight that even the CEO of the organization is denied access. The rationale is simple: If people have no business in data center space, C-level executives included, why take the chance that one of them might inadvertently activate the EPO button?
All visitors, even regular and trusted vendors, should get prior clearance from data center staff, an orientation briefing and an escort. Most importantly, visitors to sensitive mission-critical facilities should have a valid reason that meets company guidelines.
Everyone should follow established procedures. Wade recalls a situation in which procedure was not followed and downtime occurred as a result.
A maintenance technician was beginning preventive maintenance on a small UPS system. The facility was supposed to be powered by the generator while work was in progress. The technician shut down the UPS to begin his work without following procedures. Unfortunately, the switch for the generator control panel was in manual mode, not in automatic. The generator did not start, and power to the servers in the data room was dropped.
“If the technician had used the checklist provided in the procedure, he would have found the issue with the switch and this incident would have never occurred,” Wade says.
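A checklist like the one Wade mentions can also be expressed as data and checked before work begins. The sketch below assumes the relevant readings are available as simple key-value pairs; the checklist items, key names and values are hypothetical and are loosely modeled on the incident described above, not on Wade's actual procedure.

```python
def failed_checks(state, checklist):
    """Return the description of every checklist item the state fails.

    A reading that is missing from the state counts as a failure, so an
    unverified item can never pass silently.
    """
    return [description
            for description, key, expected in checklist
            if state.get(key) != expected]


# Hypothetical pre-maintenance checklist: (description, key, expected value).
UPS_MAINTENANCE_CHECKLIST = [
    ("Generator control switch is in automatic", "generator_mode", "automatic"),
    ("Facility load is running on generator power", "load_source", "generator"),
]

# Hypothetical readings resembling the incident: switch left in manual.
readings = {"generator_mode": "manual", "load_source": "ups"}

problems = failed_checks(readings, UPS_MAINTENANCE_CHECKLIST)
if problems:
    print("Do not shut down the UPS:")
    for description in problems:
        print(" -", description)
```

Treating a missing reading as a failure is a deliberate choice in this sketch: it forces someone to verify each item rather than assume it.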
Train, Then Test
A growing number of corporations have come to realize the business value gained by providing a continuous training program for their data center facilities staff. The more knowledgeable the staff is, the more efficient and effective the operations and maintenance of the facility will be. That means the data center will be more reliable while the overall operating cost will be lower.
Effective training programs are designed to prepare staff for the full range of situations they may encounter. One example is the use of “fire drills.” The data center version of a fire drill can be a business-saving tool that teaches facility staff how to react in the event of an operational emergency.
“Unless the engineering personnel are regularly trained in responding to operational emergencies, even skilled staff may fail to respond appropriately,” says Szel.
Emergency drills should mimic a true downtime incident as closely as possible. Szel also recommends having knowledgeable personnel carefully observe and assess the staff. Evaluating staff members and then sharing the results of the assessment can be extremely useful in balancing the individual skill sets of the staff.
Talk to Facilities Staff
One obstacle to smooth data center operations is the gulf between IT staff and facilities staff, says McFarlane. “Each department needs to understand that both the operational and fiscal goals of the company must come first,” he says.
Collaboration between facilities and IT can help reduce downtime. How is this accomplished? Negotiate at as high a corporate level as necessary to get select facility staff assigned to work routinely, if not exclusively, with IT. Facility staff should be included in the planning stages of system upgrades and changes. Conversely, IT staff should spend time with facility staff who are negotiating UPS or air-conditioning service contracts. Facilities and IT staff should go to conferences together that address the power and cooling requirements for data centers. In short, the two departments should learn each other’s needs, frustrations and problems.
That’s exactly the case at the Philadelphia Stock Exchange. The data center for the Stock Exchange is nearly 25 years old. Because of age and data center growth, IT and facilities concerns are intertwined. Consequently, Fil and his IT staff have come to rely upon the knowledge of facilities staff quite heavily. “We are strong believers in predictive and preventive maintenance,” he says.
When Contracting Is a Good Idea
Contracting services out might seem like the simpler — and less expensive — solution when operating a data center, but third-party contractors may not be the best option for a particular organization.
According to Wade, independent studies indicate that contracting out mission-critical facility services yields a 20 to 40 percent cost reduction. He cites areas where contractors often have an advantage over in-house staff, including:
- Improvements in management techniques.
- Better and more productive equipment.
- Greater financial incentives to innovate.
- Incentive pay structures.
- More efficient deployment of workers.
- Greater use of part-time and temporary employees.
- More work scheduled for off-peak hours.
But in-house staff for mission-critical facilities offers benefits that are hard to quantify. Depending on the organization, the following strengths of in-house staff can be important:
- Better organizational and market-specific knowledge.
- Stronger customer orientation.
- Better knowledge and history of the organization’s facility.
- Part of the company culture.
- Greater focus on maintaining the company’s interests.
Determining which benefits are of greatest importance can help an organization choose how to use contracted help. Some businesses use a mix of in-house staff and contracted workers because it best suits their organizational mission.
That’s the case with the Philadelphia Stock Exchange’s data processing operations, which use contractors to support power and cooling systems.
“Cost-prohibitive” is how Fil describes the idea of having in-house facility staff for the data center on site 24/7. “If a problem occurs, we have built sufficient redundancies in our power and cooling systems that allows us to operate until personnel and vendors can arrive onsite and resolve the situation.”
About 7x24 Exchange
7x24 Exchange is a member organization for executives in mission-critical facilities. The Exchange’s mission is “to improve end-to-end reliability by promoting dialogue among those who design, build and maintain mission critical enterprises and informational infrastructures.”
Founded in 1989, 7x24 Exchange was shaped by a group of technology and facility professionals. “At that time, information technology and facilities executives worked in separate silos, not as a team,” says Robert Cassiliano, chairman of 7x24 Exchange and president and CEO of Business Information Services, a tech services company. “We decided we needed a forum where people could talk together about the issues and understand the problems that facilities people face and technology people face. We held our first meeting in a brokerage house in New York City with 16 people.”
Cassiliano says that the most valuable asset the organization brings to professionals is its twice-yearly conferences. “We added seminars during our last conference because we wanted to add content that provides value to conference participants and their companies,” he says.
Twice-annual conferences are helpful, Cassiliano says, because of the rate of change in the industry. “The collaborative nature of 7x24 Exchange members helps keep companies informed about the many new developments taking place in the rapidly changing world of uninterruptible systems and infrastructures,” Cassiliano says. “The conferences are an opportunity for members to see each other twice a year and share knowledge, success stories and insight.”
In addition to conferences, 7x24 Exchange publishes Newslink and encourages participation in regional 7x24 Exchange chapters to further enhance knowledge sharing among members.
The 7x24 Exchange recently upgraded its Web site, adding features to make the site more useful to facility executives:
- On discussion forums, facility executives can exchange information with their peers and post questions on industry topics.
- A technology news section highlights such developments as the introduction of new chips for servers.
- Links connect to the Web site for the 7x24 Exchange twice-yearly conferences, where potential attendees can interact while previewing session descriptions, speaker background information, special events and tutorial sessions of interest.
- A career center caters to both job seekers and employers with a searchable resume database, automatic notification of new postings and job tracking capabilities.
Loren Snyder is a freelance writer and former managing editor for Building Operating Management.