In Data Centers, Human Error Is Most Common Cause of Downtime
Part 1 of a 3-part article explaining how one organization with multiple data centers successfully uses a dual power path environment.
Most professionals in the data center industry have come to recognize that human error is the most common cause of downtime. A review of thousands of facilities system incident reports collected over more than a decade demonstrates risk is highest when computer hardware is installed or removed.
This continues to be the case today. Though “dual power” provides a more forgiving environment and a greatly enhanced potential for continuous operation, the reality is that managing computer room electrical distribution is more complex now than it was during the single power path era. Some organizations have managed this challenge more successfully than others.
A major financial services organization based in Texas has realized significant success by addressing that challenge. The firm identified the opportunities for failure in a dual power path environment, then presented a convincing business case for needed change to the managers of each department that shares the installation and de-installation process. It was also essential to get buy-in and consistent support from senior management. In addition to implementing recommended changes, the firm consistently applied these new processes through written documentation, training, and continued monitoring.
This organization operates multiple data centers nationally. These include two at its Texas headquarters campus: one with 62,000 square feet of raised floor space (2,400 kW) and another with 15,000 square feet of raised floor space (1,060 kW). UPS systems are 2(N+1). Dual power paths are provided to each device and cabinet.
The computer hardware profile is varied, ranging from mainframe and storage devices to servers and network devices in cabinets. A few devices are single corded; any of these deemed critical are connected to automatic transfer switches (ATS). Thousands of individual devices occupy the computer rooms, and the average number of installations or removals is 100 per month. This is a relatively high churn rate for a data center, and it makes continuous operation harder to achieve than in a more static facility.
This company has operated 90 dual-input power distribution units (PDUs) across the two data centers referenced above for more than a decade. It conducts preventive maintenance on each of these on a three-year cycle, with a staggered schedule, so an average of two to three PDU preventive maintenance activities occur each month. As with most data center owners, the company traditionally spread the responsibilities of installing and removing computer devices, network cables, and power cords across multiple departments, with multiple individuals involved in each group.
Although this arrangement provided for a successful operation most of the time, it did result in several surprise power interruptions to computer hardware devices over time, before the necessity for change resonated with some involved in the process. In addition to discoveries of misconnected devices when a rare single power path failure occurred, there were a few surprises when a power path was purposely shut down for preventive maintenance or repair to a PDU. In general, these were found to involve someone who attempted to install or remove a device without proper authorization, training, and understanding of the configuration.
In this organization’s effort to reduce the risk of additional surprise interruptions, the facilities department successfully articulated to managers in each of the involved departments the problem with continuing to operate as they had. Several issues created unnecessary risk. For one thing, too many departments were involved, including outside vendors who pre-configure some of the racks. On top of that, there was a lack of written, site-specific, repetitive processes. What’s more, training was insufficient.
With buy-in from the senior managers in each department involved in computer hardware installations and removals, the facilities department implemented a series of changes roughly eight years ago.
Today, each device installation or removal must be submitted to the facilities department online via a specific request form. The person making the request must indicate model number, serial number, device name, and desired location within the data center for any new device. The facilities department uses this information to determine projected power draw, heat load, and available capacity of the PDUs, remote power panels, and ATS within the desired location to support the new device or devices. The request is accepted or rejected based on available capacity.
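The accept-or-reject step described above can be pictured with a short sketch. This is only an illustration, not the firm's actual tool: the class names, field names, and the `evaluate_request` function are hypothetical, the projected draw is supplied directly rather than looked up from the model number, and the sketch assumes that every power source feeding the location (for example the A-side and B-side PDUs) must be able to carry the device's full projected draw on its own.

```python
from dataclasses import dataclass


@dataclass
class PowerSource:
    """A PDU, remote power panel, or ATS feeding the requested location (hypothetical model)."""
    name: str
    capacity_kw: float  # usable capacity, however the site defines it
    load_kw: float      # load already committed to this source

    def headroom_kw(self) -> float:
        return self.capacity_kw - self.load_kw


@dataclass
class InstallRequest:
    """Fields the article says the requester supplies, plus an assumed projected draw."""
    model_number: str
    serial_number: str
    device_name: str
    location: str
    projected_draw_kw: float  # assumption: already derived by facilities from the model


def evaluate_request(request: InstallRequest, sources: list[PowerSource]) -> tuple[bool, list[str]]:
    """Return (accepted, names of sources lacking capacity).

    Assumption: each source must carry the full projected draw by itself,
    so the device keeps running if the other power path is shut down.
    """
    short = [s.name for s in sources if s.headroom_kw() < request.projected_draw_kw]
    return (len(short) == 0, short)


if __name__ == "__main__":
    request = InstallRequest("MODEL-X", "SN-0001", "example-server", "row-4-cab-12", 3.0)
    sources = [PowerSource("PDU-A", 100.0, 90.0), PowerSource("PDU-B", 100.0, 98.0)]
    accepted, short = evaluate_request(request, sources)
    print("accepted" if accepted else f"rejected, insufficient capacity on: {short}")
```

In this made-up example the B-side source has only 2 kW of headroom against a 3 kW request, so the request would be rejected. A real process would also weigh heat load, which the sketch leaves out.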
Top photo caption: Red tape identifies the B side of the two cords supplying power to dual-corded devices. No tape is placed on the cord that is fed from the A side. In addition, blue tape is placed on each power feed to a single corded device.