Formula for Reliability
By John Talamo - September 2002 - Data Centers
When mission-critical systems are compromised by inadequate design or management, the potential for revenue loss is great. For example, it’s estimated that the average downtime cost for a brokerage environment can be as high as $6 million per hour. But downtime costs aren’t that high for every mission-critical space. So it’s crucial, early in the process, to determine what level of system availability is appropriate for the application and budget. (Availability is the degree to which a system performs as specified when called upon, expressed as a percentage of time.)
This process must be extended to all support systems. For example, the availability level for the mechanical system design should equal that of the electrical system. If the systems are designed to different standards, the lower of the two will dictate the availability of the entire facility. Establishing criteria like this will set the tone for the entire project and for operating conditions throughout the life of the facility.
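The point about the weaker system can be illustrated with a minimal sketch. It assumes the facility needs both systems running at once and that their failures are independent, so the combined availability is the product of the two. The availability figures are hypothetical and chosen only for illustration.

```python
from math import prod


def series_availability(availabilities):
    """Availability of a chain in which every subsystem must be up.

    Assumes subsystem failures are independent of one another.
    """
    return prod(availabilities)


# Hypothetical figures: a "five nines" electrical system paired with
# a "three nines" mechanical system.
electrical = 0.99999
mechanical = 0.999

facility = series_availability([electrical, mechanical])
print(f"facility availability: {facility:.6f}")
print(f"weakest subsystem:     {min(electrical, mechanical):.6f}")
```

Under these assumptions the facility figure can never exceed that of the weakest subsystem, so money spent raising only the stronger system buys very little.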
The key is to understand what costs are associated with various levels of availability. The only objective way to accomplish this is to quantify the availability for a particular design (99.99 percent equals about 53 minutes of downtime a year, 99.999 percent equals about 5.3 minutes a year, and so on) and use it to identify the opportunity and construction cost for each. Simply put, the higher the availability, the higher the cost. Finding the right balance is ultimately a business decision.
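The arithmetic behind those downtime figures is simple enough to sketch. The example below assumes a 365-day year. The function names are illustrative, and the cost rate is the brokerage estimate of $6 million per hour cited earlier, used here only as a sample input.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def annual_downtime_minutes(availability_percent):
    """Expected minutes of downtime per year at a given availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


def annual_downtime_cost(availability_percent, cost_per_hour):
    """Expected yearly downtime cost at a given hourly cost of downtime."""
    hours_down = annual_downtime_minutes(availability_percent) / 60.0
    return hours_down * cost_per_hour


for pct in (99.9, 99.99, 99.999):
    minutes = annual_downtime_minutes(pct)
    cost = annual_downtime_cost(pct, 6_000_000)
    print(f"{pct}% -> {minutes:.1f} min/yr, ${cost:,.0f}/yr")
```

Setting the yearly downtime cost at each availability level beside the construction cost of achieving it gives the business side a concrete basis for the trade-off.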
Once the appropriate level of availability is determined, the system should be designed to deliver that level of performance. Often, however, designs undergo shortsighted value engineering to keep costs down. Although it is important to remain within budget, it also is important to recognize that some value engineering recommendations can compromise a system’s reliability. The engineer/consultant must articulate the impact of a recommendation on the system before it is implemented so that everyone involved understands the positive and negative consequences.
But that’s not to say that everything done to make a design cost effective will sabotage reliability. For example, an electrical system arranged in a 2N configuration is a very reliable system. But there are alternatives that can achieve an availability level very close to a 2N system with significant cost savings. Block redundancy, for example, can be used: If three UPS systems are required to address the load, four systems are installed. Block redundancy can reduce costs up to 30 percent, depending on the size of the facility, thanks to a lower component count and faster installation time. The engineer should present and discuss pros and cons for a number of approaches. A similar process is necessary to determine the proper mechanical systems, cable plant and other infrastructure requirements.
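A rough comparison of the two arrangements can be sketched with a simple probability model. It assumes identical UPS units that fail independently, a hypothetical per-unit availability of 99.9 percent, and a 2N system modeled as two independent banks of three units, either of which can carry the load. It ignores distribution paths, common-mode failures and maintenance outages, all of which matter in a real design.

```python
from math import comb


def k_of_n_availability(k, n, unit_availability):
    """Probability that at least k of n identical, independent units are up."""
    a = unit_availability
    return sum(comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(k, n + 1))


unit = 0.999  # hypothetical availability of a single UPS system

# No redundancy: all three required units must be up.
n_only = k_of_n_availability(3, 3, unit)

# Block redundancy from the example above: three needed, four installed.
block = k_of_n_availability(3, 4, unit)

# 2N: two independent banks of three; the load survives if either bank is up.
two_n = 1 - (1 - n_only) ** 2

for label, value in (("N", n_only), ("block (N+1)", block), ("2N", two_n)):
    print(f"{label:12s} {value:.7f}")
```

In this simplified model the block-redundant and 2N figures land within a few parts per million of each other, and both sit well above the non-redundant case. A fuller study would model the distribution equipment as well as the UPS units.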
An often-overlooked problem arises when two separate critical systems are designed to different reliability standards. In the event the computer equipment of one system is connected to the computer equipment of the other system via data cabling, the entire network can be compromised by the less reliable system. To avoid this situation, careful integration is required between the engineering and technology groups.
Systems also should be designed for ease of maintenance. That includes not only proper physical clearances but also the flexibility to shut down and work on components while maintaining service to the critical loads.
The systems must also be designed to avoid single points of failure, which redundancy elsewhere in the system cannot remedy. Care must be taken to ensure that the failure of any single component, sub-system or element, independent of the failure of any other, does not interrupt the critical load.
Part of the solution to the problem of single points of failure is a dual-path topology to all loads. This will provide an alternate source in the event of a component failure. The use of automatic static transfer switch technology plays an extremely important role in eliminating single points of failure by rerouting power to critical loads at speeds that do not affect sensitive computer loads.
Integration should not end there. Seemingly innocuous things — separate electrical feeders entering a common pull box, for example, or single control power circuits serving more than one chiller — can compromise a facility’s reliability by bringing down multiple components with a single event, despite built-in redundancies. In addition, the use of short circuit and selective coordination studies is imperative to isolate and limit faults.
Good design alone can’t assure reliability; systems maintenance and operational procedures for critical facilities are equally important. Innovative engineering design coupled with on-site engineering support is the only way to assure facility reliability.
Benchmarking Is Key
Preventive maintenance and testing are crucial. The key is to benchmark the facility on a routine basis and identify performance deviations from the original design specifications. Done properly, this will provide an early warning mechanism to allow potential failures to be addressed and corrected before they occur.
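As a minimal sketch of what routine benchmarking might look like in software, the example below compares measured readings against design values and flags any that drift beyond a tolerance. The parameter names, design values, readings and 5 percent tolerance are all hypothetical. A real program would take them from the facility's design specifications and would likely set a separate tolerance for each parameter.

```python
def find_deviations(design_spec, measured, tolerance_percent=5.0):
    """Return readings that differ from design by more than the tolerance.

    Both arguments map a parameter name to a number. Design values are
    assumed to be nonzero. The result maps each out-of-tolerance
    parameter to its deviation from design, in percent.
    """
    deviations = {}
    for name, design_value in design_spec.items():
        if name not in measured:
            continue  # no reading taken for this parameter
        deviation = (measured[name] - design_value) / design_value * 100.0
        if abs(deviation) > tolerance_percent:
            deviations[name] = round(deviation, 1)
    return deviations


# Hypothetical design values and benchmark readings.
design = {
    "ups_output_voltage_v": 480.0,
    "chilled_water_supply_f": 45.0,
    "generator_start_s": 10.0,
}
readings = {
    "ups_output_voltage_v": 478.0,
    "chilled_water_supply_f": 48.5,
    "generator_start_s": 10.2,
}

flagged = find_deviations(design, readings)
print(flagged)
```

Here only the chilled water supply temperature falls outside the tolerance. Tracking the same comparison over successive benchmarks is what turns it into an early warning.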
Once deficiencies are identified, and before any corrective action can be taken, a method of operation (MOP) must be written. The MOP will clearly stipulate step-by-step procedures and conditions, including who is to be present, the documentation required, phasing of work and the state the system is to be placed in after the work is completed.
The MOP will greatly reduce errors and potential system downtime by identifying the responsibilities of vendors, contractors, the owner, the testing entity and anyone else involved. In addition, a program of ongoing operational staff training and procedures is important for dealing with emergencies that fall outside the regular maintenance program.
It is often difficult for facility executives to allocate resources to monitor changing loads. Typically, changes to and additions of computer equipment go unregulated. Different end-user groups often make modifications without a central clearinghouse to monitor the impact on the infrastructure. As a result, it becomes very difficult to know where a facility stands with respect to such things as system loading, available spare capacity, impact under fault conditions, cooling capabilities and cable management. To keep this from becoming a problem, a load management system with a single-point gatekeeper should be incorporated into daily facility operations.
A load management system will provide a tool for ongoing monitoring and maintenance. At a minimum, it should include an inventory of computer equipment with mechanical and electrical load data, location and numbering system for all equipment, the way each piece of equipment is served electrically, and a load schedule for the electric service, UPS system, power distribution units and generators. The goal is to have an up-to-date snapshot of the facility and to understand the response of the system under fault conditions. The load management system would also take into account the dynamic effects of automatic static transfer switches and dual- and triple-corded computer equipment to ensure that design parameters are not exceeded during a fault or equipment failure.
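The core of such a system can be sketched as an equipment inventory plus a check of how load shifts when a source is lost. The example below is a simplified illustration, not a description of any particular product. The equipment tags, source names, loads and capacities are hypothetical, and it assumes each item's load splits evenly across whichever of its sources remain available.

```python
from dataclasses import dataclass


@dataclass
class Equipment:
    tag: str          # numbering-system identifier
    location: str
    load_kw: float
    sources: tuple    # one power source name per cord


def source_loads(inventory, failed=frozenset()):
    """Return (load per source in kW, tags of equipment left without power).

    Assumes each item's load divides evenly across its surviving sources.
    """
    loads = {}
    dropped = []
    for item in inventory:
        alive = [s for s in item.sources if s not in failed]
        if not alive:
            dropped.append(item.tag)
            continue
        share = item.load_kw / len(alive)
        for source in alive:
            loads[source] = loads.get(source, 0.0) + share
    return loads, dropped


def overloaded(loads, capacity_kw):
    """Return the sources whose load exceeds their rated capacity."""
    return {s: kw for s, kw in loads.items() if kw > capacity_kw[s]}


# Hypothetical inventory: two dual-corded servers, one single-corded switch.
inventory = [
    Equipment("SRV-01", "Row 1", 8.0, ("PDU-A", "PDU-B")),
    Equipment("SRV-02", "Row 1", 6.0, ("PDU-A", "PDU-B")),
    Equipment("SW-01", "Row 2", 2.0, ("PDU-A",)),
]
capacity_kw = {"PDU-A": 12.0, "PDU-B": 12.0}

normal_loads, _ = source_loads(inventory)
fault_loads, dropped = source_loads(inventory, failed={"PDU-A"})

print("normal:", normal_loads, overloaded(normal_loads, capacity_kw))
print("PDU-A lost:", fault_loads, overloaded(fault_loads, capacity_kw), dropped)
```

In this example both sources are within capacity under normal conditions, but losing one pushes the survivor past its rating and drops the single-corded switch. That is the kind of fault-condition surprise an up-to-date load schedule is meant to reveal before equipment is added.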
The right resources are not always available to manage mission-critical facilities. As a result, it can be useful to develop relationships with third parties that bring expertise in design, construction and operations. The key is to develop partnerships with companies that can provide a broad range of experience in all aspects of critical systems.
It is only by taking a holistic approach to critical system facility design — looking at all design issues and assuring that ongoing maintenance and testing occur — that high reliability goals will be achieved.
John Talamo manages EYP Mission Critical Facilities® Inc.’s corporate office in New York City. Talamo has more than 20 years of experience in the design and management of highly reliable mechanical and electrical system projects, especially for corporate facilities.