
4 FM quick reads on critical facilities

1. Developing a Fire Protection Plan for Data Centers


A properly designed and selected fire detection and notification system is an essential component of an overall data center protection strategy. Aspects to consider when evaluating fire detection options include early initiation and response, as well as the interface with other systems such as suppression, ventilation shutdown, and early operator warning.

In today's data centers, power densities may exceed 400 watts per square foot, almost all of which is transformed to sensible heat by the servers, storage devices, and networking equipment located within the data center's usable raised-floor area (white space). Such power densities require significant quantities of air (3,700 cfm or greater) to provide cooling at typical design conditions. Fire detection and suppression system designs need to account for the challenges posed by the high airflow velocities associated with these power densities.
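To see why the airflow numbers climb so quickly with power density, consider the standard sensible-heat airflow relation used in HVAC sizing. The sketch below is a rough illustration only; the zone area and air-side temperature rise are assumptions chosen for the example, not figures from the article.

```python
# Rough sizing sketch: cooling airflow needed to remove the sensible heat of an IT load.
# Uses the common sensible-heat relation CFM = BTU/hr / (1.08 * delta_T_F).
# Zone area and temperature rise below are illustrative assumptions.

WATTS_TO_BTU_HR = 3.412      # 1 W = 3.412 BTU/hr
SENSIBLE_HEAT_FACTOR = 1.08  # BTU/hr per (cfm * deg F) for standard air

def cooling_cfm(watts_per_sqft: float, area_sqft: float, delta_t_f: float) -> float:
    """Airflow (cfm) needed to carry away the sensible heat of the IT load."""
    heat_btu_hr = watts_per_sqft * area_sqft * WATTS_TO_BTU_HR
    return heat_btu_hr / (SENSIBLE_HEAT_FACTOR * delta_t_f)

# Example: 400 W/sq ft over a hypothetical 50 sq ft zone with a 20 deg F air-side rise
print(round(cooling_cfm(400, 50, 20)))  # prints 3159 (cfm) for this hypothetical zone
```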

There are numerous detection approaches available for use within data centers to provide early warning of a fire. The inherent criticality and essential nature of the equipment in data centers will often dictate the detection approach. Detection strategies typically include spot-type smoke detectors, air-aspirating smoke detectors (e.g., smoke sampling chamber with a sampling tube network), or a combination thereof. As with any approach, there are advantages and disadvantages to these strategies. The data center design and stakeholder goals help to guide the detection design.

Spot-type smoke detectors are a very common smoke detection strategy used in data centers. Newer intelligent spot-type detectors use built-in algorithms with multiple sensing criteria to adapt to their environment and minimize the likelihood of false alarms. These detectors often rely upon multiple analog sensors, such as smoke, heat, and carbon monoxide sensing elements, whose readings are processed through proprietary algorithms.

Cross-zoned smoke detection is typically the preferred strategy when utilizing spot-type detectors in data centers. This design relies upon the activation of two alarms before subsequent action, such as opening of a pre-action valve or clean agent discharge. A cross-zoned strategy minimizes the potential for an unwarranted discharge of a fire suppression system. The initial detector can provide a warning to operators and staff within the data center. The need for two separate detectors to activate, however, results in activation delay. This delay may compromise property protection and business continuity goals.
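To make that sequencing concrete, here is a minimal sketch of cross-zoned release logic as the article describes it: the first detector alarm only warns, and release requires alarms from two separate zones. This is an illustration only, not a listed releasing-panel implementation.

```python
# Minimal sketch of cross-zoned release logic: a single detector alarm produces a
# warning only; release (e.g., opening a pre-action valve or discharging clean agent)
# requires alarms from detectors in two different zones. Illustrative only.

class CrossZonePanel:
    def __init__(self):
        self.zones_in_alarm = set()

    def detector_alarm(self, zone: str) -> str:
        self.zones_in_alarm.add(zone)
        if len(self.zones_in_alarm) >= 2:
            return "RELEASE"   # second, cross-zone alarm: actuate suppression
        return "WARN"          # first alarm: notify operators, investigate

panel = CrossZonePanel()
print(panel.detector_alarm("zone-A"))  # WARN
print(panel.detector_alarm("zone-A"))  # WARN (same zone, still only one zone in alarm)
print(panel.detector_alarm("zone-B"))  # RELEASE (two separate zones in alarm)
```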

Air-aspirating or air-sampling detection is becoming increasingly popular in data center applications. This type of detection is known for its ability to detect a fire in its incipient stages and therefore provide earlier warning and faster response time than traditional detection.


2. Pay Close Attention to Staffing Challenges in Critical Facilities

One key consideration for reliable operations in critical facilities is staffing. Although the data center industry has been successful at "hardening" facilities and physical infrastructure, it has not done as well with the associated operating staff and facility management aspects.

It is widely recognized that the vast majority of critical facility problems can now be attributed to human error (some sources claim as high as 70 to 80 percent of problems). There has been a direct correlation between the increase in infrastructure complexity and the increase in human error by the operating staff.

The problem here is not one of availability. Most critical facilities have staff on-site continuously (100 percent availability). The problem is in staff reliability (and in some cases validity). Unlike computers, people get tired, distracted, sick, confused, etc., which can all lead to unreliable performance.

The answer is to have processes that produce reliable results. Detailed, step-by-step procedures are a good example, but unless they are followed correctly each time, there is no guarantee of a reliable outcome. On the other hand, if operating staff is required to initial a checklist as each step is completed, and to have the action witnessed by a separate participant, the reliability of the process improves greatly. If the procedure also describes the expected outcome or result associated with each step, such as the expected pressure and flow when starting a pump, or the expected indicating lights and annunciations when closing a breaker, then the validity of the process is ensured. Operating staff who blindly follow a procedure without equal attention to the results will inevitably produce unintended outcomes.
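One way to picture the difference between merely completing steps and verifying them is a procedure record that captures the expected result, the observed result, and the witness for each step. The sketch below is hypothetical; the step wording and values are made up for the example.

```python
# Hypothetical sketch of a verified procedure step: the performer initials the step,
# a second person witnesses it, and the observed result is checked against the
# expected result before the procedure may continue. All values are illustrative.

from dataclasses import dataclass

@dataclass
class ProcedureStep:
    action: str
    expected_result: str
    performer_initials: str = ""
    witness_initials: str = ""
    observed_result: str = ""

    def is_complete_and_valid(self) -> bool:
        return (bool(self.performer_initials)
                and bool(self.witness_initials)
                and self.observed_result == self.expected_result)

step = ProcedureStep(
    action="Start chilled-water pump CHWP-2",
    expected_result="Discharge pressure 45 psi, flow 900 gpm",
)
step.performer_initials = "AB"
step.witness_initials = "CD"
step.observed_result = "Discharge pressure 45 psi, flow 900 gpm"
print(step.is_complete_and_valid())  # True only when initialed, witnessed, and verified
```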

Performance-based training is also a process. Training a new computer or controller is simple and quick: download the programming, connect to the network, and the new computer is 100 percent as capable as the failed computer or controller it replaced. Training new staff isn't quite as easy. People are individuals. Each of us is one of a kind.

3. How to Limit Human Error in Critical Facilities

Over the last 15 years, most building operators have come to recognize that people account for the majority of interruptions to critical operations. Human error is identified as the root cause in 60 to 80 percent of data center downtime events, year after year.

Those who maintain the critical facility's infrastructure systems require written procedures to consistently carry out riskier activities such as system transfers, in which redundancy is reduced as equipment is brought off-line for maintenance or repair. Just as important are procedures for resolving emergency scenarios. A critical facility may require 150 to 200 documents to cover both of these categories, given the number of infrastructure systems involved. This number seems high compared to a non-critical facility's needs; by comparison to another critical endeavor, however, it is roughly one-fifth the number of procedures required to operate a nuclear submarine.

In all cases, procedures need to be site-specific, as each facility's configuration is unique. One individual on the facilities staff must be assigned the role of procedures owner and be provided dedicated time each month to make continual progress with the program. Typically, the procedures owner is provided a contracted resource to get the program started.

Written processes are even more important in areas to which personnel from multiple departments have access. In a data center facility, the computer room is the most critical of these. Tasks performed there present the greatest risk of error, because multiple departments are involved and human activity occurs within the room more frequently.

To reduce the high potential for error when multiple groups work together in one space, it is necessary to develop written mutual expectations between the departments involved. Some organizations refer to these as internal service level agreements. The documents can be as simple as one page, but must be endorsed by each department head and be consistently enforced.

4. Complexity Complicates Data Center Maintenance

Complexity is built into much of today's infrastructure equipment. Just open the panels and cabinets of uninterruptible power supply (UPS) units, paralleling switchgear, chiller control panels, and the like, and look inside. To most operating staff, this equipment has essentially become a set of black boxes. As the infrastructure has outpaced the staff's ability to troubleshoot and repair it, reliance on good maintenance practices becomes even more crucial.

Computers, programmable logic controllers, device-specific controllers, etc., are essentially "black boxes," which can complicate data center operations and maintenance. They typically don't give advance notice of pending failure, and when they do fail, the operating staff cannot make repairs or replacements. They have to call for vendor support and take manual control of the infrastructure involved.

The basic purpose of maintenance is to increase the availability of the equipment (and systems) being maintained. At the bottom of the pile is "corrective maintenance," or simply put, "fix it when it breaks." It takes the least effort from a management perspective, but results in the lowest availability and in most cases ends up costing the most in both total cost of ownership (TCO) and impact to operations.

The next rung up is preventive maintenance, where you (hopefully) follow the manufacturer's recommendations to inspect and care for the equipment to extend its life and optimize its performance. In this case, you live with some planned unavailability (shutdowns) to afford the opportunity to care for the equipment (check belts, change filters, torque connections, etc.). The result is increased lifespan, more reliable performance, and lower failure rates.

The best practice is to supplement a preventive maintenance program with predictive maintenance using on-line condition-monitoring technologies. The most common and valuable on-line condition-monitoring technologies are thermography (infrared scanning) and vibration analysis. These monitoring techniques not only provide incredible insight regarding the health of the equipment, but actually require the equipment to be in operation, so the need for outages is reduced. By trending the results over time, a facility manager can see the health of the equipment start the inevitable decline towards predefined thresholds and "predict" when the equipment condition or performance will be adversely affected.
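As a simple illustration of that trending idea, a straight-line fit through periodic condition readings can estimate when a predefined alarm threshold will be crossed. The sketch below uses made-up vibration readings and a made-up threshold; it is not from the article and is not a substitute for a proper condition-monitoring program.

```python
# Illustrative sketch of trending condition-monitoring data: fit a straight line
# through periodic vibration readings and estimate when a predefined threshold
# will be crossed. Readings and threshold are hypothetical.

def predict_days_to_threshold(days, readings, threshold):
    """Least-squares linear fit; returns estimated day on which threshold is reached."""
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(readings) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(days, readings))
             / sum((x - mean_x) ** 2 for x in days))
    intercept = mean_y - slope * mean_x
    if slope <= 0:
        return None  # no upward trend, no predicted crossing
    return (threshold - intercept) / slope

days = [0, 30, 60, 90, 120]                 # days since baseline survey
vibration = [0.08, 0.09, 0.11, 0.12, 0.14]  # overall velocity, in/s (hypothetical)
print(predict_days_to_threshold(days, vibration, threshold=0.30))  # roughly day 444
```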

