How to Minimize Human Error, Prevent Data Center Downtime

By David Boston  

Over the last 15 years, most building operators have come to recognize that people account for the majority of interruptions to critical operations. Human error is identified as the root cause in 60 percent to 80 percent of data center downtime events, year after year. Infrastructure system and component failures still merit attention, but today's rigorous design, construction, and commissioning practices generally provide an expectation of smooth equipment operation for 10 years or more. Assuming your facility has adequate systems, redundancy, and capacity, more attention should be focused on operating practices that minimize the potential for human error.

A majority of building owners fail to develop and implement effective operating strategies. This is alarming, given the industry's awareness that people present the greatest risk. People are also critical to successful building operations: they ensure regular maintenance is performed, fulfill customer requests, and respond to unexpected system incidents. It is the facility manager's job to provide them with the tools to be successful.

As a facility manager begins to implement (or enhance) the optimal facilities operations strategy, the first step is a precise delineation of responsibilities between departments. The next step is developing work rules unique to the facility and securing the required executive endorsement. Once staff size and structure effectively match operations goals, annual objectives can be set and ownership of systems, tasks, and processes can be assigned. With owners assigned and dedicated time provided for procedures and training programs, multi-month projects may be conducted to complete these objectives. Staff retention incentive plans may be developed at the same time as the procedures and training programs.

Here's how facility managers can incorporate each of these components into their critical facility operations strategy.

1. Clarity of task and process ownership. In most facilities, multiple departments are involved in delivering services to the organization's end customers. Those who operate and install computer hardware, those who manage networks, the security team, and the facilities group are all present in a typical data center. These groups often occupy their own designated spaces where some of their tasks and processes are performed. When these areas are physically separate and secured, it is generally understood which department is responsible for functions performed within, making written processes less critical.

Written processes are much more important in areas to which personnel from multiple departments have access. In a data center facility, the computer room is the most critical of these. Tasks performed there present the greatest risk of error, because multiple departments are involved and human activity is more frequent within the room.

To reduce the high potential for error when multiple groups work together in one space, it is necessary to develop written mutual expectations between the departments involved. Some organizations refer to these as internal service level agreements. The documents can be as short as one page, but they must be endorsed by each department head and consistently enforced. (See "Example of Internal Service Level Agreement," below.)

A significant level of detail is needed in establishing "ownership" of key functions such as power distribution and master planning (the location of computer hardware devices for optimal cooling and performance). Without it, interruptions to the operation may become common. Interruptions most often occur when someone who lacks knowledge of, training in, and experience with the proper procedure attempts to install or remove a computer device.

Internal Service Level Agreement

Facilities Operations Commitments:

  • Ownership of electrical power path up to the remote power panel connections (to whips) - only three individuals designated for this work
  • Single designee to share computer room master plan ownership with IT counterpart
  • Escorts provided for any facilities systems contractors working in building
  • Monthly updates to IT on load vs. capacity for each infrastructure system
  • Shared expense and capital budget planning
  • Thorough methods of procedure prepared, approved, and rehearsed in advance of scheduled maintenance activities that will involve risk or reduced redundancy
  • Incident reports provided to IT contacts within 4 hours of any near miss or downtime event, utilizing a consistent format (follow-up report issued as root cause is identified)
  • Updates every 30 minutes when an unexpected facilities event is in progress

Information Technology Commitments:

  • Ownership of all network connections and all power connections within server cabinets - only five individuals designated for this work
  • Single designee to share master plan ownership with Facilities counterpart
  • Escorts provided for any computer hardware and network contractors working in building
  • Weekly updates to Facilities Operations on contemplated computer hardware additions
  • Annual updates to Facilities Operations on computer hardware long term strategy
  • Shared expense and capital budget planning

_____________________________     __/__/__

Information Technology Manager     Date

_____________________________     __/__/__

Facilities Operations Manager          Date
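
The timing commitments in the sample agreement (an incident report within 4 hours, updates every 30 minutes during an event) are concrete enough to be tracked automatically. The following is a minimal sketch in Python of how a facilities team might compute those deadlines. The function and constant names are hypothetical illustrations, not part of any existing tool, and the sketch assumes the 30-minute clock starts when the event begins.

```python
from datetime import datetime, timedelta
from typing import List

# Intervals taken from the sample agreement above.
INCIDENT_REPORT_WINDOW = timedelta(hours=4)
STATUS_UPDATE_INTERVAL = timedelta(minutes=30)


def incident_report_due(event_time: datetime) -> datetime:
    """Latest time the initial incident report may reach the IT contacts."""
    return event_time + INCIDENT_REPORT_WINDOW


def report_was_on_time(event_time: datetime, report_time: datetime) -> bool:
    """True if the report was issued within the agreed 4-hour window."""
    return report_time <= incident_report_due(event_time)


def status_update_times(event_start: datetime, event_end: datetime) -> List[datetime]:
    """Times at which an update is owed while an event is in progress.

    Assumption: the first update is due 30 minutes after the event starts.
    """
    times = []
    next_update = event_start + STATUS_UPDATE_INTERVAL
    while next_update <= event_end:
        times.append(next_update)
        next_update += STATUS_UPDATE_INTERVAL
    return times
```

For an event that starts at 9:00 and ends at 10:45, `status_update_times` returns 9:30, 10:00, and 10:30, and `incident_report_due` returns 13:00. A check like this does not replace the signed agreement; it only makes missed commitments easier to spot.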


Posted on 8/6/2013
