The coronavirus pandemic elevated data centers from resource to utility. While the economy took a severe blow, the shutdown would have had significantly greater negative impacts were it not for our ability to carry on with work, learning, shopping, and entertainment online, from home. There is widespread agreement that remote work will continue for a long time and, in many cases, is here to stay.
For that reason, data centers will continue to serve as utilities. Just as assuredly as we expect the lights to come on when we flip the switch, we expect to be able to connect to online content, classrooms, and corporate networks on demand and without fail.
While data centers have become far more reliable, thanks to a heightened priority placed on uptime and evolving resiliency strategies, outages continue to occur. Most are minor, but with so much critical data hosted and stored in data centers, outages now come at a greater cost. The implications of an outage go beyond inconvenience and frustration to include heightened potential for security breaches, reputational damage, and massive revenue loss. Major outages at large companies can have significant consequences; in a few cases, losses have exceeded $100 million in value. The cost of outages keeps rising.
Respondents to Uptime Institute's 2020 Data Center Resiliency Survey reported heightened concern about service outages compared to the prior year. Only about 5 percent of the 529 respondents said they were less concerned; the vast majority said they are as concerned or more concerned about outages than in prior years.
Between increased demand and more valuable data housed in data centers, the pressure is on to make strides in improving uptime. So, what’s the best way to go about that? Where should the industry focus its efforts to improve uptime?
Focus on People and Processes
In Uptime Institute's 2020 annual survey, 75 percent of respondents said their most recent downtime could have been prevented with better management or processes. The Institute also asked participants about the most common root causes of human error-related IT outages at their data centers over the past three years. Reported root causes included service issues (27 percent), installation issues (26 percent), staffing issues (22 percent), and insufficient preventative maintenance (20 percent). Most of these trace back to poor leadership and to putting cost before reliability, which are more difficult matters to unravel and rebuild.
For those committed to improving reliability, these findings are good news: they give us a clear path for focusing efforts to improve performance.
Opportunity for Improvement
Errors occur more often when people perform highly repetitive tasks or work in monotonous environments that offer little variety to prompt mental stimulation and focus. Data centers are, by design, uniform: large halls filled with rows of racks that look much the same.
That consistency, standardization, and uniformity is something of a double-edged sword. On one hand, it is easy to mistake one piece of equipment for another, get lost in a sea of servers, and accidentally conduct maintenance on the wrong unit; greater differentiation in design, such as color-coding equipment and equipment rooms, can minimize confusion and lessen the opportunity for human error. On the other hand, standardized design and operation support repeatable processes that can be built into checklists to avoid downtime.
Improving Process, Preventing Error
A strong first step toward minimizing downtime caused by human error is as simple as a validation process or checklist. That may sound simplistic, but consider how critical checklists are to the military, the nuclear energy industry, surgery, aviation, and other high-stakes fields.
In our multi-tasking, distracted world, rigor is everything. The simple act of checking a box creates consistency and ensures that steps are followed and processes completed, eliminating room for error. This pays dividends in the data center environment.
In this technology-based industry, checklists live on handheld digital devices, ideally with two-step authentication and validation for critical steps on a given checklist. This structure goes a long way toward ensuring the efficacy and reliability of a host of data center operational processes.
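To make the idea concrete, here is a minimal sketch of how a digital checklist with second-person validation for critical steps might be modeled. The `Step` and `Checklist` classes and their field names are illustrative assumptions, not drawn from any specific product.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One checklist item; critical steps require independent sign-off."""
    description: str
    critical: bool = False
    done_by: Optional[str] = None
    verified_by: Optional[str] = None

    def complete(self, technician: str) -> None:
        self.done_by = technician

    def verify(self, verifier: str) -> None:
        # Enforce a second set of eyes: a critical step cannot be
        # verified by the same person who performed it.
        if self.critical and verifier == self.done_by:
            raise ValueError("critical step requires a second person to verify")
        self.verified_by = verifier


@dataclass
class Checklist:
    name: str
    steps: List[Step] = field(default_factory=list)

    def is_complete(self) -> bool:
        # Every step must be done; critical steps must also be verified.
        return all(
            s.done_by is not None and (not s.critical or s.verified_by is not None)
            for s in self.steps
        )
```

In practice, a usage flow might look like: one technician completes and records each step, a supervisor verifies the critical ones, and the checklist only reports complete once both conditions hold.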
Checklists can be developed for the full range of data center operations.
The best processes and tools only work if you have a team trained to use them. In addition to creating processes that circumvent human error and installing a system for verifying those steps, the facility must invest in training its staff. Because data centers are highly complex and interconnected, training programs and exercises across the different groups that support the facility are a must.
Checklists are, and will continue to be, a large part of ensuring preparedness. Recall Captain Chesley "Sully" Sullenberger, who landed an Airbus A320 on the Hudson River in New York City 12 years ago and saved 155 lives. Even in desperate conditions, he used a checklist to land that plane.
The humble checklist has prevented many disasters in high-risk industries and should be exploited in data centers to achieve maximum uptime. Even as technologies like machine learning and artificial intelligence gain prominence in operations, the data center staff will continue to play a large part in operating data centers and can maximize their effectiveness with a clear, well-documented list of processes, procedures, and priorities.
The data center business can borrow lessons from the military, medical, and aviation industries. By applying checklist rigor to the way we operate data centers, we can improve uptime for these increasingly integral assets.
Sudhir Kalra is Compass Datacenters' senior vice president of Global Operations. Prior to joining Compass, Kalra served as Executive Director, Global Head of Enterprise Data Centers for Morgan Stanley. Before Morgan Stanley, he was Director, Corporate Real Estate and Services – Global Head of Engineering and Critical Systems at Deutsche Bank, where he was responsible for mission-critical support of a real estate portfolio comprising more than 30 million square feet.