8 Steps to Reliability During the Pandemic
Facility managers responsible for data centers and other critical facilities are facing additional challenges during the coronavirus pandemic. Most operations and maintenance tasks have become more challenging. As an example, if critical equipment offers good onboard status reporting but limited remote capability, the facility manager could ordinarily rely on increased rounds, knowing that onsite staff will hear or see alarms, unusual sounds, or smells or readily experience the result of issues (such as a power bump). But with less than a full onsite staff, there is a greater risk of issues being missed.
The use of documentation is another example. Facility managers learn the interconnection of equipment very well over time and often don’t need to refer to manuals or construction drawings when they can see equipment, conduit and pipe runs, nameplates, posted flow diagrams, etc. But it’s a different story when a facility manager working remotely, without accurate as-built documentation, has to explain an issue to service providers.
Here are eight areas that facility managers responsible for data centers and other critical facilities should address to ensure reliability during the pandemic.
1. Managing changes
Implementing upgrades, changes, and consolidations may be more difficult with limited onsite access. But the chance to work with less disruption to personnel and processes that are normally onsite may be worth the extra effort. Certainly, planning and design can be performed now, largely from remote locations, especially by engineers, vendors, and contractors with prior site knowledge. Temporary work installed during drawn-out construction projects, however, may need to be built to last much longer than planned (months may become years). For critical facility design, emerging guidelines, including those by the International Data Center Authority (IDCA), should be consulted in addition to more established guidelines by the Telecommunications Industry Association (TIA) and the Uptime Institute.
2. Workarounds
It may pay big dividends to be proactive about reactive workarounds. Implement and test backup workarounds, including docking stations and ports for roll-up rental generators, chillers, etc. Temporary cooling solutions should be planned and rehearsed, including providing power (typically a generator but not UPS-supported), airflow intake and exhaust ducting, and condensate drainage. Agreements with local providers of emergency rental equipment should be in place for priority response. The ability to issue a purchase order or other commitment quickly is key. If nothing else, have discussions with vendors and get budget proposals to allow for a faster process when under the gun.
3. Documentation
Organizing and maintaining information, especially construction drawings, submittals, as-builts, operating procedures, maintenance logs, etc., is more important when facility managers don’t have the level of support from outsourced engineers, contractors, and service providers on which many have become overly dependent.
Facility managers should request documentation from their designers and service providers if they don’t already have it. Request all information in finished form as well as in native format (typically AutoCAD for construction drawings). When a CAD file is received, have someone proficient with CAD and with the proper software open it to confirm that it opens and includes the required sub-files (X-refs, etc.). Some outsourced providers may attempt to cling to this information. From their perspective, it represents job security and may provide an advantage over competitors. However, as a facility owner, if you paid for development of the documentation, it is yours. Demand it.
4. Remote monitoring
With a reduced onsite presence, remote access to critical support equipment status has become increasingly important. Newer equipment can push status, loading, and alarms to the building automation system and directly to PCs and smartphones when they are networked. Many devices provide too much information, which needs to be winnowed down to what is important. Getting remote status on older equipment can be more challenging; upgrades may be available, but it may be more cost effective to refresh the equipment at the early range of reliable life expectancy.
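As a rough illustration of winnowing down alarm traffic, the Python sketch below keeps only alarms at or above a chosen severity and drops repeats. The severity levels, the `Alarm` fields, and the equipment names are assumptions made for this example; actual equipment and building automation systems define their own.

```python
from dataclasses import dataclass

# Hypothetical severity ranking; real platforms define their own levels.
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


@dataclass
class Alarm:
    source: str    # illustrative equipment tag, e.g. "UPS-A"
    severity: str  # one of the SEVERITY_RANK keys
    message: str


def winnow(alarms, min_severity="warning"):
    """Keep alarms at or above min_severity, dropping repeated source/message pairs."""
    threshold = SEVERITY_RANK[min_severity]
    seen = set()
    kept = []
    for alarm in alarms:
        key = (alarm.source, alarm.message)
        if SEVERITY_RANK[alarm.severity] >= threshold and key not in seen:
            seen.add(key)
            kept.append(alarm)
    return kept


if __name__ == "__main__":
    sample = [
        Alarm("UPS-A", "info", "Self-test passed"),
        Alarm("CRAC-2", "warning", "Filter dirty"),
        Alarm("CRAC-2", "warning", "Filter dirty"),
        Alarm("ATS-1", "critical", "Normal source failed"),
    ]
    for alarm in winnow(sample):
        print(alarm.source, alarm.severity, alarm.message)
```

Where to set the threshold is a site decision; the point is only that the filtering rule is written down and applied consistently.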
Where equipment is not directly remotely monitored, be sure staff understands how to derive missing information from related equipment that is remotely monitored. For example, standby generator status may not be remotely available, but if the automatic transfer switch (ATS) reports remotely it will provide basic generator information: normal power failed, generator running, load connected to generator, etc. If neither the generator nor ATS is remotely monitored, some information about the status of each can be derived from cooling and uninterruptible power supply (UPS) equipment that is remotely monitored, which will issue alerts about input power failures and restoration. Another hint that the generator is running may be a UPS acting (and reporting) differently when on generator power than when on utility power; the supply frequency and voltage won’t be as rock-solid stable, although the voltage should settle very near the generator setpoint (e.g., 480 V), whereas utility voltage may swing 20 volts or more from day to night. All of this can be important because a generator may continue to run even after normal utility power is restored and stabilized, which could lead to running out of diesel fuel and other long-term failure risks.
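The inference described above can be sketched in Python. This is a minimal illustration, not a vendor interface: the three ATS flags, the function names, and the 0.1 Hz frequency-spread threshold are all assumptions chosen for the example, and any real threshold would need to be tuned against how a particular UPS reports on utility and on generator power.

```python
from statistics import pstdev
from typing import Sequence


def status_from_ats(normal_available: bool, generator_running: bool,
                    load_on_generator: bool) -> str:
    """Summarize power status from three hypothetical ATS status flags."""
    if load_on_generator:
        if normal_available:
            return "load on generator, utility restored: confirm retransfer"
        return "load on generator, utility failed"
    if generator_running:
        if normal_available:
            return "load on utility, generator still running: watch fuel"
        return "generator running, load not yet transferred"
    if not normal_available:
        return "utility failed, generator not running"
    return "load on utility, generator off"


def frequency_suggests_generator(freq_samples_hz: Sequence[float],
                                 stdev_threshold_hz: float = 0.1) -> bool:
    """Hint only: a wider spread in UPS input frequency may indicate generator power.

    The threshold is an assumed value, not a published figure.
    """
    return pstdev(freq_samples_hz) > stdev_threshold_hz


if __name__ == "__main__":
    print(status_from_ats(normal_available=True, generator_running=True,
                          load_on_generator=False))
    print(frequency_suggests_generator([59.6, 60.3, 59.8, 60.4]))
```

The "generator still running" case is the one called out above as a fuel risk, so it is worth surfacing explicitly instead of treating restored utility power as the end of the event.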
If you are going to rely more heavily on remote monitoring, find out whether there is enough bandwidth to allow for remote access and whether effective security protocols are in place. Ensure network security and vulnerability protocols are up to date, especially for the BAS. Details include firewall communications limitations, secure virtual private network, secure web-based external access process, multi-point access authorization, implementing credentialled access prior to connecting to the server, and providing company-issued computers with up-to-date protection against malicious software.
5. Operational silos
Many critical facilities struggle with facility management and IT groups operating in silos. Communication is often the challenge. As an example, facility managers are not happy when new IT equipment shows up needing space, power, and cooling, without warning or apparent planning. IT folks are not happy when unforeseen maintenance increases risk of power or cooling failure or requires actual shutdowns. When in-person meetings are not an option, it is even more important for facility management and IT groups with differing priorities to find effective ways to regularly communicate and share each other’s challenges.
6. Working remotely
Empowering staff to work from home requires additional IT systems and requisite critical facilities, whether on or off premises. Workers who largely use the phone, computer, text, chat, etc., are more likely to work from home now and in the future than before. Why pack a conference room or auditorium full of people when meetings can be very effective with physical distancing (even among people in other areas of the same workplace), using online meeting technology such as Teams, Webex, or Zoom, and, more recently, active cameras and effective use of mute functions? Traditional call “centers” may continue trending toward distributed workplaces, such as individual homes. These trends may well outlast the pandemic.
7. Data center strategy
Enterprises and institutions with on-premises data centers may accelerate colocation and cloud migration trends. It is difficult and costly enough in normal times for facilities to reliably operate in-house data centers alongside primary business functions. Challenges have increased during the pandemic. However, with almost every off-prem data migration there remains an onsite “leave behind” critical facility, be it a standard (less critical) main distribution frame or more production-oriented (more critical) server room approaching what is now sometimes called an edge data center. Now may be the time to engage consultants who can work largely from remote locations, and who work with company IT staff who also may need to remain remote, to develop or update data center short- and long-term operating budgets and strategies, as well as disaster recovery/business continuity posture.
8. Maintenance and parts
With onsite facility engineers limited, facility managers need to ensure that critical support equipment is properly maintained, that critical spare parts are onsite or readily available, and that emergency processes are in place internally and with trusted vendors. Local service providers may be preferred for some time, given increased air travel challenges for providers based out of the immediate area.
It’s important to understand the consequences of deferring maintenance on items such as batteries and filters, and of deferring activities such as IR thermal scanning, testing, and retro-commissioning. In normal times, the run-to-failure approach can save cost in the short term but add significant long-term cost and, more importantly, increase the risk of disruption down the road. It may be even more tempting during the pandemic to defer maintenance, but the risks of doing so are even greater.
The serviceable life of cooling equipment can be prolonged by changing air filters more frequently. This not only saves energy but also reduces stress on motors, extending life, reducing sound pressure, etc. Don’t linger with end-of-life UPS batteries or generator-engine-cranking batteries. Maintain spare parts onsite or confirm local or regional stocking with vendors. The pandemic has disrupted shipping and production logistics worldwide. Understand and plan for key parts that may be sourced overseas. In a data center with unusual equipment, instead of sourcing key replacement parts individually (sometimes at a 500 percent markup), it might make more sense to purchase and store a complete spare unit for its parts or for wholesale changeout.
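The parts-versus-spare-unit tradeoff comes down to simple arithmetic, sketched below in Python. All prices are made-up placeholders, the function names are hypothetical, and the sketch assumes that a 500 percent markup means the part sells for its base price plus five times that price (six times the base). It also ignores storage, warranty, and lead time, which may matter more than price.

```python
def price_with_markup(base_price: float, markup_percent: float) -> float:
    """Price after markup, assuming markup is a percentage added to the base price."""
    return base_price * (1 + markup_percent / 100)


def cheaper_option(part_base_prices, markup_percent, spare_unit_price):
    """Compare buying the listed parts at a markup against one complete spare unit."""
    parts_total = sum(price_with_markup(p, markup_percent) for p in part_base_prices)
    choice = "spare unit" if spare_unit_price < parts_total else "individual parts"
    return choice, parts_total


if __name__ == "__main__":
    # Placeholder figures for illustration only.
    choice, parts_total = cheaper_option([1000, 2500], 500, 15000)
    print(choice, parts_total)
```

With these placeholder figures, two marked-up parts already cost more than the complete unit; with a single inexpensive part, they would not.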
With some local jurisdictions and corporations holding to stay-at-home directives, service providers may find it difficult to execute an emergency response plan. Talk to vendors about emergency response — can someone open the shop to get a part on a Sunday morning? Establish an emergency process adhering to current pandemic protocols. Having an effective escalation process in place can help minimize unplanned downtime.
With most businesses operating during the pandemic relying heavily on data availability, data centers and other critical facilities must be kept up and running. Now may be a good time to take inventory of documentation, service provider capability, and escalation processes. Now might also be the time to study and plan critical facility changes, large and small. Maintaining current construction drawings, equipment maintenance records, and other documentation can help in an emergency. Remote access to critical and critical-support equipment allows for proactive or better reactive response when limited physical support is available. And remember that keeping up with good maintenance procedures can reduce the risk of unplanned downtime.
Michael Fluegeman (email@example.com), PE, is director of engineering and principal, critical facilities, for PlanNet, an independent professional services firm that provides advisory, design, project management, and construction services supporting critical IT infrastructure.
Antonio “Tommy” Tan III (firstname.lastname@example.org), PE, is a critical facilities mechanical engineer at PlanNet.