TIMS Tiered Infrastructure Maintenance Standards


A Framework for Developing and Evaluating Data Center Maintenance Programs
White Paper 178
Revision 0

by Bob Woolley

Executive summary

Inadequate maintenance and risk mitigation processes can quickly undermine a facility’s
design intent. It is, therefore, crucial to understand how to properly structure and
implement an operations and maintenance (O&M) program to achieve the expected level of
performance. This paper defines a framework, known as the Tiered Infrastructure
Maintenance Standard (TIMS), for aligning an existing or proposed maintenance program
with a facility’s operational and performance requirements. This framework helps make the
program easier to understand, communicate, and implement throughout the organization.

Contents

Introduction
Describing the framework
Evaluating a maintenance program
Interpreting the results
Conclusion
Resources
Appendix A: Structured Maintenance program checklist
Appendix B: Glossary

by Schneider Electric White Papers are now part of the Schneider Electric
white paper library produced by Schneider Electric’s Data Center Science Center
DCSC@Schneider-Electric.com

Introduction

Billions of dollars have been spent building highly redundant data center facilities in order to
deliver high availability IT solutions to an increasingly information-reliant world. These large
investments have produced a variety of sophisticated facility infrastructure designs that are
inherently reliable and progressively more energy efficient. However, no facility design,
regardless of how well planned and constructed, can withstand the disruption of a poorly
designed or implemented Operations and Maintenance (O&M) program. Inadequate
maintenance and risk mitigation processes can quickly undermine the facility design intent. It
is therefore crucial to understand how to properly structure and implement an O&M program
to achieve the level of performance for which the facility has been configured. This paper
defines a method for aligning the operational requirements of the business with maintenance
program standards that can be easily understood, communicated, and implemented through-
out the organization.

Tiered Infrastructure Maintenance Standard (TIMS) for Mission-Critical Environments
While it may be commonly understood that a well-organized O&M program is required to
achieve data center performance and efficiency goals, it can be very difficult for those who
are not maintenance professionals to understand what such a program looks like. The
inherent resiliency of a facility will often mask operational deficiencies that have the potential
to negatively impact data center availability, performance, and efficiency.

In response to this need, a simplified framework for classifying operations and maintenance
programs for mission critical facilities is presented in this paper. Called the Tiered Infrastruc-
ture Maintenance Standard (TIMS), this system provides a straightforward method for
evaluating the maturity of an O&M program (existing or proposed), gives an understanding of
the associated level of risk, and helps effectively communicate these concepts throughout the
organization. An understanding of TIMS facilitates the development of a maintenance
strategy that aligns with corporate data center performance goals, in a way that is transparent
to everyone involved in the operation, administration, and management of the mission-critical
environment.

Describing the framework

The framework comprises four maintenance service tiers or levels:

• TIMS-1: Run to Fail
• TIMS-2: Unstructured Maintenance
• TIMS-3: Structured Maintenance
• TIMS-4: Facilitated Maintenance

TIMS-1: Run to Fail


This level of service reflects the old adage, "if it isn’t broken, don’t fix it." Maintenance at this
level is reactive. When an equipment failure occurs, a maintenance technician is summoned
to perform the repair. Where system redundancy exists and is functional, there may be little
or no impact to the critical load for an isolated failure. However, the lack of a preventive
maintenance program will increase the likelihood of multiple concurrent failures, which can
take down even redundant systems.

Schneider Electric – Data Center Science Center White Paper 178 Rev 0 2

Operating at this level implies that the perceived cost of an outage is low compared to the
cost of preventative maintenance. Unfortunately, when budgets are tight, deferring mainte-
nance is often viewed as a way to cut cost. This is a risk calculation similar to forgoing
medical insurance because you are feeling healthy, which can have catastrophic results.
Statistically speaking, any perceived, short-term savings in maintenance costs will likely be
overshadowed by costly outages and expensive repairs over the long run.

In many cases, a lack of system redundancy becomes the justification for a run-to-fail
strategy, because maintenance cannot be performed without removing a portion of
the critical load from service. This could, for example, be the result of a single point of failure
in a switchboard, or a PDU feeding single-corded servers. Ironically, this approach
guarantees that an outage will occur when (not if) a system component fails, possibly for an
extended period of time.

TIMS-2: Unstructured Maintenance


TIMS-2 maintenance programs are characterized by the performance of basic preventative
maintenance only, with little organizing structure to regulate how the work is conducted, or to
evaluate its effectiveness. The fact that it is commonly performed by qualified manufacturer’s
service representatives or trusted in-house technical staff can create a false sense of
security. Even qualified personnel can make mistakes, or focus too intently on individual
components without considering the system as a whole. This type of program may deliver
adequate results in some environments, but does not meet the stringent requirements of
mission-critical data centers. Unfortunately, it is a common practice throughout the industry.

Simply following manufacturer’s recommendations is no guarantee that all necessary steps
are being taken to maximize availability of the critical load. If the maintenance program lacks
a detailed scope of work for each piece of equipment that factors in system interdependen-
cies, chances are that important steps are being neglected. If Methods of Procedure (MOPs)
are not employed for critical systems, there’s an elevated risk of human error occurring during
maintenance events, where even experienced technicians can become distracted and
operate the wrong valve or switch.

A common characteristic of Unstructured Maintenance is an over-reliance on individual effort.
It can be reassuring to rely on a trusted individual who has been providing maintenance
services for years. However, it creates a high degree of risk when an organization’s facility
maintenance knowledge resides inside the heads of individual technicians, who are
susceptible to making mistakes no matter how experienced, and who may leave at any time and take
that critical information with them.

Another indicator of Unstructured Maintenance is a training program that consists almost
exclusively of shadowing more experienced workers for a period of time, after which trainees are
allowed to perform a wide variety of work without certification, testing, or formal training.
Unstructured, under-documented maintenance programs create an environment in which
maintenance can be somewhat haphazard, and the risk of human error is elevated.

TIMS-3: Structured Maintenance


The goal of Structured Maintenance is to maximize uptime by eliminating guesswork and
minimizing human error. This requires a degree of discipline and experience to execute
properly. Every part of the maintenance process is closely evaluated. Policies are estab-
lished to exert control over how and which information is gathered, acted upon, and recorded.
Programs are created to identify, train, supervise, and evaluate qualified personnel. Proce-
dures are developed to precisely manage how and when work is performed.


Structured Maintenance utilizes best practices from every facet of the O&M environment and
integrates them into a program that is more than the sum of its parts. The goal is to system-
atically eliminate variables that can introduce errors. Maintenance activities at this level are
extremely proactive, controlled, and documented.

Characteristics of Structured Maintenance include a formal staff training program; a
document library that includes a scope of service and Standard Operating Procedures (SOPs) for
all site equipment; a change management program that utilizes detailed Methods of
Procedure (MOPs) for all maintenance activities; a robust emergency preparedness and response
program; a quality management system; and specialized support systems such as an
Electronic Document Management System (EDMS).

Note that it is not necessary to have a facility with a high availability tier rating to be able to
adopt a Structured Maintenance program. Structured Maintenance will enhance the
performance of any facility design, as long as the program is fully implemented. In situations
where concurrent maintenance is not possible, controlled shutdowns may be required, but this
is vastly preferable to an unplanned, uncontrolled shutdown that is preventable.

TIMS-4: Facilitated Maintenance


Facilitated Maintenance is the highest level of maintenance service. It combines a Structured
Maintenance program with a data center design that facilitates concurrent maintenance by
providing multiple power and cooling distribution paths with redundant components (i.e., Tier
III or above). Such a design allows individual pieces of equipment to be isolated and
maintained without a disruption in service. Another important component is a Building
Automation System (BAS) and/or Data Center Infrastructure Management (DCIM) system,
which continually monitors the critical infrastructure, trends equipment performance, alerts
operators when conditions fall outside preset parameters, and allows automated control of
equipment sequencing. Finally, the use of a Computerized Maintenance Management
System (CMMS) enables the efficient scheduling of maintenance events, as well as the
analysis and management of maintenance effectiveness.

When Structured Maintenance is performed in this environment, the highest possible level of
reliability is achieved in the following ways:

• The ability to easily isolate redundant system components for comprehensive testing
and maintenance greatly increases reliability while minimizing the risk of downtime.
• Automated systems take some of the risk of human error out of the equation, and can
respond more quickly and accurately to sudden changes.
• Continuous monitoring of the critical systems and the ability to trend equipment operat-
ing parameters facilitates predictive and condition-based maintenance practices.
• Systems for managing asset and maintenance data provide tools for optimizing
maintenance planning and reporting key metrics used to track and improve equipment
reliability.

Evaluating a maintenance program

Having established the TIMS framework above, let’s take a look at how it can be used to
quickly and reliably evaluate the level of maintenance for an existing or proposed O&M
program. Below is a list of tools and assets to refer to when doing this evaluation:

• Maintenance records
  o Asset database/list
  o Annual maintenance schedule
  o Maintenance records for the previous year
  o Scopes of service (maintenance frequency and work description) for critical
    equipment
  o Scheduled maintenance service contracts
• Operational procedures
  o Emergency operating procedures (EOP)
  o Standard operating procedures (SOP)
  o Methods of procedure (MOP), also known as maintenance procedures (MP)
• Operational processes
  o Walk-through checklist
  o Shift turnover log
  o Change management process
• Training program
  o Training program description
  o Training materials
  o Personnel training records
• Support systems
  o Building management/automation system (BMS/BAS)
  o Data center infrastructure management (DCIM) system
  o Electrical power monitoring system (EPMS)
  o Computerized maintenance management system (CMMS)

With these items in mind, consider whether these tools and documents exist or not:

• Accurate and comprehensive database or listing of critical assets
• Published annual maintenance schedule covering all assets
• Service records that correspond to each scheduled maintenance for the past year

If they don’t exist or it is unknown whether they do or not, then the facility is likely operating
or going to be operating in a “run-to-fail” (i.e., TIMS-1) mode.

If those tools exist and are in active use by the organization, then the next set of items should
be carefully considered:

• Each equipment type has a documented scope of service that defines the maintenance
frequency and details the required work activities
• This information is used to create a detailed method of procedure (“MOP”, a.k.a.,
maintenance procedure) that is used to oversee each maintenance event
• Emergency procedures are developed to script emergency response activities for
probable/consequential system failures
• Drills are regularly performed to practice responding to these scenarios
• Documented checklist for facility walkthroughs
• Log used by the engineering staff to communicate across shifts
• Documented change management process that is followed during equipment installa-
tion and maintenance
• Documented training program that covers all of the site systems along with written
evaluations and annual re-certification processes


If one or more of these items are missing, then the data center may be operating in an
unstructured (TIMS-2) environment.

If all of these exist and are actively being used, then the data center is most likely operating
with a structured (TIMS-3) maintenance program. Even better, if all of the systems are
concurrently maintainable, there is a functioning BAS/DCIM system with an EPMS capability,
and a CMMS exists to facilitate maintenance, then the facility is operating at the highest
(TIMS-4) level from a maintenance and operations standpoint.
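
The decision sequence described above can be sketched in a few lines of Python. This is only an illustration of the screening logic, not part of the TIMS definition: the `Evidence` class, its field names, and the `tims_level` function are hypothetical, and anything unknown is treated the same as missing, as the text suggests for the first gate.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """What an assessor has actually observed (unknown counts as False)."""
    # Basic records: all three are needed to rise above TIMS-1.
    asset_list: bool = False
    annual_schedule: bool = False
    service_records: bool = False
    # Structured-program items: all are needed to reach TIMS-3.
    scopes_of_service: bool = False
    mops: bool = False
    emergency_procedures: bool = False
    drills: bool = False
    walkthrough_checklist: bool = False
    shift_log: bool = False
    change_management: bool = False
    training_program: bool = False
    # Facility and support-system items: all are needed to reach TIMS-4.
    concurrently_maintainable: bool = False
    bas_dcim_with_epms: bool = False
    cmms: bool = False


def tims_level(e: Evidence) -> int:
    """Return the TIMS tier (1 to 4) suggested by the observed evidence."""
    if not (e.asset_list and e.annual_schedule and e.service_records):
        return 1  # run to fail
    structured = (
        e.scopes_of_service, e.mops, e.emergency_procedures, e.drills,
        e.walkthrough_checklist, e.shift_log, e.change_management,
        e.training_program,
    )
    if not all(structured):
        return 2  # unstructured maintenance
    if e.concurrently_maintainable and e.bas_dcim_with_epms and e.cmms:
        return 4  # facilitated maintenance
    return 3  # structured maintenance


# Example: records exist, but none of the structured-program items do.
basics = dict(asset_list=True, annual_schedule=True, service_records=True)
print(tims_level(Evidence(**basics)))  # 2
```

A result from a screen like this is a quick indicator only; the paper's wording ("likely", "may be") still applies to whatever level it returns.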

Appendix A at the end of this paper contains a more detailed checklist that can be used to
identify the elements of a Structured Maintenance program. While not an exhaustive list, you
can use this to perform a quick self-assessment to determine if your maintenance program
meets the TIMS-3 criteria. Note that each item on the list must be actually observed, not
simply reported to be in place. Being “observed” should mean that processes, programs, and
procedures are all documented and in active use, and not just reported as “understood” or
“occurring”.
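
As a rough illustration of the "observed" rule, the sketch below counts an item as observed only when it is both documented and in active use. The `ChecklistItem` class, its fields, and the sample entries are hypothetical; only the item names are taken from the Appendix A checklist.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChecklistItem:
    category: str
    name: str
    documented: bool      # a written process, program, or procedure exists
    in_active_use: bool   # the assessor saw it being used, not just reported

    @property
    def observed(self) -> bool:
        # Both conditions must hold; "understood" or "occurring" is not enough.
        return self.documented and self.in_active_use


def not_observed(items: List[ChecklistItem]) -> List[str]:
    """List the checklist items that fail the 'observed' test."""
    return [f"{i.category}: {i.name}" for i in items if not i.observed]


# Hypothetical assessment results for three checklist items.
items = [
    ChecklistItem("CHANGE MANAGEMENT", "Methods of Procedure", True, True),
    ChecklistItem("TRAINING", "Certification Program", True, False),
    ChecklistItem("SAFETY", "Lockout/Tagout Program", False, False),
]
print(not_observed(items))
# ['TRAINING: Certification Program', 'SAFETY: Lockout/Tagout Program']
```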

Interpreting the results

While it isn’t possible to provide a scoring system that works for every circumstance, it’s safe
to say that if you are missing more than one or two elements, your program has not yet met
the overall criteria for designation as TIMS-3 Structured Maintenance. In practice, few
maintenance programs fall neatly into a single category as described in the preceding
sections. More often, a program will exhibit elements of two or more maintenance tiers.
For example, a program might embrace Structured Maintenance on the electrical
systems, but exhibit Unstructured Maintenance methods on the HVAC plant by not utilizing
MOPs or good change management practices. Another example would be a facility that
incorporates Structured Maintenance methods across the board, but has a single switchboard
that cannot be maintained without interrupting electrical service, and is operated in “run-to-
fail” mode due to the inability to schedule a maintenance window with end-users. In cases
such as these, the weakest link principle applies: overall service level is only as high as the
lowest level of maintenance being performed in any critical area of the facility.
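
The weakest link principle amounts to taking the minimum level across the critical areas. A minimal sketch, assuming each area has already been assigned a TIMS level from 1 to 4; the function name and the area names are hypothetical:

```python
from typing import Dict, List, Tuple


def overall_tims_level(area_levels: Dict[str, int]) -> Tuple[int, List[str]]:
    """Return the overall TIMS level and the areas that hold it down."""
    if not area_levels:
        raise ValueError("at least one critical area is required")
    for area, level in area_levels.items():
        if level not in (1, 2, 3, 4):
            raise ValueError(f"{area}: TIMS level must be 1 to 4, got {level}")
    # The overall level is only as high as the lowest level in any area.
    lowest = min(area_levels.values())
    weakest = sorted(a for a, lvl in area_levels.items() if lvl == lowest)
    return lowest, weakest


# Example: structured electrical maintenance, unstructured HVAC,
# and one switchboard that is effectively run to fail.
site = {"electrical": 3, "hvac": 2, "switchboard A": 1}
level, weakest = overall_tims_level(site)
print(level, weakest)  # 1 ['switchboard A']
```

Returning the weakest areas alongside the level shows where improvement effort would raise the overall rating first.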

The evaluation process described above will provide a quick indicator of an O&M program’s
alignment with industry best practices for mission-critical facilities. Due to the complex nature
of these programs, it may be necessary to perform a more thorough analysis to develop a
complete understanding of their strengths and weaknesses. Such an audit should be
performed by a mission-critical facility specialist, either as a stand-alone service or as part of
a comprehensive facility assessment. Independent audits of the O&M program are in
themselves a best practice for ensuring program effectiveness and will pay for themselves
with increased reliability, uptime, and efficiency.

Optimizing your maintenance program


If you discover as a result of this exercise that your maintenance program is not in alignment
with your business objectives, immediate action should be taken. This doesn’t mean that
every data center needs a TIMS-4 or even a TIMS-3 program. For instance, organizations
that deploy multiple Tier II facilities designed to maintain operations if one site fails
may not require that level of effort to meet their business objectives. On the other hand, just
because you have a Tier II facility doesn’t mean that you shouldn’t be running a TIMS-3
program. It’s all about the criticality of the business your data center supports. As a rule of
thumb, however, if you have made the investment for a Tier III or Tier IV facility, you should
be protecting that investment with a TIMS-3 or better program.


If you are looking to increase the reliability of your facility, whatever its Tier rating, applying
TIMS-3 principles to an existing infrastructure will minimize risk and maximize your bottom
line. It could even be argued that a Tier II facility operating at TIMS-3 can be more reliable
than a Tier III facility operating at TIMS-2, given the likelihood of a higher incidence of human
error in the latter example.

Considerations
In preparing to undertake the establishment of an effective O&M program as defined by
TIMS, the following items must be considered:

1. Scope: What specific actions need to be taken to achieve the desired TIMS tier?
2. Budget: Does your budget allow you to meet your chosen goals?
3. Skills: Do you have the internal skills to manage and perform the activities required?
4. Impact: What is the impact on your business operation to implement the plan, and
what are the risks?

Conclusion

An organization’s cost of downtime and risk-tolerance level must first be established in order
to determine which TIMS level best matches its goals. This knowledge is a prerequisite for
the development of a realistic maintenance program. Ultimately, the TIMS level achieved will
be determined by resource availability and the commitment of the organization’s manage-
ment team to implement and maintain the program over the long term.

When evaluating the entire scope of the mission-critical enterprise, the effectiveness of the
maintenance program is one of the key components that must be factored in to determine the
true level of sustained reliability. The tremendous variability in how maintenance is imple-
mented can make it difficult to judge what constitutes the proper level of service in a given
situation. Defining maintenance levels and using them to evaluate a given O&M program as
described in this paper is a tool for achieving such an understanding.

The Tiered Infrastructure Maintenance Standard offers a systematic approach to matching
maintenance activity levels with the level of reliability expected of the facility. Applying these
principles to your maintenance program is a crucial step in attaining data center availability
and business continuity goals.

About the author


Bob Woolley has been involved in critical facilities management for over 20 years. Bob served
as Senior Vice President Critical Environment Services at Lee Technologies and Vice President
of Data Center Operations for Navisite, as well as Vice President of Engineering for
COLO.COM. He was also a Regional Manager for the Securities Industry Automation Corpora-
tion (SIAC) telecommunications division and operated his own critical facilities consulting
practice. Mr. Woolley has extensive experience in building technical service programs and
developing operations programs for mission critical operations in both the telecommunications
and data center environments.


Resources

Browse all white papers
whitepapers.apc.com

Browse all TradeOff Tools™
tools.apc.com

Contact us

For feedback and comments about the content of this white paper:

Data Center Science Center
DCSC@Schneider-Electric.com

If you are a customer and have questions specific to your data center project:

Contact your Schneider Electric representative at
www.apc.com/support/contact/index.cfm


Appendix A: Structured Maintenance program checklist

CATEGORY / ITEM (mark each item OBSERVED or NOT OBSERVED)

SAFETY
  Workplace Safety Program
  Hazard Analysis performed on all work procedures
  Lockout/Tagout Program
  PPE Inventory and Testing Records
  Hazardous Materials Labeling
  Hazard Communications Program

SECURITY
  Vendor Access Control
  Key Control Program
  Vendor Personnel Orientation

EMERGENCY PREPAREDNESS AND RESPONSE
  Emergency Operating Procedures
  Emergency Drills
  Escalation Procedures
  Crisis Management Plan
  Incident Logging and Reporting
  Failure Analysis Program

MAINTENANCE PROGRAM
  Comprehensive Asset Management Database
  Scopes of Service for Critical Equipment
  Preventative and Predictive Maintenance Standards
  Annual Maintenance Schedule
  Spare Parts Inventory and Management Plan
  Subcontractor Selection Process
  Test Equipment Calibration Records

CHANGE MANAGEMENT
  Risk Analysis and Communication
  Change Control Process
  Notification and Alerting
  Quality System
  Methods of Procedure
  Drawing Update Process

PERFORMANCE
  Service Level Agreements
  Key Performance Metrics
  Performance Measurement and Reporting Guidelines

EFFICIENCY
  Performance Benchmarking
  Airflow Management Procedures
  Energy Efficiency Measurement and Reporting
  Systems Optimization Procedures
  Continuous Improvement Program

DOCUMENTATION
  Document Management Program
  Accurate Drawings
  Critical Facility Work Rules
  Facility Walk-Through Checklist
  Standard Operating Procedures
  Administrative Procedures
  Shift Turnover Procedures and Log

OPERATIONS MANAGEMENT
  Services Scope Description
  Staff Roles and Responsibilities
  Vendor Management Procedures
  Materials and Tool Inventory

TRAINING
  Training Requirements
  Qualification Standards
  Certification Program
  Individual Training Records
  Lessons Learned/Near-Miss Program
  Ongoing Education Program

OPERATIONAL SUPPORT SYSTEMS
  Work Order Management System
  Electronic Document Management System

REPORTING
  Weekly Report
  Monthly Report
  Quarterly Performance Report
  Project Report Template


Appendix B: Glossary

• Asset Database: a comprehensive list of the facility systems and equipment, including
make, model, serial number, capacity, location, system ID, and warranty information.
This is often part of a CMMS (see below).
• Building Management System (BMS): a system designed and implemented to con-
trol and monitor the functions of a building and its associated plant.
• Scopes of Service: a detailed listing of all the maintenance activities required for a
specific piece of equipment and the frequency of each activity. This list usually in-
cludes the manufacturer's suggested maintenance, but may also take into account the
equipment history, experience of the service personnel and special application re-
quirements.
• CMMS: computerized maintenance management systems (CMMS) are software appli-
cations that schedule, track, and monitor maintenance activities and provide cost, per-
sonnel, and other reporting data and history.
• DCIM: data center infrastructure management (DCIM) systems collect and manage
data about a data center’s assets, resource use, and operational status throughout the
data center lifecycle. This information is then distributed, integrated, analyzed, and
applied in ways that help managers meet business and service-oriented goals and
optimize the data center’s performance.
• EDMS: electronic document management systems (EDMS) are software applications
that scan, store, and retrieve documents that are used by an organization. In the
facilities environment, these documents are typically facility drawings, Operations and
Maintenance manuals, maintenance contracts, MOPs, SOPs, service reports, etc.
• EOP: an emergency operating procedure (EOP) is a detailed procedure for an emergency
event that is high in probability, consequence, or both. It is prepared in advance
in order to limit the severity and duration of the event. Such procedures are often
rehearsed in drills that combine one or more EOPs, mimicking the behavior of actual
emergency scenarios where multiple failures may occur.
• Manufacturer's Recommended Service: preventative maintenance activities for
specific pieces of equipment as set forth in the manufacturer's Operations & Mainte-
nance instructions.
• MOP: A method of procedure (MOP) is a detailed work document that is utilized to
perform maintenance on critical systems. The MOP specifies what equipment is being
worked on, who will be performing the procedure, what tools and safety procedures are
necessary, describes the risk, lists the step-by-step procedure, identifies backout pro-
cedures and escalation protocols, contains authorization signatures, and records
maintenance data.
• Onsite Facilities Staff: dedicated on-site facilities staff that focuses on the site critical
systems. This group performs daily walkthroughs, manages vendors, and performs
some level of self-performed service. The facilities staff is responsible for creating and
maintaining all of the site documentation, including MOPs, SOPs, and emergency
procedures. This staff may or may not be providing 24x7 coverage, depending on the
level of service required.
• New Component Testing: pre-testing of components prior to installation in critical
systems. This testing can be performed on-site when possible, but may need to be
done at the factory with appropriate documentation provided.
• PPE: personal protective equipment
• Predictive Maintenance: maintenance activity that's designed to identify precursor
indications to equipment wear or failure. Early warning provided by predictive mainte-
nance can be used to budget and plan maintenance activities in advance of the need to
actually perform the service. This increases efficiency and reduces the risk of un-
planned outages.
• Quality System: an organization’s arrangements and resources for meeting quality
objectives. It is used to ensure the expected outcome of a particular service activity
and to reduce the risk of service related failures. This includes utilizing a MOP review
process, pre-testing components prior to installation, quality checking the finished work,
and performing periodic program audits.
• Record Drawings: up to date architectural, electrical, mechanical, and equipment
layout drawings that accurately reflect the facility as it was actually built, plus any adds,
moves, or changes that have occurred up to the present day.
• SOP: a standard operating procedure (SOP) is a document that is used to describe
specific steps to be taken to implement a well understood and defined process. An ex-
ample would be putting a UPS into bypass or putting a fire system in test mode.
• Tier Rating System: a rating system developed by the Uptime Institute to classify
facility infrastructure reliability in four levels or tiers, from lowest (Tier I) to highest (Tier
IV).
• Training Program: a formal and comprehensive staff training program that defines
various levels of qualification along with a rigorous testing and certification process.
This is used in conjunction with a matrix that identifies specific maintenance tasks and
what the qualification levels are for performing them.
• Vendor Management Program: a systematic program of vendor identification, selec-
tion, management, and evaluation. The purpose is to find competent vendors, docu-
ment their qualifications, clearly specify the scope of their activities, obtain competitive
pricing, monitor their performance, and provide feedback.
• Walk-through Check List: a detailed list of critical systems and facility infrastructure
equipment, containing fields for inputting data (such as voltage, temperature, and pres-
sure) or status checks. This list is used to perform periodic walk-throughs of the facility
to monitor status and create a written record of critical system settings and values.
