TIMS Tiered Infrastructure Maintenance Standards
by Bob Woolley
Executive summary
Inadequate maintenance and risk mitigation processes can quickly undermine a facility’s
design intent. It is, therefore, crucial to understand how to properly structure and implement
an operations and maintenance (O&M) program to achieve the expected level of
performance. This paper defines a framework, known as the Tiered Infrastructure
Maintenance Standard (TIMS), for aligning an existing or proposed maintenance program
with a facility’s operational and performance requirements. This framework helps make the
program easier to understand, communicate, and implement throughout the organization.
Contents
Introduction
Describing the framework
Evaluating a maintenance program
Interpreting the results
Conclusion
Resources
Appendix A: Structured Maintenance program checklist
Appendix B: Glossary
by Schneider Electric White Papers are now part of the Schneider Electric
white paper library produced by Schneider Electric’s Data Center Science Center
DCSC@Schneider-Electric.com
A Framework for Developing and Evaluating Data Center Maintenance Programs
Introduction
Billions of dollars have been spent building highly redundant data center facilities in order to
deliver high availability IT solutions to an increasingly information-reliant world. These large
investments have produced a variety of sophisticated facility infrastructure designs that are
inherently reliable and progressively more energy efficient. However, no facility design,
regardless of how well planned and constructed, can withstand the disruption of a poorly
designed or implemented Operations and Maintenance (O&M) program. Inadequate
maintenance and risk mitigation processes can quickly undermine the facility design intent. It
is therefore crucial to understand how to properly structure and implement an O&M program
to achieve the level of performance for which the facility has been configured. This paper
defines a method for aligning the operational requirements of the business with maintenance
program standards that can be easily understood, communicated, and implemented throughout
the organization.
In response to this need, a simplified framework for classifying operations and maintenance
programs for mission critical facilities is presented in this paper. Called the Tiered Infrastructure
Maintenance Standard (TIMS), this system provides a straightforward method for
evaluating the maturity of an O&M program (existing or proposed), gives an understanding of
the associated level of risk, and helps effectively communicate these concepts throughout the
organization. An understanding of TIMS facilitates the development of a maintenance
strategy that aligns with corporate data center performance goals, in a way that is transparent
to everyone involved in the operation, administration, and management of the mission-critical
environment.
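Describing the framework
The framework comprises four maintenance service tiers or levels: TIMS-1 (run-to-fail),
TIMS-2 (unstructured maintenance), TIMS-3 (structured maintenance), and TIMS-4 (the
highest level).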
Schneider Electric – Data Center Science Center White Paper 178 Rev 0 2
Operating at this level implies that the perceived cost of an outage is low compared to the
cost of preventative maintenance. Unfortunately, when budgets are tight, deferring maintenance
is often viewed as a way to cut costs. This is a risk calculation similar to forgoing
medical insurance because you are feeling healthy, which can have catastrophic results.
Statistically speaking, any perceived, short-term savings in maintenance costs will likely be
overshadowed by costly outages and expensive repairs over the long run.
In many cases, a lack of system redundancy is the justification for a run-to-fail strategy,
because maintenance cannot be performed without removing a portion of the critical load
from service. This could be the result of, for example, a single point of failure in a
switchboard, or a PDU feeding single-corded servers. Ironically, this approach guarantees
that an outage will occur when (not if) a system component fails, possibly for an extended
period of time.
Structured Maintenance utilizes best practices from every facet of the O&M environment and
integrates them into a program that is more than the sum of its parts. The goal is to systematically
eliminate variables that can introduce errors. Maintenance activities at this level are
extremely proactive, controlled, and documented.
Note that a facility does not need a high availability tier rating to adopt a Structured
Maintenance program. Structured Maintenance will enhance the performance of any facility
design, as long as the program is fully implemented. In situations where concurrent
maintenance is not possible, controlled shutdowns may be required, but this is vastly
preferable to an unplanned, uncontrolled shutdown that is preventable.
When Structured Maintenance is performed in this environment, the highest possible level of
reliability is achieved in the following ways:
• The ability to easily isolate redundant system components for comprehensive testing
and maintenance greatly increases reliability while minimizing the risk of downtime.
• Automated systems take some of the risk of human error out of the equation, and can
respond more quickly and accurately to sudden changes.
• Continuous monitoring of the critical systems and the ability to trend equipment operating
parameters facilitate predictive and condition-based maintenance practices.
• Systems for managing asset and maintenance data provide tools for optimizing
maintenance planning and reporting key metrics used to track and improve equipment
reliability.
Evaluating a maintenance program
Having established the TIMS framework above, let’s take a look at how it can be used to
quickly and reliably evaluate the level of maintenance for an existing or proposed O&M
program. Below is a list of tools and assets to refer to when doing this evaluation:
• Maintenance records
o Asset database/list
o Annual maintenance schedule
With these items in mind, consider whether these tools and documents exist.
If they do not exist, or it is unknown whether they do, then the facility is likely operating
(or will be operating) in a “run-to-fail” (i.e., TIMS-1) mode.
If those tools exist and are in active use by the organization, then the next set of items should
be carefully considered:
• Each equipment type has a documented scope of service that defines the maintenance
frequency and details the required work activities
• This information is used to create a detailed method of procedure (“MOP”, a.k.a.,
maintenance procedure) that is used to oversee each maintenance event
• Emergency procedures are developed to script emergency response activities for
probable/consequential system failures
• Drills are regularly performed to practice responding to these scenarios
• Documented checklist for facility walkthroughs
• Log used by the engineering staff to communicate across shifts
• Documented change management process that is followed during equipment installation
and maintenance
• Documented training program that covers all of the site systems along with written
evaluations and annual re-certification processes
If one or more of these items are missing, then the data center may be operating in an
unstructured (TIMS-2) environment.
If all of these exist and are actively being used, then the data center is most likely operating
with a structured (TIMS-3) maintenance program. Even better, if all of the systems are
concurrently maintainable, there is a functioning BAS/DCIM system with an EPMS capability,
and a CMMS exists to facilitate maintenance, then the facility is operating at the highest
(TIMS-4) level from a maintenance and operations standpoint.
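The sequence of questions above can be read as a short decision procedure. The sketch below is one possible way to encode it in Python. The `Evaluation` fields and the `classify_tims` function are illustrative names that do not appear in this paper, and the result should be treated as a quick indicator rather than a formal rating. An unknown answer is recorded as `False`, following the rule that tools of unknown status are treated the same as missing ones.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    """Answers from a quick TIMS evaluation (field names are hypothetical)."""
    basic_tools_in_use: bool         # maintenance records, asset list, annual schedule
    structured_items_in_use: int     # structured-maintenance items found in active use
    structured_items_total: int      # structured-maintenance items assessed
    concurrently_maintainable: bool  # all systems can be serviced without dropping load
    has_bas_dcim_with_epms: bool     # functioning BAS/DCIM system with EPMS capability
    has_cmms: bool                   # CMMS exists to facilitate maintenance


def classify_tims(ev: Evaluation) -> int:
    """Return the indicated TIMS tier (1 to 4), asking the questions in the order given in the text."""
    if not ev.basic_tools_in_use:
        return 1  # run-to-fail
    if ev.structured_items_in_use < ev.structured_items_total:
        return 2  # unstructured: one or more items missing
    if ev.concurrently_maintainable and ev.has_bas_dcim_with_epms and ev.has_cmms:
        return 4  # highest level
    return 3  # structured


# Example: every structured item is in use, but one system cannot be
# maintained concurrently, so the program is indicated as TIMS-3.
example = Evaluation(True, 8, 8, False, True, True)
print(classify_tims(example))  # prints 3
```

Because TIMS-4 is only checked after the TIMS-3 conditions are met, monitoring and management tools alone never raise a program above the level its procedures support.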
Appendix A at the end of this paper contains a more detailed checklist that can be used to
identify the elements of a Structured Maintenance program. While not an exhaustive list, you
can use this to perform a quick self-assessment to determine if your maintenance program
meets the TIMS-3 criteria. Note that each item on the list must be actually observed, not
simply reported to be in place. Being “observed” should mean that processes, programs, and
procedures are all documented and in active use, and not just reported as “understood” or
“occurring”.
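The distinction between an item that is observed and one that is merely reported can be made explicit when recording a self-assessment. The following is a minimal sketch, assuming a simple mapping from checklist item to status; the `Status` values and the `unmet_items` helper are hypothetical names chosen for illustration, and the sample entries are taken from the Appendix A item names.

```python
from enum import Enum


class Status(Enum):
    OBSERVED = "observed"  # documented and seen in active use by the assessor
    REPORTED = "reported"  # said to be in place, but not verified
    MISSING = "missing"    # not in place


def unmet_items(checklist: dict[str, Status]) -> list[str]:
    """Return the items that do not count: anything not directly observed."""
    return sorted(
        item for item, status in checklist.items()
        if status is not Status.OBSERVED
    )


# Sample assessment: a reported-only item is treated the same as a missing one.
assessment = {
    "Facility Walk-Through Checklist": Status.OBSERVED,
    "Shift Turnover Procedures and Log": Status.REPORTED,
    "Certification Program": Status.MISSING,
}
print(unmet_items(assessment))
```

Keeping a separate "reported" status preserves the information for follow-up while still excluding the item from the observed count.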
Interpreting the results
While it isn’t possible to provide a scoring system that works for every circumstance, it’s safe
to say that if you are missing more than one or two elements, your program has not yet met
the overall criteria for designation as TIMS-3 Structured Maintenance. In practice, few
maintenance programs fall neatly into a single category as described in the preceding
sections. More often, a program will exhibit elements of two or more maintenance tiers. For
example, a program might embrace Structured Maintenance on the electrical systems, but
exhibit Unstructured Maintenance methods on the HVAC plant by not utilizing MOPs or good
change management practices. Another example would be a facility that incorporates
Structured Maintenance methods across the board, but has a single switchboard that cannot
be maintained without interrupting electrical service, and is operated in “run-to-fail” mode
due to the inability to schedule a maintenance window with end-users. In cases such as
these, the weakest link principle applies: overall service level is only as high as the lowest
level of maintenance being performed in any critical area of the facility.
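The weakest link principle amounts to taking the minimum tier across all critical areas. A minimal sketch follows, assuming each critical area has already been assigned a TIMS level from 1 to 4; the function names and the area labels are illustrative and not part of the framework itself.

```python
def overall_tims_level(area_levels: dict[str, int]) -> int:
    """Weakest-link rule: the facility rates no higher than its lowest-rated critical area."""
    if not area_levels:
        raise ValueError("at least one critical area is required")
    return min(area_levels.values())


def weakest_areas(area_levels: dict[str, int]) -> list[str]:
    """Return the areas that hold the overall level down, in alphabetical order."""
    lowest = overall_tims_level(area_levels)
    return sorted(area for area, level in area_levels.items() if level == lowest)


# First example from the text: structured electrical, unstructured HVAC.
mixed = {"electrical": 3, "hvac": 2}
print(overall_tims_level(mixed))  # prints 2

# Second example from the text: one switchboard operated in run-to-fail mode.
switchboard_case = {"electrical": 3, "hvac": 3, "switchboard": 1}
print(overall_tims_level(switchboard_case))  # prints 1
print(weakest_areas(switchboard_case))       # prints ['switchboard']
```

Listing the weakest areas alongside the overall level shows where improvement effort would raise the facility's rating first.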
The evaluation process described above will provide a quick indicator of an O&M program’s
alignment with industry best practices for mission-critical facilities. Due to the complex nature
of these programs, it may be necessary to perform a more thorough analysis to develop a
complete understanding of their strengths and weaknesses. Such an audit should be
performed by a mission-critical facility specialist, either as a stand-alone service or as part of
a comprehensive facility assessment. Independent audits of the O&M program are in
themselves a best practice for ensuring program effectiveness and will pay for themselves
with increased reliability, uptime, and efficiency.
If you are looking to increase the reliability of your facility, whatever its Tier rating, applying
TIMS-3 principles to an existing infrastructure will minimize risk and maximize your bottom
line. It could even be argued that a Tier II facility operating at TIMS-3 can be more reliable
than a Tier III facility operating at TIMS-2, given the likelihood of higher incidences of human
error in the latter example.
Considerations
In preparing to undertake the establishment of an effective O&M program as defined by
TIMS, the following items must be considered:
1. Scope: What specific actions need to be taken to achieve the desired TIMS tier?
2. Budget: Does your budget allow you to meet your chosen goals?
3. Skills: Do you have the internal skills to manage and perform the activities required?
4. Impact: What is the impact on your business operation of implementing the plan, and
what are the risks?
Conclusion
An organization’s cost of downtime and risk-tolerance level must first be established in order
to determine which TIMS level best matches its goals. This knowledge is a prerequisite for
the development of a realistic maintenance program. Ultimately, the TIMS level achieved will
be determined by resource availability and the commitment of the organization’s management
team to implement and maintain the program over the long term.
When evaluating the entire scope of the mission-critical enterprise, the effectiveness of the
maintenance program is one of the key components that must be factored in to determine the
true level of sustained reliability. The tremendous variability in how maintenance is implemented
can make it difficult to judge what constitutes the proper level of service in a given
situation. Defining maintenance levels and using them to evaluate a given O&M program, as
described in this paper, is a tool for achieving such an understanding.
Resources
Browse all white papers: whitepapers.apc.com
Browse all TradeOff Tools™: tools.apc.com
Appendix A: Structured Maintenance program checklist
CATEGORY                       ITEM                                          OBSERVED   NOT OBSERVED
EFFICIENCY                     Performance Benchmarking
                               Airflow Management Procedures
                               Energy Efficiency Measurement and Reporting
                               Systems Optimization Procedures
                               Continuous Improvement Program
DOCUMENTATION                  Document Management Program
                               Accurate Drawings
                               Critical Facility Work Rules
                               Facility Walk-Through Checklist
                               Standard Operating Procedures
                               Administrative Procedures
                               Shift Turnover Procedures and Log
OPERATIONS MANAGEMENT          Services Scope Description
                               Staff Roles and Responsibilities
                               Vendor Management Procedures
TRAINING                       Training Requirements
                               Qualification Standards
                               Certification Program
                               Individual Training Records
                               Lessons Learned/Near-Miss Program
                               Ongoing Education Program
OPERATIONAL SUPPORT SYSTEMS    Work Order Management System
                               Electronic Document Management System
REPORTING                      Weekly Report
                               Monthly Report
                               Quarterly Performance Report
                               Project Report Template
Appendix B: Glossary
• Asset Database: a comprehensive list of the facility systems and equipment, including
make, model, serial number, capacity, location, system ID, and warranty information.
This is often part of a CMMS (see below).
• Building Management System (BMS): a system designed and implemented to control
and monitor the functions of a building and its associated plant.
• Scopes of Service: a detailed listing of all the maintenance activities required for a
specific piece of equipment and the frequency of each activity. This list usually includes
the manufacturer's suggested maintenance, but may also take into account the
equipment history, experience of the service personnel, and special application
requirements.
• CMMS: computerized maintenance management systems (CMMS) are software applications
that schedule, track, and monitor maintenance activities and provide cost, personnel,
and other reporting data and history.
• DCIM: data center infrastructure management (DCIM) systems collect and manage
data about a data center’s assets, resource use, and operational status throughout the
data center lifecycle. This information is then distributed, integrated, analyzed, and applied
in ways that help managers meet business and service-oriented goals and optimize
the data center’s performance.
• EDMS: electronic document management systems (EDMS) are software applications
that scan, store, and retrieve documents that are used by an organization. In the facilities
environment, these documents are typically facility drawings, Operations and
Maintenance manuals, maintenance contracts, MOPs, SOPs, service reports, etc.
• EOP: an emergency operating procedure (EOP) is a detailed procedure for an emergency
event that is high in probability, consequence, or both. It is prepared in advance
in order to limit the severity and duration of the event. Such procedures are often rehearsed
in drills that combine one or more EOPs, mimicking the behavior of actual
emergency scenarios where multiple failures may occur.
• Manufacturer's Recommended Service: preventative maintenance activities for
specific pieces of equipment as set forth in the manufacturer's Operations & Maintenance
instructions.
• MOP: a method of procedure (MOP) is a detailed work document that is utilized to
perform maintenance on critical systems. The MOP specifies what equipment is being
worked on, who will be performing the procedure, and what tools and safety procedures are
necessary. It also describes the risk, lists the step-by-step procedure, identifies backout
procedures and escalation protocols, contains authorization signatures, and records
maintenance data.
• Onsite Facilities Staff: dedicated on-site facilities staff that focuses on the site critical
systems. This group performs daily walkthroughs, manages vendors, and performs
some level of self-performed service. The facilities staff is responsible for creating and
maintaining all of the site documentation, including MOPs, SOPs, and emergency
procedures. This staff may or may not be providing 24x7 coverage, depending on the
level of service required.
• New Component Testing: pre-testing of components prior to installation in critical
systems. This testing can be performed on-site when possible, but may need to be
done at the factory with appropriate documentation provided.
• PPE: personal protective equipment
• Predictive Maintenance: maintenance activity that's designed to identify precursor
indications of equipment wear or failure. Early warning provided by predictive maintenance
can be used to budget and plan maintenance activities in advance of the need to
actually perform the service. This increases efficiency and reduces the risk of unplanned
outages.
• Quality System: an organization’s arrangements and resources for meeting quality
objectives. It is used to ensure the expected outcome of a particular service activity
and to reduce the risk of service related failures. This includes utilizing a MOP review
process, pre-testing components prior to installation, quality checking the finished work,
and performing periodic program audits.
• Record Drawings: up to date architectural, electrical, mechanical, and equipment
layout drawings that accurately reflect the facility as it was actually built, plus any adds,
moves, or changes that have occurred up to the present day.
• SOP: a standard operating procedure (SOP) is a document that is used to describe
specific steps to be taken to implement a well understood and defined process. An example
would be putting a UPS into bypass or putting a fire system in test mode.
• Tier Rating System: a rating system developed by the Uptime Institute to classify
facility infrastructure reliability in four levels or tiers, from lowest (Tier I) to highest (Tier
IV).
• Training Program: a formal and comprehensive staff training program that defines
various levels of qualification along with a rigorous testing and certification process.
This is used in conjunction with a matrix that identifies specific maintenance tasks and
what the qualification levels are for performing them.
• Vendor Management Program: a systematic program of vendor identification, selection,
management, and evaluation. The purpose is to find competent vendors, document
their qualifications, clearly specify the scope of their activities, obtain competitive
pricing, monitor their performance, and provide feedback.
• Walk-through Check List: a detailed list of critical systems and facility infrastructure
equipment, containing fields for inputting data (such as voltage, temperature, and pressure)
or status checks. This list is used to perform periodic walk-throughs of the facility
to monitor status and create a written record of critical system settings and values.