How to keep your datacentre UPS infrastructure in the best possible shape
All business-critical IT infrastructure requires ongoing maintenance to ensure the maximum reliability and performance, and UPS systems are no different. As the old saying goes, “a stitch in time saves nine…” And yet it is conceivable that you could spend too much time on maintenance tasks that aren’t necessary, or which could be more efficiently undertaken by a qualified third party. So, how much is enough and where are the areas of highest concern? What should you be doing throughout the lifetime of your UPS system to ensure it always offers the protection you need?
What happens if you don’t properly maintain UPS infrastructure?
Why are UPS Systems Important?
How much do UPS systems cost?
What kinds of UPS are there?
How do I go about choosing a UPS that’s right for my datacentre?
What happens if the UPS fails?
How can I prevent UPS failure?
Glossary of terms
What happens if you don’t properly maintain UPS infrastructure?
With UPS systems, the stakes are very high. It might sound like a cliché, but failure really isn’t an option. A badly maintained UPS system might fail to work at precisely the moment you need it to save your organisation from crippling and financially ruinous downtime.
The other consequence of a poorly maintained UPS system, and especially of poorly executed UPS maintenance processes, is serious injury or death.
Yes, these things could kill you if you don’t know what you’re doing.
Poor maintenance also means you miss out on strategic opportunities to improve your operational IT performance and reduce inefficiencies and running costs. Instead of going from strength to strength along a timetable of your own choosing, you lurch from crisis to crisis along an unknown path of unexpected events.
On the plus side, UPS systems are based on well-understood, mature technology that is precision manufactured to high-quality standards in a competitive market. All of this means that UPS products have a high probability of working just fine for as long as you need them to. The major sticking point is consumables, i.e. batteries. These wear out over time and may need to be replaced multiple times over the lifetime of the UPS system. However, there are other aspects of the UPS system that also require attention.
7 steps to UPS maintenance nirvana
Each UPS system requires a carefully chosen set of maintenance tasks. Overall, your approach to UPS maintenance should adhere to the following seven principles:
Assign responsibility for preventative maintenance
First up, everyone on your team needs to understand the value of preventative UPS maintenance in stopping the majority of downtime risks that your organisation will face. Maintenance might feel like a chore, but the alternative – i.e. it feels exciting – honestly doesn’t bear thinking about. Sparks flying and people screaming is the opposite of what we’re going for.
Every individual needs to know:
- Precisely what their UPS maintenance responsibilities are at a high level
- How that breaks down into individual tasks assigned to them
- What each task entails, i.e. how to tell if the task has been completed or not
- How often each task needs to be completed and how long it should take
- Where and how to document that tasks have been completed
- Who will carry out their tasks during their planned or unplanned absences
- Who any issues, problems or observations need to be escalated to
This process will throw up more questions and hygiene factors that must also be addressed. Chief among these are that:
- Documentation is essential to describing specific actions and processes involved in completing maintenance tasks, to aid consistency
- A system for recording that tasks are complete is equally essential for establishing audit trails and supporting orderly workload handovers between individuals
- Individuals must be sufficiently skilled/trained/equipped to carry out their assigned tasks
- Tasks must be sufficiently resourced
- There must be clear lines of communications and chains of command
- There must be contingency plans in the event that tasks are not or cannot be completed
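The responsibilities and record-keeping points above could be captured in a simple task record. The following is a minimal sketch rather than a prescribed system: the class name, field names and example values are all hypothetical, and a real deployment would more likely use a ticketing or DCIM tool.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Optional


@dataclass
class MaintenanceTask:
    """One recurring UPS maintenance task and who is accountable for it.

    All field names are illustrative assumptions, not a standard schema.
    """
    name: str
    owner: str            # person normally responsible
    deputy: str           # covers planned or unplanned absences
    escalate_to: str      # who issues and observations are raised with
    interval_days: int    # how often the task must be completed
    done_criteria: str    # how to tell the task has been completed
    completions: list = field(default_factory=list)  # audit trail

    def record_completion(self, when: date, by: str, notes: str = "") -> None:
        """Append an entry to the audit trail."""
        self.completions.append((when, by, notes))

    def next_due(self) -> Optional[date]:
        """Date the task next falls due, or None if it was never completed."""
        if not self.completions:
            return None
        last = max(entry[0] for entry in self.completions)
        return last + timedelta(days=self.interval_days)

    def is_overdue(self, today: date) -> bool:
        """A task that was never completed, or is past due, counts as overdue."""
        due = self.next_due()
        return due is None or today > due
```

Calling `record_completion` after each job builds the audit trail, and running `is_overdue` across all tasks gives a quick list of what needs escalating or handing to a deputy.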
Devise a maintenance schedule aligned with your business expectations
Manufacturers offer maintenance packages alongside the UPS systems they sell, and details of these are easily obtainable, whether or not you subsequently decide to buy them.
However, even manufacturers would point out that such schedules are intended to be generic and cover a broad spectrum of requirements shared by ‘typical’ users. A generic schedule might be too little for what your business needs. It might even be too much.
Car servicing offers a useful comparison. Some car manufacturers recommend servicing your vehicle every 20,000 miles. For others it’s 12,000 miles or even 10,000. The definitions of what constitutes ‘servicing’ also differ between manufacturers, and many even distinguish between types of service (e.g. ‘full’ versus ‘intermediate’) on the same service plan. The effect is two-fold: you can’t compare like for like very well, and the idea of knowing what’s appropriate for your needs has been lost in all the detail.
Car manufacturers don’t make service plans for taxi owners or track-day enthusiasts or Sunday-driving old grannies even though the needs of these groups are extremely varied. UPS manufacturers are equally tied to generic maintenance programmes.
The question is, do you want to run your high-performance, high-stakes UPS infrastructure to minimum standards, or are you better suited to a bespoke regime that befits your datacentre requirements? Do you want maintenance to be part-centric or objective-centric? High environmental governance standards or an extreme sensitivity to downtime are just some of the motivating factors behind developing your own unique schedule of UPS maintenance.
Know when to seek external help
Even a datacentre manager with a high level of confidence and capability around UPS maintenance will seek external skills for some tasks, either to fill gaps internally or to outsource responsibility against a defined service level.
As outlined above, UPS systems present a very dangerous physical environment where you can’t afford to cut corners.
Many organisations will use external contractors simply to avoid the safety risks to their own internal personnel.
However, this underplays the broader value that external expertise can bring to UPS maintenance programmes. Suitably qualified practitioners should be able to:
- Demonstrate a practised, proven approach to maintenance tasks that translates directly into faster and more accurate completion
- Provide assurance of deep technical understanding having attended relevant courses and gained previous experience on your make and model of UPS
- Offer a measured, external perspective on your maintenance needs and posture
- Transfer knowledge of ‘tricks and tips’ to your internal personnel
- Use the appropriate tools and best practice approaches
- Comply with all necessary safety measures
- Fully document actions and recommendations
- Flag up solutions to problems as well as the problems themselves
- Source approved consumables and other replacement parts on your behalf
The other advantage of using external skills is that you can turn to them with general queries and use them as a sounding board for the future evolution of your UPS deployment in the context of your IT and business needs. Ultimately, you can hold them to account for any discrepancies and manage them against service levels that you can benchmark against comparable providers. Key to this is ensuring that responsibilities owned by external third parties and your own staff are clearly demarcated.
Insist on accredited components and skills
Providers of UPS maintenance are a fairly mixed bag, and some may be inclined to offer you an inferior service on the basis of reduced cost. The broader market will be able to offer:
- Individual maintenance tasks ‘on demand’ such as periodic battery testing
- Maintenance packages that do not use tools, components or practices approved by the UPS manufacturer
- Maintenance packages that do use manufacturer-approved protocols but that are offered by providers who do not hold up-to-date or appropriate manufacturer accreditations
- Manufacturer-specified maintenance packages offered by accredited parties
- Bespoke maintenance programmes offered by manufacturer-accredited parties that exceed the basic scope envisaged by manufacturer packages
Accreditation is the antidote to uncertainty. Ignore accredited skills and parts and you’ll only have yourself to blame when something goes wrong.
Accredited UPS maintenance services from accredited providers tend to be more expensive than work commissioned from non-accredited providers and parts obtained from off-brand sources. This is because of the investments that go into keeping accreditations up to date. Achieving the top accreditations (e.g. APC Elite Partner level) requires extensive, ongoing training on the latest versions of the technology. This effectively makes the accredited providers as knowledgeable as the manufacturers themselves!
The other danger with failing to pay attention to accreditations is that you risk invalidating manufacturer warranties. A cautionary example concerns UPS batteries, which require periodic replacement within the UPS system chassis as part of ongoing maintenance activities. Replacing the original manufacturer-approved battery with a similarly rated off-brand equivalent may mean that the warranty for the entire UPS system is no longer enforceable. You may also be compromising the integrity of the UPS platform by using a product made to inferior standards and of unknown provenance.
Utilise remote monitoring/DCIM
Some maintenance tasks can be scheduled months in advance, but others require more urgent intervention. Either way, undertaking UPS maintenance tasks usually means being present inside the datacentre doing something to the physical platform.
Time-pressured managers understandably don’t want to have to be onsite all the time, and can’t be expected to have eyes in the back of their heads. That makes it all the more important to spot any necessary interventions early and deal with them effectively.
The answer comes in the form of DCIM (Datacentre Infrastructure Management) systems that perform comprehensive, remote monitoring of numerous environmental metrics and proactively alert managers when readings reach preset tolerances.
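The alerting idea can be illustrated with a short sketch. This is not how any particular DCIM product works: the metric names and limits below are hypothetical placeholders, and real tolerances should come from your own equipment documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tolerance:
    """Acceptable range for one monitored metric."""
    metric: str
    low: float
    high: float


# Hypothetical limits for illustration only; substitute the figures
# from your own UPS, battery and site documentation.
TOLERANCES = [
    Tolerance("temperature_c", low=18.0, high=27.0),
    Tolerance("humidity_pct", low=20.0, high=80.0),
    Tolerance("ups_load_pct", low=0.0, high=80.0),
]


def check_readings(readings: dict, tolerances=TOLERANCES) -> list:
    """Return one alert message per reading that is missing or out of range."""
    alerts = []
    for tol in tolerances:
        value = readings.get(tol.metric)
        if value is None:
            # A silent sensor is treated as an alert in its own right.
            alerts.append(f"{tol.metric}: no reading received")
        elif not tol.low <= value <= tol.high:
            alerts.append(f"{tol.metric}: {value} outside {tol.low}-{tol.high}")
    return alerts
```

An empty list means every metric is within its range. Treating a missing reading as an alert is a deliberate choice here, since a failed sensor would otherwise look the same as a healthy one.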
Because DCIM helps ensure that risks to uptime are anticipated and alerted, the net result is increased datacentre availability. However, there are other benefits to this technology including:
- Providing deep levels of present and historic visibility into datacentre status and performance in order to better inform future planning
- Establishing a centralised point of control for managing datacentre assets and undertaking necessary equipment changes, alleviating admin wastage and freeing up resources for more strategic activities
DCIM systems can also be offered on a utility ‘as-a-Service’ basis by third-party providers. Alerts could even notify your maintenance provider so it isn’t you who gets disturbed at 2am. All you have to do is read the incident report in the morning to find out how the various issues were successfully dealt with while you were asleep.
The other major advantage with DCIM is its ability to identify datacentre energy efficiency opportunities, thereby cutting running costs and environmental impact.
Make a plan and stick to it
Your UPS maintenance plan should include the following as a minimum:
- Constant measurement of temperature and humidity levels
- Diligent upkeep of clear maintenance logs, with periodic analysis of trends
- Visual inspections looking for signs of heat damage, corrosion and general wear and tear
- Close inspection and testing of connectors and distribution panels, tightening/torquing where necessary
- Pre-emptive battery replacement ahead of anticipated end-of-life
- Additional battery testing to check load and discharge performance against safe levels, replacing batteries ahead of the anticipated schedule where appropriate
- Cleaning of UPS enclosures to remove dust, moisture or chemical leakage
- Non-invasive thermal imaging to detect hotspots
- Annual system checks and complete operational shutdown/battery discharge
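To show how the periodic analysis of trends in maintenance logs might feed pre-emptive battery replacement, the sketch below fits a straight line to logged capacity test results and projects when capacity would fall to a replacement threshold. It rests on two stated assumptions: the 80% threshold is a placeholder for whatever criterion your battery manufacturer specifies, and the straight-line fit is a simplification, since real batteries can degrade faster towards end of life. The function name is hypothetical.

```python
from datetime import date, timedelta


def projected_end_of_life(tests, threshold_pct=80.0):
    """Project when battery capacity will fall to threshold_pct.

    tests is a list of (date, capacity_pct) pairs from the maintenance log.
    Returns the projected date, or None if capacity is not trending down.
    Uses an ordinary least-squares straight-line fit (a simplification).
    """
    if len(tests) < 2:
        raise ValueError("need at least two test results to fit a trend")
    start = min(d for d, _ in tests)
    xs = [(d - start).days for d, _ in tests]   # days since first test
    ys = [capacity for _, capacity in tests]
    n = len(tests)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("test results must span more than one date")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    if slope >= 0:
        return None  # no downward trend to extrapolate
    intercept = mean_y - slope * mean_x
    days_to_threshold = (threshold_pct - intercept) / slope
    return start + timedelta(days=round(days_to_threshold))
```

A projected date is only a prompt to plan replacement earlier, not a guarantee. Scheduling the work comfortably ahead of the projection is in keeping with the pre-emptive approach described in the list above.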
Use maintenance as a springboard for better PUE and uptime
Now that you’ve established that UPS maintenance is a dynamic, preventative process rather than what you do after something has gone wrong, it’s time to embrace it within your wider IT planning activities.
Reviewing maintenance logs and integrating your maintenance processes with DCIM-driven datacentre governance should enable you to:
- Better manage change arising from evolution of IT equipment within the datacentre
- Demonstrate the impact and value of good UPS maintenance to business stakeholders
- Absorb the effects of UPS and other components reaching end-of-life
- Establish a sustainable basis for achieving lower PUE as part of energy efficiency initiatives
- Plan for additional short- and long-term UPS scaling without disruption to operations
- Anticipate the need for enhanced UPS coverage in line with increased energy load
- Facilitate additional UPS redundancy to reduce downtime risks
Glossary of Terms
DCIM: Datacentre Infrastructure Management. DCIM solutions are used to monitor and control both IT and facilities management metrics within a single console.
PUE: Power Usage Effectiveness. The ratio of the total power delivered to a datacentre facility to the power used by the computing equipment within it. The lower the PUE (1:1 would be the lowest theoretically possible), the more efficient the datacentre is at converting its electricity consumption into value-generating IT-driven activity. Cooling datacentre IT equipment is typically the greatest challenge to achieving a low PUE.
UPS: Uninterruptible Power Supply. A battery-based hardware platform that provides a reliable and appropriate level of electrical power, typically to IT systems and datacentres, in the event that mains power is lost.
Uptime: The track record of availability achieved by IT systems over a given period. Uptime is expressed in percentage terms (e.g. 99.999% uptime) and normally covers one year.
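The PUE and uptime definitions above both come down to simple arithmetic, which the following sketch makes concrete. The function names and example figures are illustrative assumptions; PUE is computed as total facility energy divided by IT equipment energy, and the uptime helper assumes a 365-day year.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


def annual_downtime_minutes(uptime_pct: float, days_per_year: float = 365.0) -> float:
    """Minutes of downtime per year implied by an uptime percentage."""
    if not 0.0 <= uptime_pct <= 100.0:
        raise ValueError("uptime must be between 0 and 100 percent")
    minutes_per_year = days_per_year * 24 * 60
    return (1.0 - uptime_pct / 100.0) * minutes_per_year
```

For example, a facility drawing 1,500 kWh in total to deliver 1,000 kWh to IT equipment has a PUE of 1.5, and 99.999% uptime allows a little over five minutes of downtime in a year.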