UPS woes

I finally got around to moving our bigger VMs over to our larger BL 690 blades, having finished upgrading the RAM in them to 160GB each, which gives us decent headroom for running some large SQL boxes specified with 32GB of RAM.

The kicker for this was that one Sunday morning around 3am we lost power to the server room. We have two UPSes, one for the site and one for the server room. The server room one is about four years old and had been serviced about two weeks previously; to give you some perspective, it is a steel box about the size of half of one of those shipping containers you see on the decks of big ships.
However, at the appointed time everything in the control panel and all the capacitors went bang, we know not why, while the other UPS was quite happy to carry on. The net result was a power blip that crashed all the storage and all the hosts.

As all our servers are set to remain off in this situation, to avoid multiple crashes and allow us to bring everything up in the correct order (SQL then app servers and so on), it was a case of sorting out the power and making sure all the feeds were working, all the breakers were set and we were good to go. The storage was already up and all but one blade powered on; that one needed a pull and re-seat to spark into life.
Now we could play hunt the VC server. As part of our DR strategy I use a utility called RVTools, which can be run against the VC's database and produces just about every stat you could want on the state of your VMware environment. The best bit is that you can run a script to output this information to a spreadsheet, which you can email off or copy to another location; in our case we copy it to our DR site. That meant I could load up the spreadsheet, find which hosts the VC server and its database server were on at 6am the previous morning, connect to those hosts directly and boot those servers, and then manage the start-up of everything else in a nice ordered way. Before booting them, I migrated the relevant large servers to the new blades and booted them there. I also moved the spare 5th blade to the Dev environment.
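If you want to set up something similar, the sketch below shows the idea in Python rather than a batch file. The install path, vCenter name, service account and DR share are all made-up placeholders, and RVTools' command-line switches (-s, -u, -p, -c ExportAll2xls, -d, -f) have changed a little between versions, so check them against the documentation for your copy before relying on this.

    import shutil
    import subprocess
    from datetime import datetime
    from pathlib import Path

    # All paths, server names and the account below are placeholders for this sketch.
    RVTOOLS_EXE = r"C:\Program Files (x86)\RobWare\RVTools\RVTools.exe"
    EXPORT_DIR = Path(r"C:\RVTools\exports")
    DR_SHARE = Path(r"\\dr-site\rvtools")   # the offsite copy you read from during DR
    VC_SERVER = "vcenter.example.local"

    def export_and_copy():
        """Dump the environment to a spreadsheet and copy it to the DR site."""
        EXPORT_DIR.mkdir(parents=True, exist_ok=True)
        filename = "rvtools_{:%Y%m%d_%H%M}.xls".format(datetime.now())

        # RVTools' command-line export; the switch names differ slightly
        # between versions, so verify them against your copy's documentation.
        subprocess.run(
            [RVTOOLS_EXE,
             "-s", VC_SERVER,
             "-u", r"DOMAIN\svc_rvtools", "-p", "********",
             "-c", "ExportAll2xls",
             "-d", str(EXPORT_DIR),
             "-f", filename],
            check=True,
        )

        # Copy the spreadsheet offsite so it survives losing the server room.
        shutil.copy2(EXPORT_DIR / filename, DR_SHARE / filename)
        return DR_SHARE / filename

    if __name__ == "__main__":
        print("Export copied to", export_and_copy())

Scheduled as an early-morning task, the newest spreadsheet at the DR site tells you where every VM, including the VC server itself, was running shortly before the lights went out.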
I thought about upgrading to 4.1 at the same time, but the upgrade wasn't working, so I just concentrated on getting the servers back, upgrading the VMware Tools on the larger servers I had migrated to the new cluster.

All the servers came back with no issues and all was well. However, we had a blown UPS in an unknown state. Power is not an IT responsibility, we could find no 24/7 support numbers, and the ones we did have just went to an answerphone, so we were left with a fingers-crossed situation until Monday morning.

Come Monday we got a guy in to inspect the damage, and it was terminal. Lots of parts were needed, so when could they get them here, an hour, two? End of the week, if they could find some in Europe, was the answer.

As you can imagine, the reports of high solar flare activity and likely power fluctuations caused a few sleepless nights, and on Tuesday afternoon we lost the whole office when a spike tripped out the site UPS and the warehouse and the offices went down. Amazingly, the server room was unaffected.

All the parts turned up, the UPS was fixed and switched back on, and we are now protected again.

The moral of the story is that when someone else buys something to protect your equipment, make sure they sign up for the support you expect and that the equipment can be repaired in a timely manner. Don't rely on them having the same ideas as you about what is important, and ask the big question: if the unit stops working, how long will it take to repair or replace, and do you keep spare parts in your own stores in this country?
