EQ woes go international

Well, that was a day and a half.
Started at 05:30 this morning, the first early shift in the rotation. I never sleep well before these and was awake at 4am with the usual alarm-clock-not-going-off panic.

Dell engineer due in at 8am; I did a quick firmware upgrade on our out-of-sorts EqualLogic sumo in the meantime. A triple bypass is in the pipeline – disk, controller and SATA card.

The engineer turns up at 11am – not actually a Dell man but a sub-contractor, nice enough but not exactly filling me with confidence.

The disk replacement goes well – well, I could have done that. The controller is not so perfect, but we get there in the end after a quick call to Dell EU central in India, who then talk us through the SATA card replacement. This is where it all went a bit Pete Tong.
The support guy was telling us to pull the left-hand card, the unit was saying right, and omg… it hasn't fallen over. However, when we enabled the SATA card the brown stuff hit the fan.

Suddenly we are looking at 6 failed disks, an array that is offline, and all our DEV and QA systems down. 12:30pm.

In case you don't know the internal workings of one of these units: there are 48 disks in 4 RAID 5 arrays – three 14-disk arrays and a 4-disk array – plus two spare disks, so 6 failed disks are a calamity of epic proportions.

After a lot of mucking about – restarting everything, swapping controllers, swapping the old hardware back in – we found ourselves transferred to EqualLogic's Central US support, which I think is in New Jersey.

We are on to 3rd line support now and I am laughing.

It's weird: I spend most of my life getting stressed and grumpy about inconsequential rubbish, but when it all goes tits up the adrenaline kicks in and I just feel happy and relaxed.

First we need to get a remote session running so they can see what's what, but US WebEx is dicking around – the Indian WebEx was ok and the Irish WebEx is ok, but the US one just green-screens.
We start some diagnostics, which we have to restart a couple of times, excluding a couple that freeze the process, and send the output to the US.
In the meantime Dell Ireland gets a WebEx session up and the US manage to join in. They can now run some quick stuff on the command line and talk amongst themselves. 4pm.

By now I have had a couple of conversations with my boss, his boss and the director chappy to keep them all in the loop. The worst-case scenario is a restore of the SAP QA and Dev servers and a rebuild of the 30-odd Dev and QA servers. Big Boss man is philosophical about it all: boxes are ticked on the process, and we are where we are.

It's going to take about 2 hours to review all the diagnostic data, so I go home for food and a clean-up, as I have spent 4 hours in the server room and haven't eaten since 8am. 4:30pm.

I return at 5:45, catch up with where they are, and notice everyone else has gone home. This I see as a good thing: they trust me to oversee someone else fixing this steaming pile.
L3 are confused and try many things. They talk amongst themselves a lot whilst I am on hold. But hey, they don't seem to have seen this before, so I go easy on them. 6pm.
More tests and suggestions, then the ultimate: “I think we need to get Engineering in here to look at this because it doesn't make sense.” Not to me either, as there's an array with 12 numbered disks and two numbered F. 7pm.
The propeller-heads arrive like the best Wild West cavalry, all big hats and trusty steeds, into the L3's cube.
After another hour of a whole heap of command-line stuff, the engineers hand over a unit in the same state it was when we started this morning, except slightly worse because all the failed drives are rebuilding.

Now, to me, if you have a RAID 5 array and 2 disks go, that's it, you're dead. But in the EQ world that is not the case: we have one array with three disks rebuilding, with reduced performance in the meantime. Still, I can't complain too much, as that is somewhat better than where we were an hour ago.
3 disks are to be replaced, but we are pretty much back where we started. 9:30pm.

What about the servers, I hear you ponder. Well, they are all VMware, and apart from one out of the 30 they all seem to have coped without any problems. Every 25 minutes or so you see event log entries for disk problems, then they just say “oh look, disks, let's feed” and everything is right with the world. Just the one server looks like it crashed – not too shabby considering they spent 7½ hours without any disks.

16 hours from start to finish and no further forward with the initial problem of a disk that won't talk to a controller.
Let’s see what Monday brings.


PS A big thanks to the Dell US guys who stuck with it and seem to have fixed it, and also to the Dell Ireland guys for sorting out the WebEx problems.
