Do it once, do it right... or ask someone who knows to check
Why is it that people don't seem to be able to do things right first time? One of the most important aspects of the system admin role is to make sure you have backups, and good backups at that, and if at all possible copies of your backups held many miles away, on different types of media.
One of our sites has had some problems with their backups. One guy got halfway through fixing it before disappearing off on holiday, leaving no backups over the holidays and no end-of-year backups. Wouldn't mind being a fly on the wall when he has responsibility explained to him on his return.
It was suggested that the remaining tech bloke on site should probably be looking to rebuild the server, starting again from scratch. This drew the puzzling response of "I don't know how to rebuild a server". That bang was my jaw hitting the deck.
Anyway, it now seems to fall to me to fix the problem of a server that has issues with one of its online disks and a Backup Exec (why did Symantec buy all the good companies and turn them into bloated shadows of their former selves?) install that is not entirely happy.
The server in question is about 200 miles away, so this is all going to have to happen via RDP, with some input from the local tech if required. First off was a firmware update to everything, then update the drivers and OMG I am starting to sound like a Dell tech support person...
This didn’t help much. Although Backup Exec would happily inventory a tape, it would fall over (or more specifically the Job Engine would fall over) when trying to run a backup job, with disk errors and reports of not being able to read a catalogue file. The disk in question is a couple of locally attached drives in a RAID 1 config. However, in another stunning example of not knowing what you are doing, the drives are in slots next to each other. On a ProLiant DL380 G5 there are two controllers, one for the first 4 slots and one for the second 4, so putting 4 drives in the first 4 slots is never going to max out your performance or give any kind of controller redundancy. There is also a third drive, for backup-to-disk testing, but this does not look like it has been set up correctly: it was not set as a bound volume on the iSCSI initiator, so Windows would happily start up, as would all services, before this drive became available. Services would then fail or stall when they could not see the expected resources. Way to go, guys.
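The proper fix for the start-up race is binding the volume in the iSCSI initiator, as described above. Purely as an illustration of the problem, here is a minimal Python sketch of a stopgap: poll for the volume before anything that depends on it is started. The function name, the timeout values and the `E:\` path in the usage note are all hypothetical and not taken from this server.

```python
import os
import time


def wait_for_volume(path, timeout=120.0, interval=5.0,
                    exists=os.path.exists, sleep=time.sleep,
                    clock=time.monotonic):
    """Poll until `path` exists or `timeout` seconds have passed.

    Returns True if the path appeared in time, False otherwise.
    `exists`, `sleep` and `clock` are injectable so the loop can be
    exercised without a real disk or real waiting.
    """
    deadline = clock() + timeout
    while True:
        if exists(path):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A start-up script could call something like `wait_for_volume("E:\\")` (hypothetical drive letter) and only kick off the dependent services once it returns True. It papers over the misconfiguration without fixing it, so treat it as a temporary measure at best.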
Now we all know we should read the manual, and we all know manuals can be very large, difficult to read, and short on real information or clear instructions on how to do things, but at least you can ask someone who knows what they are doing to check if everything is set up right.
Anyway, back to the plot. Tomorrow will be a case of salvaging what we can from the server, recreating the failing array, copying everything back and seeing if that fixes the problem. Otherwise we could be looking at a complete rebuild, which is a pain with BE, as restoring DBs and jobs is a fraught business; I have only ever managed to recover catalogues before.