Thursday, July 24, 2008

Crash!

Gas was at $3.669 at 4 AM this morning.

Server crash at work. Came in at 7 AM Wednesday; at about 8 AM the SCSI alarm started sounding, and when we looked, the RAID5 array was in impaired mode. Before we could replace the failed drive with a spare, a second drive failed. We discovered that the fan that pulls air through the drive bay had quit running; we connected it to a different jack on the main board and it started running again. Put the drives in my fridge for a bit, then started up the server again. All drives came up, but the array was down. Started a rebuild. After 6 hours the fan died again and crashed the array. Nothing was wrong with the fan; there was no power from the jack on the main board. Attached another fan, this time connected to one of the drive power connectors, and rebuilt the array again. When the rebuild completed at 2 AM, there was no data on the array.

So, let's try this. Shoved in an IDE drive with Win2K on it and restored the backup to the drive array. "Boot Failure". Inserted the W2KSP4 Server CD and ran the recovery console: damaged partition. Created the partition, booted from the IDE drive and restored the data to it. "Boot Failure" again, and the recovery console showed a damaged partition. Installed Windows Server and tried to restore over top of it. NTBackup crashed partway through the restore, 3 times.

Well, Windows is running fine, so let's restore the data, then we will re-install the rest of the software. It's 7 AM now. I went home last night at 5 PM while the others stayed and worked on it; I came back in at 4 AM. We have 35 GB of data to restore, so it will take some time. Updates later.

7 AM: Renamed the server with the correct name and joined it to the domain. Then the system state could be restored, along with the rest of the data. Didn't have to reinstall software.

11 AM: Server came back online at about 9 AM with a couple of minor glitches, not least of which was some problems with our contracted AD/Exchange provider. We are still having problems because of that (no email; can't join server to domain). Sent the other guys home shortly after 8 AM; they were getting punchy.

1 PM: Our AD provider appears to have their problems sorted (but they STILL haven't returned my call) and the server is now stable. Finally.

28 hours is FAR TOO LONG for a server of this type to be down. It basically shut down two entire departments. A couple of items to note:

1. This server is 6 years old.

2. It was built from a standard PC by mounting the main board in a rack box and adding a SCSI card.

Considering these items and the fact that it has run well for those years with only occasional reboots, it has done extremely well. It looks as though we may be allowed to order a true server as a replacement before the end of this week. When this was originally built, they needed the server NOW for a specific purpose, but the project was already over budget when they determined that they needed a server to run the app?! Now it is a mission-critical server, so they will finally part with the money.

Amazing what happens when the server hosting the accountant's spreadsheets fails. And no, that had nothing to do with it. This was not premeditated; we couldn't have planned something like this. Others create enough headaches for us; we don't need to create our own.
