What happened? And what now?

Now that things are completely back online, there are probably two questions you want answered. Dickson talked a little about the problems earlier today. I’ll do my best to address the questions in a little more detail here.

First, what happened? Quite simply, we had a hard drive failure, which, while not that uncommon, is certainly a pain in the ass. Hard drives fail for many different reasons, and we don’t know the exact reason ours failed. What we do know is that we were left with a big problem. We had really old backups that worked properly, and we had newer backups that we assumed would work too, but they hadn’t been tested. Call it bad luck, or Murphy’s Law, or whatever you want, but our newer backups did not work. And the most recent tested backups we had were on the hard drive that died. Lots of important data, and no way to get at it.

As you now know, we got the data back. How? Data recovery services. I think I learned more about hard drives and data recovery in the last two weeks than I ever thought possible, and certainly more than I ever wanted to learn. Data recovery services range from small, basic operations to very large, sophisticated services with clean rooms and a bunch of other high-precision technology. Unfortunately, our drive required the latter.

So now that the painful experience of losing a bunch of important data is behind us, with the data safely recovered (save for a corrupt Active Directory database that had to be rebuilt), what now? As you might guess, we never want to go through the experience again. So we’ve taken a number of steps to ensure we don’t have to:

  • We added a “hot” or “live” backup server. This means if the main server fails, everything automatically switches to the backup server so that we can fix things without having to take the systems offline.
  • We’ve added regular hard drive imaging to our set of backup tools.
  • We also added a new tape backup system, with regularly scheduled backups and restoration tests.
  • We’ve created hard copies of important system configurations and settings so that if something does need to be rebuilt or validated, we’ve got it on paper.
  • We’ve got spare components (including hard drives) ready at all times, so we don’t have to wait for the store to open to replace something if it fails. Our servers all share the same hardware, so this was a logical step.
  • And we’ve become a little paranoid about losing data, so we create images and burn stuff to DVDs at every possible opportunity.
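The lesson about untested backups can be made concrete with a small restoration test. What follows is only a minimal sketch, not the tooling we actually use: it assumes a backup has already been restored into a scratch directory, and the function names (`file_digest`, `tree_digests`, `verify_restore`) are made up for illustration. It compares SHA-256 checksums of every file in the original tree against the restored copy.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root):
    """Map each file's path (relative to root) to its digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_restore(original_dir, restored_dir):
    """Return a list of problems; an empty list means the restore matches.

    Hypothetical helper: it only checks that every original file exists in
    the restored tree with identical contents.
    """
    original = tree_digests(original_dir)
    restored = tree_digests(restored_dir)
    problems = []
    for rel, digest in original.items():
        if rel not in restored:
            problems.append(f"missing: {rel}")
        elif restored[rel] != digest:
            problems.append(f"corrupt: {rel}")
    return problems
```

Run on a schedule against a freshly restored copy, a check like this turns "we assumed the backups would work" into something that fails loudly while the original data is still around.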

Taken together, the things we’ve done should ensure that we don’t have such catastrophic downtime or data loss ever again. You know what they say: never say never. But I think we’re in good shape.

I’m really sorry to all the bloggers and websites we host, and we thank you for your patience and understanding. Now, on with the show!

4 thoughts on “What happened? And what now?”

  1. Everything is good now, and I am glad you have done things to ensure this problem doesn’t happen again. Remember one additional thing: MAKE SURE YOU HAVE YOUR MOST RECENT BACK-UP TAPE OFFSITE. That is the proper procedure for disaster recovery.

  2. EASEUS DataRecoveryWizard is a complete range of data recovery software
    for all Windows operating system platforms and supports various file
    systems including FAT, FAT16, VFAT, FAT32, NTFS, NTFS5 on
    various storage media. EASEUS DataRecoveryWizard ensures safe and precise
    file recovery against numerous threats like accidental file deletion and
    disk formatting and so on.

    For more detail:
    http://www.easeus.com
    http://www.ptdd.com
