dslreports borked by power outage, one week and counting
Posted: 24 Apr 2012, 07:16
http://www.dslreports.com has been down for a week, and the webmaster, Justin, has posted a summary of the events that led to this. It seems there are good and bad data farms, and good and bad data recovery personnel, but I'll leave interpretation to the webmasters among us.
Justin wrote a Google Doc page (Google login required) here, and a copy/paste of it is below.
Here is the sequence of unfortunate events
Monday 16th
Main and UPS power failure at data center (DS)
MD3000+MD1000 dual controller, dual host storage array containing 2TB of RAID10 volumes, containing basically everything, comes up confused
Working remotely, identify the key to restoration: 100 GB, which is 50% of an XFS partition
Lacking an entire “dark” duplicate storage stack to bring up, get Dell involved
Tuesday 17th
Dell support says they don’t know the cause, but we must wipe entire array, do firmware upgrades, and start again. I don’t trust this gear.
Check backups, mail: ok, nfs:ok, site files: ok. The sql backup is incomplete.
Since all physical drives are green and, being RAID10, contain two copies, we request that all 13 drives containing this and other volumes be carefully pulled, labelled, and shipped to Lab #1 (Colorado). Lab #1 (and #2) say they see MD3000 problems frequently. I am still relaxed at this point.
DS employee uses his initiative to decide that all disks are hot-pluggable, so pulls them all for shipping, with the equipment powered up. Getting a bit more tense.
Recovery info is presented to Lab #1 identifying LVM2 UUID, storage array config, etc.
Wednesday 18th:
Lab #1 receives and images all drives ok
Thursday 19th:
Lab #1 fails to answer direct questions on whether scan of retrieved images shows missing UUIDs or not. Getting more tense still.
Estimated “success” bill rises to $8k
Friday 20th:
Lab #1 wants the “good half” of the missing filesystem (even though it is on the disks they have) but does not possess a fast internet connection, so it must be FedEx’d. Starting to get a headache.
DS write the “good half” of the missing XFS filesystem to a USB drive and FedEx it, by dropping it into a “Monday pickup” box. Four letter words.
Monday 23rd:
Bring up an old snapshot of the database that was not stored on the storage array, in order to service search engine traffic to forums etc. (the majority of our daily traffic)
Lab #2 give a quote of $6k to $18k, and ask for the entire equipment stack, both storage arrays and the host as well. This is despite the missing chunk being contained within just 6 drives. What a good business to be in!
Tuesday 24th (last few entries are my timezone dates, a day ahead):
Give Lab #1 a couple of days to deliver the goods, and ask for drives to be shipped to Lab #2 as a backup plan. We’re in “wait mode”.
Postscript:
If you have previously extracted a virtual disk from the physical disks in a disk group of a borked MD3000, feel free to drop your business card off to adminhelp (at) dslreports.com. At least as a backup to the backup plan.
My take:



