dslreports borked by power outage, one week and counting

Posted: 24 Apr 2012, 07:16
by rolf
http://www.dslreports.com has been down for a week, and the webmaster, Justin, has posted a summary of the events that led to this. It seems there are good and bad data farms, and good and bad data recovery personnel, but I'll leave interpretation to the webmasters among us. :A

Justin wrote a Google Doc page (Google login required) here, and there is a copy/paste below.

Here is the sequence of unfortunate events

Monday 16th

Main and UPS power failure at data center (DS)
MD3000+MD1000 dual controller, dual host storage array containing 2TB of RAID10 volumes, containing basically everything, comes up confused
Working remotely, identify the key to restoration: 100 GB, 50% of an XFS partition
Lacking an entire “dark” duplicate storage stack to bring up, get Dell involved

Tuesday 17th

Dell support says they don’t know the cause, but we must wipe entire array, do firmware upgrades, and start again. I don’t trust this gear.
Check backups, mail: ok, nfs:ok, site files: ok. The sql backup is incomplete.
Since all physical drives are green and, being RAID10, contain two copies, we request that all 13 that contain this and other volumes be carefully pulled, labelled, and shipped to Lab #1 (Colorado). Lab #1 (and #2) say they see MD3000 problems frequently. I am still relaxed at this point.
DS employee uses his initiative to decide that all disks are hot-pluggable, so pulls them all for shipping, with the equipment powered up. Getting a bit more tense.
Recovery info is presented to Lab #1 identifying LVM2 uuid, storage array config, etc

Wednesday 18th:

Lab #1 receives and images all drives ok

Thursday 19th:

Lab #1 fails to answer direct questions on whether scan of retrieved images shows missing UUIDs or not. Getting more tense still.
Estimated “success” bill rises to $8k

Friday 20th:

Lab #1 wants the “good half” of the missing filesystem (even though it is on the disks they have) but do not possess a fast internet connection so it must be FedEx’d. Starting to get a headache.
DS write the “good half” of the missing XFS filesystem to a USB drive and FedEx it, by dropping it into a “monday pickup” box. Four letter words.

Monday 23rd:

Bring up an old snapshot of the database that was not stored on the storage array, in order to service search engine traffic to forums etc. (the majority of our daily traffic)
Lab #2 give a quote of $6k to $18k, and ask for the entire equipment stack, both storage arrays and the host as well. This is despite the missing chunk being contained within just 6 drives. What a good business to be in!

Tuesday 24th (last few entries are my timezone dates, a day ahead):

Give Lab #1 a couple of days to deliver the goods, and ask for drives to be shipped to Lab #2 as a backup plan. We’re in “wait mode”.


postscript:
If you have extracted a virtual disk from the physical disks in a disk group, from a borked MD3000 before, feel free to drop your business card off to adminhelp (at) dslreports.com. At least as a backup to the backup plan.


My take:


Re: dslreports borked by power outage, one week and counting

Posted: 24 Apr 2012, 12:13
by viking60
Yes, I don't even know who wants to listen to all that crap. Talking "Chinese" to wash your hands is a disease in the computer business.
If you deliver your car to get fixed, your employer would not accept a lot of talk about what has been done to the clutch or where it has been sent and what they have done etc.

People even let salespeople get away with it. I have a lot of fun in computer stores when that young "world champion" salesperson always assumes that I have absolutely no clue and throws in a lot of complicated words that he assumes I do not understand.

You see, he is convinced that the more he can say without me understanding anything, the more it simply must impress me.
So I play along and act exactly as stupid as he thinks I am.
So having heard about this fantastic PC with Optimus +++ I ask him what Optimus means. It often turns out that they have no clue what they are talking about.

So basically they must impress themselves too :mrgreen:

In this case I would say:
If you use RAID 10 with all that redundancy, why do you send all your disks away?

Being shocked and wondering "who has pissed in my pants?" may look dynamic, but does not cut it. :hand:

This is not force majeure.

Re: dslreports borked by power outage, one week and counting

Posted: 24 Apr 2012, 12:59
by R_Head

Re: dslreports borked by power outage, one week and counting

Posted: 28 Apr 2012, 09:54
by rolf
viking60 wrote:..
In this case I would say:
If you use RAID 10 with all that redundancy, why do you send all your disks away?

Being shocked and wondering "who has pissed in my pants?" may look dynamic, but does not cut it. :hand:

This is not force majeure.


I don't know what the exact circumstances are but it seems to me this guy developed the website out of his house, has a family (spoke of his attention being divided by his son's birthday, last weekend), and, perhaps, is married to a fetching young thing who's not too fond of server racks in her walk-in closet.. :confused

There's a not too encouraging update:
Update 27th April:
Whenever we’ve had the possibility of a good or bad outcome, we get dealt the bad outcome. Inspection of all disks has so far not supported any better theory than the SAN (storage array) over-wrote enough data to make full restoration a fragmented process, at best. I would like to dump technical detail as it seems a ludicrous scenario, but cannot until the analysis is complete. I’m not going to be convinced until we get more than just one professional opinion.

In the event things drag on for longer I could unlock the site for “normal” use, leaving space for whatever data recovery turns up, however this in itself is tricky. Meanwhile I will free up more older but still useful pages. This is no help to members who wish to use restricted forums or get to their instant messages, or post in forums, but it at least heads off some of the search engine damage created by unavailable URLs.

I hope we’ll catch a break in the next few days and can post something more optimistic!

Re: dslreports borked by power outage, one week and counting

Posted: 28 Apr 2012, 10:37
by viking60
Looks like he needs the attention
Image
To help him structure it better you can give him a simple set of rules:
Image

Re: dslreports borked by power outage, one week and counting

Posted: 28 Apr 2012, 11:06
by rolf
I'm not getting the hostility. Assuming the guy is paying for services, there might be a reasonable expectation of a certain level of assurance that the data will be protected; it is the job of this service provider, isn't it? If, in fact, the server installation was mismanaged and/or data recovery steps were bumbled, the client has a right to grouse about that, imo. Nobody can guarantee everything. A lesson is learned only if it is understood what went wrong and where. dslreports.com denizens are suffering withdrawals, no doubt, and probably appreciate a little of the journal. At any rate, it is better than being left in the dark.

Re: dslreports borked by power outage, one week and counting

Posted: 28 Apr 2012, 12:04
by viking60
Ah I am not sure I get the situation then. So I might be wrong above.
It might be the job of his service provider in that case. So the tips above are for the provider then.
The description of the complexity of the error is not a reassuring policy. That is my point. I always get annoyed when I have a deal breaker who starts describing in detail why he cannot keep the deal.
People want what they are promised - It is that simple - and that brutal. And moving the explanation down the chain is no good solution. In some cases that is all you are left with though - that is why we have insurance.
If he is a private person, not a business - it is a whole different enchilada.

But I am still confused - I thought he was the ISP? Dealing with typical ISP hardware?

Re: dslreports borked by power outage, one week and counting

Posted: 28 Apr 2012, 12:25
by rolf
He, as I understand, created the website, dslreports.com, maintains the software, wrote some custom code. Homegrown operation that got pretty big, over the years, volunteer moderators, still just him behind the scenes, with ultimate control, weighing in on forum threads from time to time, sound familiar? :coffee_smile:

He's fought off DDoS attacks as traffic has grown to make the site an attractive target and, I'm sure, gets some sort of ad revenue on the pages served to not-logged-in readers. It's a pretty comprehensive site but I still see him as a sole proprietor, afaik, not an ISP, just a webmaster.

The site attracts the more computing-literate type, so the industry jargon is not so much a smoke screen as just part of day-to-day stock in trade. Sure, he made the choices but I don't know that he did anything particularly foolish, could be more the victim of circumstances and, possibly, the choices made by those who house the website data.

Re: dslreports borked by power outage, one week and counting

Posted: 28 Apr 2012, 15:00
by viking60
In that case he deserves all the support he can get :coffee_cup: +1
And I take back all the criticism. :oops: We need people like that; it makes the place called the internet interesting...

Re: dslreports borked by power outage, one week and counting

Posted: 28 Apr 2012, 15:39
by rolf
:cheers

Re: dslreports borked by power outage, one week and counting

Posted: 12 May 2012, 21:55
by rolf
Mostly, the site is back up. There is some update, perhaps of some interest to some bodies, about the expense of such an event:
Update: 7th May
I am told by the second lab the missing data is intact up to the time of the power fail. I have to now wait for payment to go through and return by courier of the media. Then there will be time spent checking the data, and restoration of both data and hardware without adding too much downtime. For the curious: this event has cost $28,000 in lab recovery fees. This is not including any money that we must now spend on new hardware, does not include financial impact of downtime, permanently lost traffic, and so on.

Re: dslreports borked by power outage, one week and counting

Posted: 12 May 2012, 22:39
by dedanna1029
Wooooooooooooooooooooooowww....