Tieto's five-day outage disaster started with multiple failures of its EMC VNX5700 array's FAST Cache, according to a Finnish source close to the matter. Tieto is a major IT services organisation across Scandinavia and the Nordic region – although it also provides services globally – and pulls in net sales of SEK17bn (£1.59bn). …
I wonder why they had 4 SSD drives for FAST Cache? I think they are RAID 1, so that means they didn't bother with a spare. I know SSD drives are expensive, but come on!!
Could it be the SSD was running in a RAID 1 and both sides went away at the same time?
RAID 1 is a mirror, i.e. every drive has a spare so 'N+N' redundancy characteristics, which is rather better than RAID 5 (N+1). You're confusing it with RAID 0...
Write count exceeded?
Did they check DAILY how many bad sectors they had on the SSD flash memory? SSDs have a limited number of writes, and when the number of bad blocks starts to rise you have to replace them.
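For what it's worth, the daily check could be as simple as the rough sketch below. It assumes you already collect a bad-block count per drive each day from whatever tool your array offers; the `WearSample` type, the function name and both thresholds are made up for illustration and are not vendor figures.

```python
from dataclasses import dataclass


@dataclass
class WearSample:
    day: int         # day number of the reading (hypothetical unit)
    bad_blocks: int  # bad-block count reported for the drive that day


def needs_replacement(samples, max_bad_blocks=100, max_daily_growth=5):
    """Flag a drive whose bad-block count is too high or rising too fast.

    Both thresholds are illustrative assumptions, not vendor guidance.
    """
    if not samples:
        return False
    ordered = sorted(samples, key=lambda s: s.day)
    latest = ordered[-1]
    # Absolute limit: too many bad blocks already.
    if latest.bad_blocks >= max_bad_blocks:
        return True
    # Trend limit: the count grew too quickly since the previous reading.
    if len(ordered) >= 2:
        previous = ordered[-2]
        days = latest.day - previous.day
        if days > 0:
            growth = (latest.bad_blocks - previous.bad_blocks) / days
            if growth > max_daily_growth:
                return True
    return False
```

The point of checking the trend as well as the absolute count is that a drive can look healthy on paper while it is already on its way out.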
"EMC won't comment on any details..."
Hardly surprising with the smell of lawsuits wafting on the breeze.
Best you can hope for, once their PR and legal lads agree on the wording, is a few words that actually say nothing in an astonishingly vague way, padded with some background puff lifted verbatim from the product literature.
As has been pointed out in the article, the service solution appears to have been inadequate to mitigate the risks of running such a workload on one array and failed to consider the recovery times resulting from possible failures.
If the system wasn't replicated then prudence would have suggested that RAID6 was a better protection method for the pool. Even if it was replicated, RAID6 is a good idea given the time required to move 450, 600 or more GBytes around. Poor Prudence rarely has anyone's ear.
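To put rough numbers on "the time required", here is a back-of-the-envelope sketch. The 50 MB/s sustained rebuild rate is purely an assumption for illustration (real rates depend on the array and its load), and the function name is made up.

```python
def rebuild_hours(capacity_gb: float, rate_mb_per_s: float) -> float:
    """Hours needed to copy capacity_gb at a sustained rate_mb_per_s.

    Uses decimal units (1 GB = 1000 MB). The rate is an assumed figure.
    """
    if rate_mb_per_s <= 0:
        raise ValueError("rate_mb_per_s must be positive")
    seconds = capacity_gb * 1000 / rate_mb_per_s
    return seconds / 3600


# Drive sizes mentioned above, at an assumed 50 MB/s.
for size_gb in (450, 600):
    print(f"{size_gb} GB at 50 MB/s: {rebuild_hours(size_gb, 50):.1f} h")
```

At that assumed rate the rebuild window is 2.5 hours for a 450 GB drive and about 3.3 hours for a 600 GB one. A second drive failure in the same group during that window is fatal for RAID5, whereas RAID6 survives it.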
I can believe it. We just had a FAST Cache problem, along with some others, due to the previous build of FLARE OE. EMC pushed out a new build quite quickly and would not talk about anything until we were on it.
DR, what DR?
>> What needs to be stressed is that Tieto's DR processes were dreadfully inadequate and obviously untested for the eventuality of such a failure. Lawsuits over data loss and business interruptions at Tieto's affected customers are bound to follow. ®
I suspect that they are probably better at writing disclaimers than they are at developing DR plans.
Test your DRP
" inadequate disaster recovery plan involving Networker tape backup files which could not be read"
If you haven't tested your disaster recovery plan, you don't have a disaster recovery plan.
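The smallest possible version of "testing" is to restore a file from backup to a scratch location and compare it with the original. A minimal sketch, assuming you have both copies on disk; the function names are made up, and this says nothing about how Networker itself verifies tapes.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original, restored):
    """True if the restored file is byte-for-byte identical to the original."""
    return sha256_of(original) == sha256_of(restored)
```

A matching checksum only proves that one file came back intact. A real DR test also has to show that the whole service can be brought up from the restored data within the agreed recovery time.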
As one old systems programmer said to me a very long time ago:
"F*ck going forward, as long as we can go backwards we'll be OK." It's a variation on: data you haven't saved in at least two locations hasn't been saved at all!
you mean ...
... there is something wrong with this http://www.youtube.com/watch?v=K-qY_b8lo-k ?
I am sick and tired...
...of telling my customers exactly this "If you haven't tested your disaster recovery plan, you don't have a disaster recovery plan." or "If you have not realistically and regularly fully tested your DR plan, I promise you IT WILL NOT WORK.".
I thought FAST Cache was just cache
What's a little concerning about this to me, and some of the other rumors about failed SSD drives in FAST Cache causing big problems, is that FAST Cache is supposed to be a "cache". When hot blocks get promoted into FAST Cache, the EMC folk I spoke with in the past said the data was copied, not moved. That would cover the reads. As far as writes go, new updates to that block were supposed to get flushed down to spinning disk just as regular cache works. Your primary copy of the data wasn't supposed to be living in FAST Cache and susceptible to data loss. Additionally, if a FAST Cache drive failed and there was no hot spare (which appears to be the case here), FAST Cache was immediately supposed to go into read-only mode.
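To make that concrete, here is a toy model of the behaviour as I understood it. It is only a sketch of the description above, not EMC's implementation: the class and method names are made up, and it assumes a surviving mirror still holds any unflushed blocks when a drive fails.

```python
class ToyCache:
    """Toy model: promoted blocks are copies, and writes get flushed to disk."""

    def __init__(self, backing):
        self.backing = backing   # dict standing in for spinning disk
        self.cache = {}          # promoted copies of hot blocks
        self.dirty = set()       # blocks updated in cache, not yet flushed
        self.read_only = False

    def promote(self, block):
        # Copy, not move: the backing store keeps its own copy.
        self.cache[block] = self.backing[block]

    def read(self, block):
        if block in self.cache:
            return self.cache[block]
        return self.backing[block]

    def write(self, block, data):
        if self.read_only or block not in self.cache:
            # Bypass the cache and drop any stale cached copy.
            self.cache.pop(block, None)
            self.backing[block] = data
        else:
            self.cache[block] = data
            self.dirty.add(block)

    def flush(self):
        # Push every pending update down to the backing store.
        for block in self.dirty:
            self.backing[block] = self.cache[block]
        self.dirty.clear()

    def drive_failed(self):
        # Assumption: the surviving mirror still holds the dirty blocks,
        # so they can be flushed before the cache stops accepting writes.
        self.flush()
        self.read_only = True
```

In this model the only data at risk is whatever sits in `dirty` between a write and the next flush. If both sides of the mirror go before that flush happens, those blocks are gone, and that is exactly the exposure I would like EMC to explain.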
Tieto can't even spell right.
The title of the chart, "Verklig Prestanda Jämnförelse", is not even correct Swedish. It should be spelled "Verklig Prestandajämförelse". Just to give people an inkling of the competence level of this company when they can't even formulate correct sentences with real words...
Their technical skills are evidently a good match for their spelling abilities.