Flash drive meltdown fingered in Swedish IT blackout

Tieto's five-day outage disaster started with multiple failures of its EMC VNX5700 array's FAST Cache, according to a Finnish source close to the matter. Tieto is a major IT services organisation across Scandinavia and the Nordic region – although it also provides services globally – and pulls in net sales of SEK17bn (£1.59bn …

COMMENTS

This topic is closed for new posts.

Monday 16th January 2012 09:31 GMT Anonymous Coward

so

I wonder why they had 4 SSD drives for FAST Cache? I think they are RAID 1, so that means they didn't bother with a spare. I know SSD drives are expensive, but come on!!

1 0
1. Monday 16th January 2012 11:52 GMT Destroy All Monsters
  
  Could it be the SSD was running in a RAID 1 and both sides went away at the same time?
  
  Relevant: http://www.codinghorror.com/blog/2011/05/the-hot-crazy-solid-state-drive-scale.html
  
  0 0
  1. Tuesday 17th January 2012 08:50 GMT Mike Schwab
    
    Write count exceeded?
    
    Did they check DAILY how many bad sectors they had on the SSD flash memory? SSDs have a limited number of writes, and when the number of bad blocks starts to rise you have to replace them.
    
    0 0
2. Tuesday 17th January 2012 07:53 GMT PeteA
  
  Title is optional
  
  RAID 1 is a mirror, i.e. every drive has a spare so 'N+N' redundancy characteristics, which is rather better than RAID 5 (N+1). You're confusing it with RAID 0...
  
  0 0
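
To put numbers on the RAID 1 versus RAID 5 point above, here is a minimal sketch that enumerates drive failures for a four-drive set. The layout is an assumption made for illustration (four drives arranged as two mirrored pairs, compared with the same four drives as a single 3+1 RAID 5 group); it is not a description of how Tieto's array was actually configured.

```python
from itertools import combinations

DRIVES = [0, 1, 2, 3]
# Assumed layout: drives 0+1 mirror each other, drives 2+3 mirror each other.
MIRROR_PAIRS = [(0, 1), (2, 3)]


def mirrored_survives(failed):
    """Mirrored pairs keep data as long as no pair has lost both members."""
    return all(not (a in failed and b in failed) for a, b in MIRROR_PAIRS)


def raid5_survives(failed):
    """A single RAID 5 group tolerates at most one failed drive."""
    return len(failed) <= 1


def survivable(k, survives):
    """Return (surviving combinations, total combinations) for k failed drives."""
    combos = list(combinations(DRIVES, k))
    return sum(1 for c in combos if survives(set(c))), len(combos)


for k in (1, 2):
    print(
        f"{k} failed:",
        "mirrored pairs", survivable(k, mirrored_survives),
        "| RAID 5", survivable(k, raid5_survives),
    )
```

Under these assumptions both layouts survive any single failure, but only the mirrored pairs survive some double failures (those that hit different pairs). Losing both halves of one mirror is still fatal, which is the scenario Destroy All Monsters raises.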
Monday 16th January 2012 09:35 GMT TeeCee

"EMC won't comment on any details..."

Hardly surprising with the smell of lawsuits wafting on the breeze.

Best you can hope for, once their PR and legal lads agree on the wording, is a few words that actually say nothing in an astonishingly vague way, padded with some background puff lifted verbatim from the product literature.

0 0
Monday 16th January 2012 10:00 GMT Wanda Lust

Pooled disks

As has been pointed out in the article, the service solution appears to have been inadequate to mitigate the risks of running such a workload on one array and failed to consider the recovery times resulting from possible failures.

If the system wasn't replicated, then prudence would have suggested that RAID 6 was a better protection method for the pool. Even if it was replicated, RAID 6 is a good idea given the time required to move 450, 600 or more GBytes around. Poor Prudence rarely has anyone's ear.

2 0
Monday 16th January 2012 10:40 GMT Terafirma-NZ

I can believe it. We just had a FAST Cache problem, along with some others, due to the previous build of FLARE OE. EMC pushed out a new build quite quickly and would not talk about anything until we were on it.

0 0
Monday 16th January 2012 11:53 GMT steogede

DR, what DR?

>> What needs to be stressed is that Tieto's DR processes were dreadfully inadequate and obviously untested for the eventuality of such a failure. Lawsuits over data loss and business interruptions at Tieto's affected customers are bound to follow. ®

I suspect that they are probably better at writing disclaimers than they are at developing DR plans.

0 0
Monday 16th January 2012 11:54 GMT Andy Moreton

Test your DRP

"inadequate disaster recovery plan involving Networker tape backup files which could not be read"

If you haven't tested your disaster recovery plan, you don't have a disaster recovery plan.

7 0
1. Monday 16th January 2012 13:15 GMT Field Marshal Von Krakenfart
  
  As one old systems programmer said to me a very long time ago:
  
  "F*ck going forward, as long as we can go backwards we'll be OK." It's a variation on "data you haven't saved in at least two locations hasn't been saved at all!"
  
  0 0
2. Monday 16th January 2012 13:38 GMT Bronek Kozicki
  
  you mean ...
  
  ... there is something wrong with this http://www.youtube.com/watch?v=K-qY_b8lo-k ?
  
  0 0
3. Monday 16th January 2012 16:13 GMT circusmole
  
  I am sick and tired...
  
  ...of telling my customers exactly this: "If you haven't tested your disaster recovery plan, you don't have a disaster recovery plan." Or: "If you have not realistically and regularly fully tested your DR plan, I promise you IT WILL NOT WORK."
  
  0 0
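
The "test your restores" advice in this thread can be sketched in a few lines. This is only an illustration under stated assumptions: a gzip-compressed tar archive stands in for the tape backup, and the helper names `backup` and `restore_and_verify` are hypothetical. It has nothing to do with how Networker itself works.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def sha256(path):
    """Checksum of one file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def backup(src_dir, archive):
    """Write src_dir into a gzip-compressed tar archive (stand-in for a tape backup)."""
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src_dir, arcname=".")


def restore_and_verify(src_dir, archive):
    """Restore into a scratch directory and compare every file with the source.

    Returns False if the archive cannot be read or any file is missing or differs.
    """
    src_dir = Path(src_dir)
    with tempfile.TemporaryDirectory() as scratch:
        try:
            with tarfile.open(archive, "r:gz") as tar:
                tar.extractall(scratch)
        except (tarfile.TarError, OSError, EOFError):
            return False  # the backup exists but cannot be read back
        for f in src_dir.rglob("*"):
            if f.is_file():
                restored = Path(scratch) / f.relative_to(src_dir)
                if not restored.is_file() or sha256(restored) != sha256(f):
                    return False
    return True
```

The point of the sketch is that the check exercises the restore path, not just the backup job's exit status. A real DR test would also restore onto separate hardware and time the whole process.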
Tuesday 17th January 2012 09:13 GMT Hoosier Storage Guy

I thought FAST Cache was just cache

What's a little concerning about this to me, and some of the other rumors about failed SSD drives in FAST Cache causing big problems, is that FAST Cache is supposed to be a "cache". When hot blocks get promoted into FAST Cache, the EMC folk I spoke with in the past said the data was copied, not moved. That would cover the reads. As far as writes go, new updates to that block were supposed to get flushed down to spinning disk just as regular cache works. Your primary copy of the data wasn't supposed to be living in FAST Cache and susceptible to data loss. Additionally, if a FAST Cache drive failed and there was no hot spare (which appears to be the case here), FAST Cache was immediately supposed to go into read-only mode.

2 1
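
The copy-versus-move distinction in the comment above can be illustrated with a toy model. This is a sketch under loud assumptions: `ToyCache` is a hypothetical class, a dict stands in for spinning disk, and nothing here reflects EMC's actual FAST Cache implementation. It only shows why it matters whether the sole current copy of a block can live in the cache.

```python
class ToyCache:
    """Toy block cache in front of a backing store (a dict standing in for disk)."""

    def __init__(self, backing, write_back):
        self.backing = backing
        self.write_back = write_back  # True: writes are held in cache until flush()
        self.cache = {}
        self.dirty = set()  # blocks whose newest data exists only in the cache

    def read(self, block):
        if block not in self.cache:
            # Promotion copies the block; the backing store keeps its own copy.
            self.cache[block] = self.backing[block]
        return self.cache[block]

    def write(self, block, data):
        self.cache[block] = data
        if self.write_back:
            self.dirty.add(block)
        else:
            self.backing[block] = data  # write-through: disk is updated immediately

    def flush(self):
        for block in self.dirty:
            self.backing[block] = self.cache[block]
        self.dirty.clear()

    def lose_cache(self):
        """Simulate the cache devices failing; return blocks whose latest data is gone."""
        lost = sorted(self.dirty)
        self.cache.clear()
        self.dirty.clear()
        return lost


disk = {"blk7": "old"}
wb = ToyCache(disk, write_back=True)
wb.write("blk7", "new")
print(wb.lose_cache(), disk)  # the unflushed write is lost; disk still holds "old"
```

In this model a write-through cache can fail without losing anything, while a write-back cache loses every update that had not yet been flushed. Whether and when FAST Cache held unflushed writes on the failed SSDs is exactly the open question in the comment.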
Tuesday 17th January 2012 09:13 GMT poladark

Tieto can't even spell right.

The title of the chart, "Verklig Prestanda Jämnförelse", is not even correct Swedish. It should be spelled "Verklig Prestandajämförelse". Just to give people an inkling of the competence level of this company when they can't even formulate correct sentences with real words...

Their technical skills are evidently a good match for their spelling abilities.

0 0

The Register Biting the hand that feeds IT


Copyright. All rights reserved © 1998–2024