Flash drive meltdown fingered in Swedish IT blackout

Tieto's five-day outage disaster started with multiple failures of its EMC VNX5700 array's FAST Cache, according to a Finnish source close to the matter. Tieto is a major IT services organisation across Scandinavia and the Nordic region – although it also provides services globally – and pulls in net sales of SEK17bn (£1.59bn …

COMMENTS

This topic is closed for new posts.
  1. Anonymous Coward

    so

    I wonder why they had 4 SSDs for FAST Cache? I think they are RAID 1, so that means they didn't bother with a spare. I know SSDs are expensive, but come on!!

    1. Destroy All Monsters Silver badge

      Could it be the SSD was running in a RAID 1 and both sides went away at the same time?

      Relevant: http://www.codinghorror.com/blog/2011/05/the-hot-crazy-solid-state-drive-scale.html

      1. Mike Schwab
        Megaphone

        Write count exceeded?

        Did they check DAILY how many bad sectors they had on the SSD flash memory? SSDs have a limited number of writes, and when the number of bad blocks starts to rise you have to replace them.

    2. PeteA
      Facepalm

      Title is optional

      RAID 1 is a mirror, i.e. every drive has a spare so 'N+N' redundancy characteristics, which is rather better than RAID 5 (N+1). You're confusing it with RAID 0...
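
A rough sketch of the redundancy arithmetic being argued over here, assuming the four FAST Cache SSDs were arranged as two mirrored pairs (an assumption; the article does not confirm the layout). The names `RaidProfile` and `raid_profile` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RaidProfile:
    usable_drives: int        # drives' worth of usable capacity
    guaranteed_failures: int  # failures survived in the worst case
    best_case_failures: int   # failures survived if they land favourably


def raid_profile(level: str, drives: int) -> RaidProfile:
    """Return capacity and failure tolerance for a simple RAID layout."""
    if level == "raid0":
        # Striping only: any single failure loses the set.
        return RaidProfile(drives, 0, 0)
    if level == "raid1":
        # Assumption: modelled as independent two-way mirrored pairs.
        if drives < 2 or drives % 2:
            raise ValueError("mirrored pairs need an even number of drives")
        pairs = drives // 2
        # Worst case: the second failure hits the partner of the first.
        return RaidProfile(pairs, 1, pairs)
    if level == "raid5":
        if drives < 3:
            raise ValueError("RAID 5 needs at least 3 drives")
        return RaidProfile(drives - 1, 1, 1)
    if level == "raid6":
        if drives < 4:
            raise ValueError("RAID 6 needs at least 4 drives")
        return RaidProfile(drives - 2, 2, 2)
    raise ValueError(f"unknown level: {level}")


if __name__ == "__main__":
    for level in ("raid0", "raid1", "raid5", "raid6"):
        print(level, raid_profile(level, 4))
```

With four drives in mirrored pairs, one failure is always survivable, but a second is survivable only if it lands in the other pair. That is the "both sides went away at the same time" case raised above.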

  2. TeeCee Gold badge

    "EMC won't comment on any details..."

    Hardly surprising with the smell of lawsuits wafting on the breeze.

    Best you can hope for, once their PR and legal lads agree on the wording, is a few words that actually say nothing in an astonishingly vague way, padded with some background puff lifted verbatim from the product literature.

  3. Wanda Lust

    Pooled disks

    As has been pointed out in the article, the service solution appears to have been inadequate to mitigate the risks of running such a workload on one array and failed to consider the recovery times resulting from possible failures.

    If the system wasn't replicated then prudence would have suggested that RAID6 was a better protection method for the pool. Even if it was replicated, RAID6 is a good idea given the time required to move 450, 600 or more GBytes around. Poor prudence rarely has anyone's ear.

  4. Terafirma-NZ

    I can believe it. We just had a FAST Cache problem, along with some others, due to the previous build of FLARE OE. EMC pushed out a new build quite quickly and would not talk about anything until we were on it.

  5. steogede

    DR, what DR?

    >> What needs to be stressed is that Tieto's DR processes were dreadfully inadequate and obviously untested for the eventuality of such a failure. Lawsuits over data loss and business interruptions at Tieto's affected customers are bound to follow. ®

    I suspect that they are probably better at writing disclaimers than they are at developing DR plans.

  6. Andy Moreton
    Thumb Down

    Test your DRP

    " inadequate disaster recovery plan involving Networker tape backup files which could not be read"

    If you haven't tested your disaster recovery plan, you don't have a disaster recovery plan.

    1. Field Marshal Von Krakenfart

      As one old systems programmer said to me a very long time ago:-

      "F*ck going forward, as long as we can go backwards we'll be OK". It's a variation on: data you haven't saved in at least two locations hasn't been saved at all!

    2. Bronek Kozicki
      Thumb Up

      you mean ...

      ... there is something wrong with this http://www.youtube.com/watch?v=K-qY_b8lo-k ?

    3. circusmole
      Unhappy

      I am sick and tired...

      ...of telling my customers exactly this "If you haven't tested your disaster recovery plan, you don't have a disaster recovery plan." or "If you have not realistically and regularly fully tested your DR plan, I promise you IT WILL NOT WORK.".

  7. Hoosier Storage Guy
    FAIL

    I thought FAST Cache was just cache

    What's a little concerning about this to me, and some of the other rumors about failed SSD drives in FAST Cache causing big problems, is that FAST Cache is supposed to be a "cache". When hot blocks get promoted into FAST Cache, the EMC folk I spoke with in the past said the data was copied, not moved. That would cover the reads. As far as writes go, new updates to that block were supposed to get flushed down to spinning disk just as regular cache works. Your primary copy of the data wasn't supposed to be living in FAST Cache and susceptible to data loss. Additionally, if a FAST Cache drive failed and there was no hot spare (which appears to be the case here), FAST Cache was immediately supposed to go into read-only mode.
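
A toy model of the behaviour described in this comment, purely as a sketch: it is not EMC's implementation, and `ToyCache` and its method names are invented for illustration. It assumes promotion copies a block, writes to cached blocks are held until flushed, and a drive failure flushes and then switches the cache to read-only.

```python
class ToyCache:
    """Illustrative write-back cache in front of a slower backing store."""

    def __init__(self, backing: dict):
        self.backing = backing   # stands in for the spinning disks
        self.cached = {}         # block id -> cached copy of the data
        self.dirty = set()       # cached blocks not yet flushed to backing
        self.read_only = False

    def promote(self, block):
        # Promotion copies the block; the backing store keeps its own copy.
        self.cached[block] = self.backing[block]

    def read(self, block):
        if block in self.cached:
            return self.cached[block]
        return self.backing[block]

    def write(self, block, data):
        if self.read_only or block not in self.cached:
            # Write straight to the backing store, keeping any cached copy coherent.
            self.backing[block] = data
            if block in self.cached:
                self.cached[block] = data
            return
        # Write-back path: the newest data lives only in the cache for now.
        self.cached[block] = data
        self.dirty.add(block)

    def flush(self):
        for block in sorted(self.dirty):
            self.backing[block] = self.cached[block]
        self.dirty.clear()

    def on_drive_failure(self):
        # Assumed reaction: flush what is dirty, then stop absorbing writes.
        self.flush()
        self.read_only = True


if __name__ == "__main__":
    disk = {"a": 1, "b": 2}
    cache = ToyCache(disk)
    cache.promote("a")
    cache.write("a", 10)
    print(disk["a"], cache.read("a"))  # backing copy is stale until flushed
    cache.on_drive_failure()
    print(disk["a"], cache.read_only)
```

Even in this toy version, a dirty block exists only in the cache between the write and the flush. If every copy of the cache were lost inside that window, the flush in `on_drive_failure` would have nothing left to write out.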

  8. poladark

    Tieto can't even spell right.

    The title of the chart, "Verklig Prestanda Jämnförelse", is not even correct Swedish. It should be spelled "Verklig Prestandajämförelse". Just to give people an inkling of the competence level of this company when they can't even formulate correct sentences with real words...

    Their technical skills are evidently a good match for their spelling abilities.
