Cloudy biz Vesk suffers 2-day outage – then boasts of 100% uptime

A failed storage controller caused a protracted outage at hosted desktop and cloud slinger Vesk - not that this factoid has made its way onto the company’s website, where it boasts of 100 per cent uptime for the past 1,583 days. Vesk, acquired by London-listed Nasstar plc in October, has written to customers in a bid to …

  1. Doctor Syntax Silver badge

    Is the Trade Descriptions Act still a thing?

    1. This post has been deleted by its author

  2. Missing Semicolon Silver badge
    FAIL

    One failed drive?

    .. and it took the whole storage system out?

    What kind of redundancy is that?

    1. 's water music

      Re: One failed drive?

      .. and it took the whole storage system out?

      What kind of redundancy is that?

      The kind delivered in an envelope to the architects' team?

    2. Throatwarbler Mangrove Silver badge
      FAIL

      Re: One failed drive?

      Unfortunately, I have seen this behavior on a major storage vendor's product. In that case, I believe the supported volume was RAID 0, and the controller panicked as a result of the volume going offline, not because of the disk failure, per se. It's still bad design: a non-root volume going offline should not take out the controller.

  3. Androgynous Cow Herd

    Evaluating the storage controllers?

    I think they need to evaluate the whole platform. From the description, a RAID degradation triggered a really bad failover event that included a whole lot of LUN trespass. That's bad, really bad. Did they build that steaming turd of a storage platform in house, or just not follow their manufacturer's best practices?

  4. DaLo
    Alert

    Most of that doesn't make sense...

    "Specifically, a failed hard disk in the Storage Access Network caused a “panic event” on the primary controller that triggered a failover between two storage controllers.

    The storage fail led to a “split brain event” and subsequent “levels of corruption within each virtual desk as they were being served by independent controllers,” Vesk said in the letter."

    A failed disk is not a failed storage controller. Assuming RAID 10, the system should have carried on merrily with a priority warning as it rebuilt onto its hot spare. Failing that, it should have alerted an engineer to replace the disk.

    If they chose to fail over on that event, it would have required the cooperation of both storage systems (the active one deciding to initiate a failover and the passive one agreeing). That doesn't result in a split brain, as both systems are on board with the process.

    A split brain would occur when the network link (heartbeat) is broken and there is no machine that is acting as Quorum.
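    To illustrate, quorum arbitration boils down to "only serve I/O while you can see a strict majority of votes", which is exactly what stops both controllers staying active when the heartbeat dies. A minimal sketch (made-up names, nothing to do with Vesk's actual kit or any particular vendor's HA stack):

    ```python
    # Illustrative only: majority-quorum arbitration for a two-controller
    # cluster plus a witness node. Hypothetical code, not any vendor's product.

    class Node:
        def __init__(self, name, reachable=True):
            self.name = name
            self.reachable = reachable   # can this node currently be heartbeated?

    def votes_seen(peers):
        """Votes visible from the local node: itself plus every reachable peer."""
        return 1 + sum(1 for p in peers if p.reachable)

    def role(peers, total_votes=3):
        """Serve I/O only while holding a strict majority of votes; otherwise
        stand down. Standing down on heartbeat loss is what avoids a split brain."""
        return "ACTIVE" if votes_seen(peers) * 2 > total_votes else "STANDBY"

    # Heartbeat to the other controller is cut, but the witness is still visible:
    print(role([Node("controller-b", reachable=False), Node("witness")]))  # ACTIVE
    # Peer and witness both unreachable -> no quorum, stop serving:
    print(role([Node("controller-b", False), Node("witness", False)]))     # STANDBY
    ```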

    A more likely scenario is that a RAID rebuild (probably RAID 5) after the failed disk caused a slowdown, which caused a panic situation *in an engineer*, who decided to fail over and tried to do it by pulling out the heartbeat cable or restarting the active server (possibly even by pressing the power button). That caused a split brain, which led to the data corruption.

    The DR site hadn't been tested for some time, so they didn't realise that, for any number of reasons, the new DR VMs wouldn't boot. The fact that it was SharePoint and Exchange almost certainly means the DBs were not in a clean shutdown state, as is often the case when there has been some DB corruption and/or the backups were taken from a snapshot. That forces the DB into a soft repair, and if it can't replay all the logs in a consistent state it means either a recovery from backup (which might take a long time) or a hard repair (which will take even longer).
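    The decision tree there is roughly the sketch below (hypothetical function and parameter names, not Exchange or SharePoint tooling, just the logic spelled out):

    ```python
    # Hypothetical sketch of the recovery decision described above; the names
    # are made up and this is not Exchange or SharePoint tooling.

    def recovery_path(clean_shutdown, logs_consistent, backup_available):
        """Pick the least painful recovery route for a database copy."""
        if clean_shutdown:
            return "mount as-is"                  # nothing to replay
        if logs_consistent:
            return "soft recovery (replay logs)"  # slower, but no data loss
        if backup_available:
            return "restore from backup"          # might take a long time
        return "hard repair"                      # takes even longer, may lose data

    print(recovery_path(clean_shutdown=False, logs_consistent=False,
                        backup_available=True))   # -> restore from backup
    ```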

    The real fun part starts now: the plan to switch back to the main data centre. That might not be easy, seeing as they appear to have had issues with their redundancy and business continuity plans so far.

    Expect another outage any day now.

    P.S. One thing I have seen a fair bit is that redundancy and DR efforts can often exacerbate issues that would otherwise have been quite minor; with a single server you would not have had the issue at all. It is sometimes the process of 'failing over' that causes more problems than the original fault.

    1. DaLo
      FAIL

      Re: Most of that doesn't make sense...

      Their SLA makes interesting reading. Read it and think about it next time a supplier boasts about how great they are: http://www.vesk.com/our-datacentres/the-vesk-sla/

    2. Anonymous Coward
      Anonymous Coward

      Re: Most of that doesn't make sense...

      From their language, it sounds a lot more like a GlusterFS cluster blowing up than RAID. Bad architecture on their part.

  5. Anonymous Coward
    Anonymous Coward

    Vesk runs on Dell garbage

    Good luck

  6. Slabfondler
    FAIL

    One might think their "About Us" page requires a minor update.
