Oracle storage analytics break Oracle storage appliances

An Oracle employee has warned that the analytics features of its ZFS storage appliances can result in “unresponsive” systems. The post linked to above opens with Oracle staffer Matt Barnson stating “I've received a number of questions about analytics and the problems they cause for the Oracle ZFS Storage Appliance.” There's …

  1. iOS6 user

    So it looks like we have here something we could call the "quantum effect of monitoring", where the monitoring/instrumentation affects the performance of the monitored object.

    "Nihil novi sub sole" .. you always must know what you are doing.

    1. Stevie

      Re: know what you're doing

      That presupposes Oracle ever wrote such information down and is willing to show you what they wrote.

      Neither of these is a given.

      1. mbarnson

        Re: know what you're doing

        Here ya go. Or you can Google "Oracle ZFS Analytics" and come up with the same result.

        http://docs.oracle.com/cd/E56047_01/html/E56082/goywy.html#scrolltoc

    2. mbarnson

      You nailed it. Analytics are incredibly powerful, but if you enable the hidden "Advanced Analytics" feature, you should know what you're doing. The help and Analytics Guide both cover this in great detail, despite Simon's allegation to the contrary.

  2. SJG

    I've encountered a few problems along the way where adding some level of monitoring made a particular problem go away. IIRC, the first time I saw this was on an ICL mainframe, in their Quickbuild Application Master 4GL, where just referencing a variable altered the namespace so that the correct variable was subsequently picked up. Probably sometime around 1990.

    1. Stevie

      Re:

      The classic example of that is the Unisys ASCII COBOL problem that "goes away" when "MONITOR ALL" is used, but knowledgeable types just tell the Expert Programmer to use extra option 7 on the compiler so they will get an error when their "logic" zooms off the end of a table. The monitor code simply pads out everything so the table pointer is still within bounds (but wrong).

    2. mbarnson

      Well, what I wrote about in the linked blog isn't quite the same thing. Basically, DTrace can open "probes" in the Solaris kernel to give you insight into what's going on. The fundamental problem is that if users enable "Advanced Analytics", they can end up creating probes with thousands (or more) of insertion points.

      Imagine opening an analytic to evaluate accesses by file name on a filesystem with billions of files that change at some interval. That this is allowed at all on a storage appliance is excellent (I use it all the time!), but if it's left running, tracking all changes to all files forever, it will cause performance problems.
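
      To get a feel for the cardinality problem, here's a rough sketch in DTrace's D language. It's a generic per-file-path aggregation using the syscall provider, purely illustrative and not the appliance's actual Analytics instrumentation:

        /* perfile.d: hypothetical illustration, not appliance code.
         * Count read(2) calls per file path. The aggregation gains one
         * entry per distinct path seen, so its kernel memory footprint
         * grows with the number of files touched: the cardinality
         * explosion described above. */
        syscall::read:entry
        {
                @reads[fds[arg0].fi_pathname] = count();
        }

      Every distinct key costs kernel memory, and draining the aggregation for display costs CPU; leave something like this running against billions of files and you get exactly the kind of "unresponsive" behaviour the article describes.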

      Disclaimer: My opinions are not necessarily those of Oracle.

  3. Anonymous Coward

    We found the whole "ZFS appliance" thing was a disaster, with both fail-over reliability and fail-over times pretty miserable. Sad, really, as Sun started from a few good basics but managed to write an absolute crap management layer on top of certain Frankenstein variants of production ZFS & Solaris.

    Still, we are looking forward to not being on Oracle support just as soon as we can get the OK from above to buy a replacement storage system.

    1. Anonymous Coward

      Snap. Amazing this advice is on a blog, though typical of Oracle at the same time.

      The first time we updated it, we had to pull the plug and roll back. I've lost count of the number of times a disk has "failed" when the disk is actually OK: the appliance just thinks it has failed. Support say "reboot to resolve", which is easier said than done.

      Too full of big scary bugs; thankfully it will be ditched in the coming months.

      1. mbarnson

        "it''s failed because the appliance thinks it has and is actually ok."

        Huh. That's an interesting description. I deal with hundreds of ZFS appliances every day, and the fact that our appliances find disks with high error rates long before SMART can, because we checksum every read and write (rather than relying on a simple CRC), is IMHO a really helpful feature in a mission-critical storage system.

        Disclaimer: My opinions do not necessarily reflect those of Oracle.
