Oracle 7400 storage drags down cloud storage firm

Utility IT service supplier Flexiant has blamed intermittently slow service over the past two weeks on slow disk access on an Oracle ZFS-based 7400 storage array. Communication between virtual machines on the servers and the 7400 storage has been slow, and it is said that some customers' servers have effectively …

COMMENTS

This topic is closed for new posts.
  1. Anonymous Coward

Run, Forrest, run .....

This story basically confirms that "two weeks of downtime" was unheard of in this industry until the Oracle/Sun 7000 series came along.

Personally, I believe this claim, because I think ZFS is still not mature enough, and we experienced a specific issue on a 7410C that was down for three days just trying to delete a dedup-enabled LUN. Yes, you heard me right: a single click to delete a LUN could bring the 7410 cluster down for three days. That is why I believe it could cause two weeks of downtime for another end user....

I guess the end result will again point to the ZFS module (akd)...... then another software upgrade, with new MAJOR bugs waiting to be found....

And then the sales team will tell you that you should never expect 99.8% uptime from this kind of product; you should set your expectations somewhere near five eights (88.888%) instead of five nines....

By the way, we have since moved off the 7410C and onto NetApp.

    1. Anonymous Coward
      Unhappy

      akd != ZFS

While I agree that the 7410 has had a lot of serious suckiness, and almost all of it is down to the akd management system being just dumb in design/implementation, that is not a ZFS issue.

ZFS is the file system, and while it may be the underlying reason for your downtime when deleting a LUN, far more likely it was a fu*k-up with akd. Again.

    2. Paul Johnston
      Unhappy

      Dedup?

      That is cutting edge!

Not sure I'd trust ZFS to that degree just yet, but it is very nice technology.

      See what it's like in a year.

  2. Andy Nash
    FAIL

    Can't believe they are still downplaying the problem.

Perhaps Mr Bligh is not kept informed by his staff, or doesn't spend enough time at the coal face?

I am just about connected to a single-CPU Debian server on light duties, currently running with a load average of 154 @ 16:20. Other than that session, it's been inaccessible since this morning, and I have not seen the server running at a sane load average for more than an hour or two since this problem began. In between, the load tends to sit over 2, and frequently rises to 3-6.

It is this high load average, frequently causing actual outages lasting several hours (not one hour as they claim), multiple times a day, which leads me to say confidently that the platform has been essentially unusable since I reported the issue and, as near as I can tell, since two weeks ago when the problem first emerged (in between, I was assuming I had a problem on my own server).

Last night the server was so badly affected that files were disappearing before my eyes, which was pretty scary (though fortunately it turned out they were still in storage). That culminated in the server locking up, requiring a kill which has corrupted our database. It hasn't been up long enough since to attempt a repair.

Even trying to move my data elsewhere has proved impossible, as the server doesn't stay up long enough to transfer the data. We have a backup, but that's on the same platform, as it's primarily still a development server.

    I've been told moving my server to different hardware would not make any difference, so presumably this affects all customers.

I reported this over a week and a half ago and have been chasing continuously since. By the way, Mr Bligh - feel free to call me if there is any confusion over that - ticket #9793.

The whole time, I've felt they have tried to spin/downplay the size and impact of the problem, and that appears to be the option they have chosen in responding to this article. I am amazed that in this day and age a CEO will come out with nonsense like this without even checking the facts properly.

    [stop press - load is now at 187 and rising @ 16:53.]
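[For readers wanting to reproduce the figures quoted above: the load averages come straight from the kernel, and a tiny script can flag when they go bad. A minimal sketch, Linux-specific (it reads /proc/loadavg); the threshold of 2 mirrors the level the comment treats as the edge of "sane" and is purely illustrative:]

```shell
#!/bin/sh
# Minimal load-average watchdog (Linux: reads /proc/loadavg).
# THRESHOLD=2 echoes the "tends over 2" figure above; adjust to taste.
THRESHOLD=2
LOAD=$(cut -d' ' -f1 /proc/loadavg)   # 1-minute load average
# Shell arithmetic is integer-only, so compare the floats with awk.
if awk -v l="$LOAD" -v t="$THRESHOLD" 'BEGIN { exit !(l > t) }'; then
  echo "WARNING: 1-minute load average $LOAD exceeds $THRESHOLD"
else
  echo "OK: 1-minute load average $LOAD"
fi
```

[Drop it in cron every minute and you get a timestamped record of exactly the sort of spikes described above, which is handy ammunition when a provider disputes the duration of an outage.]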

  3. Max 6
    Boffin

    ZFS is fine as long as you don't use RAIDZ

We've seen this for about three years now, and still no fix in sight. Basically, the minute you turn on RAIDZ you can throw predictable latency out the window.

  4. Jacqui

    Oracle "support"?

Hmm, when Oracle on a certified, supported/stable Linux platform went belly up once a day (a memory leak in Oracle itself), Oracle's fix (after two weeks of support hell) was to "reboot at midnight". They then closed the support ticket!

Perhaps they should include "it's not our fault" in their logo! It seems downtime is never due to Oracle being a buggy, memory-leaking mess.

These days we use PostgreSQL - it has its problems, but at least if something goes wrong we can delve into the source (and usually sort it ourselves) or hire a Pg internals specialist if need be. Oracle support is, and always will be, a joke.

  5. Paul Johnston
    FAIL

    Quick get the Fishworks team to look at it!

    Oh they left didn't they?

  6. Anonymous Coward
    Megaphone

    7410 and iSCSI do not mix well

From the specs on their website: "These provide iSCSI LUNs to the end-user operating systems, looking just like physical disks".

In our experience, the 7410's ZFS external slog on SSD (Logzilla), which is designed to accelerate and quickly acknowledge sync writes, quickly gets overrun by iSCSI write traffic and then spends all of its time flushing to disk, slowing everything down.

Tread carefully with these devices, and think hard about which parts of the business you are willing to bet on them.
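[For anyone trying to confirm the behaviour described above on their own kit, two stock ZFS commands are the usual starting point. The pool and dataset names here are made up for illustration:]

```shell
# Watch per-vdev throughput at 5-second intervals; sustained write
# activity on the dedicated log vdev is the saturation symptom described
# above (pool name "pool1" is illustrative):
zpool iostat -v pool1 5

# Check whether the dataset backing a LUN forces synchronous writes --
# sync writes are what land on the slog in the first place
# ("pool1/lun0" is illustrative):
zfs get sync pool1/lun0
```

[If the log vdev shows heavy, continuous writes while client latency climbs, the slog is the bottleneck rather than the data disks.]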
