80 days... is that all?
You've got to wonder what sort of testing has been done on an enterprise product that reliably crashes after 80 days uptime.
We've been hearing about a bug in EMC's VNX2 software which causes it to reboot every 80 days. Tweets like this one have been appearing like canaries in a coal mine: "@TheJasonNash I just bought a Vnx28000 and expect another one coming soon... Anyways, looking for info on the vnx reboot issue? New to EMC" — Nathan Daggett (@ …
Of course it sounds pretty funny... you have to reset to avoid unexpected resets...
But I can't believe EMC tested the new-generation VNX for less than 80 days. It sounds like a very familiar bug, annoyingly common in enterprise systems today: an unexpected reboot after a fixed time frame. I had the same problem with Brocade switches, and I believe all those bugs have a common root.
Believing today's enterprise systems are buggier than yesterday's is for very young systems engineers... I could write a book about this kind of issue from my career, which started many years ago.
I'd hope EMC doesn't still have issues like this on the first-generation VNX, but regardless, the linked Bull support page specifically mentions VNX2 as the affected platform.
"URGENT : Unexpected SP panic on new VNX2 every 80 days"
http://support.bull.com/ols/product/storage/disk/emc-vnx
It sounds from the article like they've now patched it? The workaround is effectively to reboot it before it reboots itself, since a reboot resets the counter. Note the 30-minute offset between the two reboots: if you don't manage to apply the patch in time, at least the controllers will then reboot 30 minutes apart rather than together.
Not very enterprise, though, having to lose half your processing power just to reset a ticking time bomb (the counter).
If you're stuck with one of these things, it's probably a good idea to time those reboots for when the SPs are running at under 50% load. Only high school football teams can perform at 110%... storage arrays cannot. And I agree with the AC above: is the problem the zillions of lines of 24-year-old CLARiiON code, or the 64-bit parts that were strapped onto it so that marketing could call it 'VNX2'?
Here's the cause, from the horse's mouth:
Cause
A logic error in 64-bit math causes a timer overflow within each Storage Processor, resulting in a first stage WatchDog [WDT] panic (which results in an NMI).
Software periodically requests the number of microseconds since boot. This information is continuously fed to another component as an indication that the VNX OE software is not hung; this aids in protection against starvation. When the overflow is encountered (every ~80 days), it disrupts the software that is aimed at identifying both software and hardware hangs. The result is that the VNX OE software believes the Storage Processor is hung, resulting in the (deliberate) WatchDog panic.
Avamar had an issue like this, but related more to its Linux OS. You pretty much had to do a reboot each year until the appropriate patches were installed. A few other EMC products, also running under Linux, had the same issue.
Luckily for EMC, they knew which systems and customers were affected, and were proactive about the fix.
This one, though, is a bit more extreme at only 80 days. It seems like QA of late is taking a dump here. Combined with the Avamar/Data Domain issues, what a pain in the arse.