Should you use flash solid state drive (SSD) storage as a pretend hard disk drive or as a cache attached to a server’s main bus? These are two approaches that have emerged about using flash in large scale storage applications. EMC, with the help of STEC, says to use drop-in Fibre-Channel-attached SSDs, which function like very, …
Err... missing the point
Using Flash in drives behind "slow" and "inefficient" protocols reduces drastically the time to market of a flash-enabled system. High end SAN/NAS software can already use tiered disk layouts where a set of "fast" disks is used as a front-end cache. Replacing these with flash ones requires virtually zero development and can be done today.
Compared to that, placing NAND on the PCI-e bus requires writing "proper" storage drivers and integrating them into the underlying OS. This is a rather expensive and time-consuming process, especially for a specialised OS. In addition to that, the benefit actually depends on the way the OS deals with storage in the first place. If the OS is specialised and built around the abstractions and concepts used in "spinning" disks, there is very little benefit in using a different access interface to new "superior" storage.
This is all besides the fact that "slow" and "inefficient" protocols are not that slow and inefficient. FC is not that far off in terms of access speed from on-the-bus flash.
RE: Anton Ivanov
Agreed. And then the reason many people look at "fast disk" is because they don't have enough local memory, which is faster than a PCI-e-attached flash device could be. For example, we put as many hot tables from our Oracle databases in system memory as we can, which means lots of expensive memory, but it is easy to implement. If the PCI-e device acted like a disk, like the old Giga-Byte iRAM PCI cards did, then that would also be easy to implement. If a PCI-e-based device came out tomorrow which didn't behave like a disk it would mean a disruptive six-month round of testing before acceptance and probably mean a new version of OS and apps to support it. Far simpler to just add more system RAM, or use a SAN-attached SSD array like the Texas Memory Systems RamSan devices (which are basically the EMC idea but we add them to existing SAN designs without having to upgrade the main arrays).
Your take on IBM Quicksilver
What's your view on IBM's Project Quicksilver?
You are absolutely correct, putting NAND into disk drive form factors and protocols is just a short term, time to market thing, and integrating NAND correctly involves more work in building OS specific software drivers.
One interesting thing, though, is that when OS's were originally built the relative performance of disks to CPU was such that things like virtual memory swap and 512 byte sector sizes actually made sense, and worked well.
Luckily, these vestiges of the good-ole-days (when disks weren't left so far behind by Moore's law) are still around, because they actually make sense again now.
With NAND SSD's, the relative performance to CPU's is pretty much where it was 20-25 years ago with HDD's. This means demand paging is actually viable - without modifying OS's. This makes for terabyte-scale virtual memory systems that perform amazingly well. After all, demand fetching 100,000 pages per second is a whole lot better than getting just 200. Now you can really use that 64 bit addressing!
512 byte IO makes sense when one can get 160,000 of those per second, vs. just 100,000 4K packets. If only 512 bytes are needed, that's a 60% performance benefit (thank goodness converting OS's and FS's to 4K sectors has proved too difficult - it would just need to be reversed now).
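The arithmetic behind those two claims can be checked with a rough sketch. It uses only the figures quoted above (200 vs. 100,000 page fetches per second, 160,000 512-byte IOPS vs. 100,000 4K IOPS); the function names are made up for illustration, and "4K" is assumed to mean 4096 bytes.

```python
# Figures quoted in the comment above; everything else is derived arithmetic.
HDD_PAGE_FETCHES_PER_S = 200
SSD_PAGE_FETCHES_PER_S = 100_000
SMALL_IO_BYTES = 512
LARGE_IO_BYTES = 4096            # assumption: "4K" means 4096 bytes
SMALL_IOPS = 160_000
LARGE_IOPS = 100_000


def iops_gain(small_iops, large_iops):
    """Fractional gain in requests per second from the smaller I/O size."""
    return small_iops / large_iops - 1.0


def bandwidth_mb_s(iops, io_bytes):
    """Bytes moved per second, expressed in decimal megabytes."""
    return iops * io_bytes / 1_000_000


def paging_speedup(ssd_fetches, hdd_fetches):
    """How many times more page fetches per second the SSD figure allows."""
    return ssd_fetches / hdd_fetches


if __name__ == "__main__":
    print(f"demand-paging speedup: {paging_speedup(SSD_PAGE_FETCHES_PER_S, HDD_PAGE_FETCHES_PER_S):.0f}x")
    print(f"request-rate gain at 512 B: {iops_gain(SMALL_IOPS, LARGE_IOPS):.0%}")
    print(f"512 B traffic: {bandwidth_mb_s(SMALL_IOPS, SMALL_IO_BYTES):.2f} MB/s")
    print(f"4K traffic:    {bandwidth_mb_s(LARGE_IOPS, LARGE_IO_BYTES):.2f} MB/s")
```

Note that the 60% figure is a gain in requests served, not in bytes moved: the 512-byte workload transfers far less data per second, which is the point when only 512 bytes are needed.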
In other words, you'd be surprised how well things just work when you have 100,000 4K read IOPS, 70,000 4K write IOPS and 800 MB/s bandwidth to spare - and from each drive, thanks to near linear scaling.
It's being directly on the PCIe bus that makes scaling efficient. If one SSD can saturate the fastest RAID controller out there, you'd have to put in a RAID controller per SSD. That's in essence what we've done by integrating the RAID controller with the SSD and putting the whole thing right where the RAID controller would have gone.
Net result, 6 GBytes per second of bandwidth with just 8 ioDrives in a server, and more IOPS than the CPU's can handle - for under $18K (for a total of 640GBytes of capacity, 2.5TB with the larger modules).
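As a back-of-the-envelope check on those numbers, here is a small sketch. It assumes the 800 MB/s quoted earlier is the per-drive bandwidth and treats "under $18K" as an upper bound; the function names are illustrative only.

```python
# Figures quoted in the comment; the per-drive reading of 800 MB/s is an assumption.
DRIVES = 8
AGGREGATE_GB_S = 6.0
PER_DRIVE_MB_S = 800
TOTAL_COST_USD = 18_000          # upper bound ("under $18K")
TOTAL_CAPACITY_GB = 640


def scaling_efficiency(aggregate_gb_s, drives, per_drive_mb_s):
    """Measured aggregate bandwidth as a fraction of perfect linear scaling."""
    ideal_gb_s = drives * per_drive_mb_s / 1000
    return aggregate_gb_s / ideal_gb_s


def cost_per_gb(total_cost_usd, capacity_gb):
    """Upper bound on dollars per gigabyte of capacity."""
    return total_cost_usd / capacity_gb


if __name__ == "__main__":
    eff = scaling_efficiency(AGGREGATE_GB_S, DRIVES, PER_DRIVE_MB_S)
    print(f"scaling efficiency: {eff:.1%}")
    print(f"capacity per drive: {TOTAL_CAPACITY_GB / DRIVES:.0f} GB")
    print(f"cost per GB (upper bound): ${cost_per_gb(TOTAL_COST_USD, TOTAL_CAPACITY_GB):.2f}")
```

Under those assumptions, eight drives deliver a little under 94% of ideal linear scaling, which is consistent with the "near linear" claim.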
6 GBytes per second is a significant fraction of the raw memory bandwidth. You won't get anything close to this from a DRAM appliance or disk array accessed through a glass straw. If all you want is the performance you'd get from one of those, then just one ioDrive in place of the FibreChannel adapter will do.
And, no external boxes to manage; it's just another disk drive to the OS, just faster.
But it's not quite that simple
Without getting into the interface battles (PCI vs. FC/SAS/IB/etc), consider a few things:
PCI/x (et al) is limited in the number of devices it can support and the distances they can be from one another, while FC (et al) are far easier to extend over distance and can support far more addressable devices.
Performance-sensitive datasets frequently exceed the size of even a few disk drives. Given the cost of NAND in the short term, the most economical use could well be as an L3 cache instead of a tier 0, at least for some applications.
There are scarce few applications whose access density is hundreds of thousands of IOPS against a few hundred GB of data - more typically high-IOP applications require far more storage capacity (if they didn't, they'd just load it all into SDRAM).
As dynamic cache, NAND isn't significantly more dense than DDRx SDRAM, while it is significantly slower (particularly for writes). And while cheaper per GB, the $ per IO/response time isn't that different from SDRAM either.
The default I/O size for most database engines is 8KBytes - locality of reference studies show that 8K is better than 4K for caching (you basically get the extra 4K for "free"). Other studies have shown that modern databases perform better with 128KB read granularity. Smaller isn't necessarily better, unless you're truly operating at memory bus speeds - NAND is too slow for that.
As permanent storage, NAND needs added layers of data integrity protection, and not just within the device itself - mirroring or RAID is required to protect against the inevitable device failure. This adds a not insignificant amount of performance-sapping overhead - it is doubtful that the Quicksilver configuration would have come anywhere close to 1M IOPS had error detection, correction and prevention been included.
Although counter-intuitive, customers frequently find that applications perform better with external SDRAM SSDs and even large-cached storage arrays than they do with internal caches OF THE SAME SIZE - there's a significant advantage in cache algorithms that are disk-aware (seek & rotational positioning, etc.)
And the most important consideration is this - the folks arguing for "in the server" either build servers or build PCI-bus products; the ones arguing for "in the storage" build arrays or drive-form-factor devices. But both servers and storage today use SDRAM and spinning rust - there's no reason that NAND (or any other technology) has to be an Either-Or discussion.
The reality will be both will thrive, and most likely within the same I/O path. The debate is truly pointless!
- the storage anarchist
Oh, and I almost forgot
David, you're going to have to help me understand why I need a device that can deliver more IOPS than the CPU can handle?
When I see that, immediately I start thinking - what I REALLY need is the ability to share all these IOPS across multiple different CPUs and applications, instead of wasting them buried inside a single server.
Answer: put all the IOPS in an external storage array and amortize the cost and performance across multiple applications!
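To make the amortization argument concrete, here is a minimal sketch of splitting the cost of a shared IOPS pool across applications in proportion to their demand. Every number and application name in it is hypothetical and chosen only to show the arithmetic; none of it comes from a real array or price list.

```python
def amortize(pool_iops, pool_cost_usd, demands):
    """Split the cost of a shared IOPS pool in proportion to each app's demand.

    demands maps an application name to the IOPS it needs.
    Returns (fits, shares): whether the pool covers total demand, and
    each application's share of the cost in dollars.
    """
    total_demand = sum(demands.values())
    fits = total_demand <= pool_iops
    shares = {app: pool_cost_usd * iops / total_demand
              for app, iops in demands.items()}
    return fits, shares


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    demands = {"oltp": 150_000, "reporting": 50_000, "mail": 50_000}
    fits, shares = amortize(pool_iops=400_000, pool_cost_usd=100_000, demands=demands)
    print("pool covers demand:", fits)
    for app, cost in shares.items():
        print(f"{app}: ${cost:,.0f}")
```

Proportional cost sharing is only one possible policy; the sketch says nothing about the latency the extra network hop adds, which is the other side of the argument in this thread.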