make ; make install: Contemplating cost-efficient, highly reliable storage

I have long been a fan of NetApp's storage products; my office purchased its first NetApp filer shelf back in 2005, and over the years we have added five more shelves as our capacity needs have grown. Management of these devices has almost always been painless, and NetApp's support has generally been top notch. Having a new drive show up in my mailroom along with a note describing which drive slot it should go in -- because of a predicted potential failure -- without any involvement on my part? As someone who's charged with protecting valuable research data, this level of monitoring helps me sleep at night. And automated snapshots give my users the ability to restore most accidental deletions without ever having to admit to me or our support staff that they've done something careless, which is a win for everyone involved.

In fact, there's really only one downside to NetApp's product line: price. The most recent quote for a full NetApp system I've seen put the total first-year cost for storage (with SATA drives) somewhere close to $10/GB, with significantly reduced annual maintenance fees thereafter. Lately, as NetApp's dispute with Sun over WAFL and ZFS continues to crumble, I've been looking into ZFS and its feasibility for our premium storage. Sun's lowest-end storage servers (with high-performance, low-capacity SAS drives) have a list price of about $4/GB, while commodity hardware prices for SATA-based systems are about $0.30/GB; as our total storage needs continue to skyrocket, it makes sense to see if I can replicate most of the critical abilities of NetApp storage at a more reasonable cost.

We will soon be deploying a storage server with between 32 and 48 TB of raw disk space, shared via NFS and CIFS. My plan is to install OpenSolaris on this hardware and set it up with ZFS RAIDZ2, ZFS's double-parity equivalent of RAID-6. In preparation for this, I've been playing with OpenSolaris VMs, trying to get a feel for how management of ZFS is handled, and so far, I've liked what I've seen.

Over time, I'd like to document the critical steps I've had to go through to get a working production system, along with my conclusions along the way about the viability of this platform as a significant piece of our overall storage infrastructure. For starters, here are a few of the things I've come across so far on my test system, which has two 20GB system disks and five 8GB data disks:

At least when using the OpenSolaris LiveCD installer, root mirrors have to be set up after the fact, and mirrored drives must be set up with Solaris slices

The OpenSolaris installer was a pleasure to use, but its simplicity makes for a lot of additional tuning after the fact to get a usable system, and while ZFS is the default filesystem, you can only install to a single drive. My root partition was c3t0d0s0, so after booting into the system, I set about adding a mirror. The only hitch I ran into was an error about rpool (the Solaris root pool) needing to live only on Solaris slices -- I'd tried to attach an unlabeled c3t1d0 drive to the pool. After labeling the disk and adding a slice 0, I was able to create a mirror with a single command:

pfexec zpool attach -f rpool c3t0d0s0 c3t1d0s0

To watch the progress of data "resilvering" over to the new drive, I used:

zpool status rpool

After a few minutes, this process was done and I moved on with adding a data pool.
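Rather than re-running the status command by hand, the wait can be scripted. The following is only a minimal sketch: the function name wait_for_resilver and the 10-second default interval are made up for illustration, and it assumes that zpool status reports an active resilver with the phrase "resilver in progress". Check the exact wording on your build before relying on it.

```shell
#!/bin/sh
# Sketch: block until a pool is no longer resilvering.
# Assumption: `zpool status` prints the phrase "resilver in progress"
# while data is still being copied to the new half of the mirror.
wait_for_resilver() {
    pool=$1
    interval=${2:-10}   # seconds between checks (arbitrary default)
    while zpool status "$pool" | grep -q "resilver in progress"; do
        sleep "$interval"
    done
    echo "$pool: no resilver in progress"
}

# Usage (commented out so that sourcing this file changes nothing):
# wait_for_resilver rpool 30
```

The loop only tells you that the resilver is no longer running; the full zpool status output is still the place to look for errors.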

RAIDZ pools can't be internally adjusted

Initially I created a RAIDZ2 pool with three disks, the minimum number of devices possible, planning to add the other two later:

pfexec zpool create data raidz2 c3t2d0 c3t3d0 c3t4d0

While I had no problems with this command (creating an 8GB volume from three 8GB disks), adding drives later did not work as I'd hoped:

pfexec zpool add data c3t5d0 c3t6d0
invalid vdev specification
use '-f' to override the following errors:
mismatched replication level: pool uses raidz and new vdev is disk

Sadly, increasing the "width" of a RAIDZ pool is not currently a feature of ZFS, although it would be an important one to have, and is supposedly in the works. This doesn't mean that a RAIDZ pool is forever locked to its original size; rather, it means that to increase the pool's size, you have to add additional RAIDZ "subpools" (the proper ZFS term is top-level virtual devices, or vdevs), which then operate side by side with the original RAIDZ logical device instead of integrating all of the disks into a single logical device. This means that if you plan to build an expandable storage server, you'll need to think carefully about how you want to allocate parity disks, hot spares, and filesystems.
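To make the parity trade-off concrete, here is a rough capacity calculation. It is a sketch only: raidz_usable is a made-up helper, it ignores metadata overhead, and the six-disk layout and the disk name c3t7d0 are hypothetical, since the test system has only five data disks.

```shell
#!/bin/sh
# Sketch: approximate usable space of a single RAIDZ vdev,
# ignoring metadata overhead:
#   usable = (disks - parity) * size of one disk
raidz_usable() {
    disks=$1; size_gb=$2; parity=$3
    if [ "$disks" -le "$parity" ]; then
        echo "need more than $parity disks" >&2
        return 1
    fi
    echo $(( (disks - parity) * size_gb ))
}

# Three 8GB disks in one RAIDZ2 vdev.
raidz_usable 3 8 2                        # prints 8

# Five 8GB disks in one RAIDZ2 vdev.
raidz_usable 5 8 2                        # prints 24

# Hypothetical: the three-disk pool grown by a second three-disk
# RAIDZ2 vdev, for example with
#   pfexec zpool add data raidz2 c3t5d0 c3t6d0 c3t7d0
# Six disks in total, four of them spent on parity.
echo $(( $(raidz_usable 3 8 2) * 2 ))     # prints 16
```

Each RAIDZ2 vdev pays for its own two parity disks, so one wide vdev yields more space than several narrow ones; the cost is that the width has to be decided up front.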

Anyhow, once I discovered this I destroyed the pool and recreated it with all five disks:

zpool create data raidz2 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0

This gave me a 24GB volume made up of five 8GB drives, which is what I'd intended for this test system. Next I set the mountpoint (which defaults to /<pool name>):

zfs set mountpoint=/export/data data

ZFS Deduplication may not be appropriate for many data sets

Next I added a couple of additional configuration options to enable compression and deduplication:

zfs set dedup=on data
zfs set compression=gzip data

Deduplication is very much a new feature in ZFS, and is currently only available in the very latest Release Candidate builds of OpenSolaris. It's not a feature that I would enable in production without careful testing, not only for this reason but also because I'm unsure of how it scales to very large filesystems. (Unlike NetApp, which has a 1-TB volume size limit on its A-SIS dedup technology, ZFS has no such hard limit.) Deduplication requires that data block hashes be searchable (and potentially searched often), and at some point a large enough amount of data will force that index to spill out of RAM and onto disk, which could have a very negative effect on performance. Jeff Bonwick of Sun has written an excellent blog post that goes into more detail about dedup in general and ZFS's implementation.
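For a feel of when that index might outgrow RAM, here is a back-of-the-envelope estimate. Everything in it is an assumption rather than a measurement: ddt_ram_gib is a made-up helper, the figure of roughly 320 bytes of memory per dedup table entry is a commonly quoted rule of thumb, and it takes the worst case where every block is unique.

```shell
#!/bin/sh
# Sketch: rough in-memory size of the ZFS dedup table (DDT), in GiB.
# Assumptions (not measured on any real pool):
#   - one table entry per unique block, and every block is unique
#   - about 320 bytes of RAM per entry (rule-of-thumb figure)
#   - 128 KiB average block size unless overridden
ddt_ram_gib() {
    data_tib=$1
    block_kib=${2:-128}
    entry_bytes=${3:-320}
    # TiB -> KiB, then divide by the block size to get a block count
    blocks=$(( data_tib * 1024 * 1024 * 1024 / block_kib ))
    echo $(( blocks * entry_bytes / 1024 / 1024 / 1024 ))
}

ddt_ram_gib 32        # 32 TiB in 128 KiB blocks: prints 80
ddt_ram_gib 32 8      # 32 TiB in 8 KiB blocks:   prints 1280
```

Under these assumptions the low end of the planned 32-48 TB server would already want tens of gigabytes for the table alone, and small blocks make it far worse. Data that dedups well has fewer unique blocks, so real numbers should be lower; on a live pool, zpool get dedupratio reports the ratio actually achieved.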

While dedup is yet another feature of NetApp's storage that I find very valuable, ZFS dedup is new enough that there aren't enough field reports on it to have a good idea of how well it performs in extreme cases. Once I have real hardware to test this on (my test system lives on deduped NetApp volumes, so performance testing would be all but meaningless, not to mention cost-prohibitive given the space I would need), I hope to be able to shed some light on the matter of scalability; only time is likely to make me comfortable with the maturity factor.

Wednesday, January 13, 2010
