Trying out deduplication

I moved to DragonFly 2.10 over the past few days and tried out deduplication to see what kind of results I would get.  The procedure is outlined below.  I’m using /home as the example, just to reduce the amount of text pasted in.  Here is its disk usage line before deduplication:

/pfs/@@-1:00004     966000640 566434576 399566064    59%    /home

First, upgrade the Hammer pseudo-file systems to version 5, which supports deduplication:

# hammer version-upgrade /home 5

Run a simulated deduplication to see its estimate of the savings:

# hammer dedup-simulate /home
Dedup-simulate /home: objspace 8000000000000000:0000 7fffffffffffffff:ffff pfs_id 4
Dedup-simulate /home succeeded
Simulated dedup ratio = 1.22

That ratio turned out to be pretty accurate for the actual deduplication.  I didn’t time it, unfortunately.  I don’t know if the time taken is proportional to the amount of deduplication or the total volume of data, though I suspect the latter.

# hammer dedup /home
Dedup /home: objspace 8000000000000000:0000 7fffffffffffffff:ffff pfs_id 4
Dedup /home succeeded
Dedup ratio = 1.22
462 GB referenced
378 GB allocated
14 MB skipped
6869 CRC collisions
0 SHA collisions
0 bigblock underflows
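The reported ratio can be cross-checked against the two size figures in that output. This is a minimal sketch which assumes the ratio is simply referenced space divided by allocated space; the numbers fit, but that definition is an assumption rather than something taken from the hammer documentation, and the function name is made up for illustration.

```python
# Figures copied from the "hammer dedup /home" output above.
REFERENCED_GB = 462
ALLOCATED_GB = 378


def dedup_ratio(referenced, allocated):
    """Assumed definition: referenced space divided by allocated space."""
    return referenced / allocated


ratio = dedup_ratio(REFERENCED_GB, ALLOCATED_GB)
difference_gb = REFERENCED_GB - ALLOCATED_GB

print(f"ratio: {ratio:.2f}")                                # ratio: 1.22
print(f"referenced minus allocated: {difference_gb} GB")    # 84 GB
```

Under that assumption, the rounded result matches the 1.22 that hammer printed.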

The end result?

/pfs/@@-1:00004     966000640 505887504 460113136    52%    /home

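That data space is shared across all file systems, and it’s a 1TB disk, so the drop from 59% to 52% used works out to about 7%, or somewhere between 60GB and 70GB.  (The used column fell by about 60.5 million blocks, which is nearer 60GB if those are 1K blocks.)  I was hoping for more, but I don’t have any obviously duplicated data (no local mail store, no on-disk backups), so perhaps this is normal.  Space that I didn’t have before is no bad thing, though.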
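For a more exact figure, the change can be worked out from the two /home lines above. This is a minimal sketch, assuming those lines report 1K (1024-byte) blocks in the order total, used, available; the function name is just for illustration.

```python
# Total and used columns from the two /home lines above.
# Assumption: the numbers are 1K (1024-byte) blocks.
TOTAL_KB = 966000640
USED_BEFORE_KB = 566434576
USED_AFTER_KB = 505887504


def freed_space(used_before_kb, used_after_kb, total_kb):
    """Return (freed KB, freed GiB, freed as a percentage of total)."""
    freed_kb = used_before_kb - used_after_kb
    freed_gib = freed_kb / (1024 * 1024)
    freed_pct = 100 * freed_kb / total_kb
    return freed_kb, freed_gib, freed_pct


freed_kb, freed_gib, freed_pct = freed_space(USED_BEFORE_KB, USED_AFTER_KB, TOTAL_KB)
print(f"freed: {freed_kb} KB, about {freed_gib:.1f} GiB, {freed_pct:.1f}% of the disk")
# freed: 60547072 KB, about 57.7 GiB, 6.3% of the disk
```

The available column grew by the same 60547072 blocks, so the two lines agree with each other.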

Incidentally, I was able to upgrade my installed software from pkgsrc-2009Q4 to pkgsrc-2011Q1 entirely using pkg_radd -u <pkgname>.  Remarkably quick and painless, though pkgin may have been able to do it even faster since it would pull from the same place.

Lazy Reading for 2011/05/29


  • Do you like the Opera browser?  Apparently all it takes is a little misspelling to confuse it with a U.S. daytime talk show host.  See the “Best of Oprah emails to Opera”.  (via)  Mistaken identity on the Internet is always fun.
  • Popular free software licenses, described.  (via)  One of the better, non-polemic descriptions I’ve seen.
  • For the opposite effect, the Free Software Foundation’s license recommendations.  Somehow, the BSD license isn’t even mentioned.  (via)  A commenter at the source link notes that the GNU Free Documentation License isn’t even considered ‘free’ by Debian.  Along those lines, I’ve always thought that GPL licensing creates a perverse incentive to keep your software undocumented.
  • The FreeBSD and NetBSD Foundations have acquired a license for libcxxrt from PathScale, which I assume is for C++ support in conjunction with clang (or pcc?).  This isn’t as much of an issue for DragonFly right now since we’re continuing down the GCC route.
  • Temple of the Roguelike, a searchable database of roguelike games.  It’s an idea that you would totally expect for this genre.  (via trevorjk on EFNet #dragonflybsd)  Also: a roguelikedev subreddit.

Lazy Reading for 2011/05/22

This week, the links are generally fun.