Continuing to toy with rdiff-backup. There are enough areas that you *don't* want to back up (or rather, don't need to, such as icon and font caches) that you wind up with some pretty hideous command-line operations if you go that route. But this post is about another need, namely the need to "compress" the storage size by hard-linking duplicate files in the backups.
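To give a flavour of the hideousness: the exclude list ends up looking something like this (the patterns below are just examples of the kind of caches I mean, not a vetted list; rdiff-backup's --exclude takes extended globs where ** crosses directory separators). Built as a string here so you can review it before running:

```shell
# Illustrative exclude list -- tune the patterns to your own system.
# The command is assembled into a string and echoed for review;
# paste the printed line into a shell to actually run it.
CMD="rdiff-backup \
 --exclude '**/.cache/fontconfig' \
 --exclude '**/.thumbnails' \
 --exclude '**/*.pyc' \
 --exclude '**/.local/share/Trash' \
 /home /backups/home"
echo "$CMD"
```

And that's before you start excluding browser caches, build directories, and the rest.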
This really shows up on my system because of virtualenv. I have hundreds of MB of installations that are all pretty much identical (save for a different "final project", each TurboGears2 project is pretty much the same virtualenv). It would be nice to have a mechanism that fingerprints files and hard-links duplicates to reduce their impact: if two files have identical contents and the same metadata properties (I'm thinking "except timestamps"), hard-link them in the backup, so the file system uses the same bytes for both copies.
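A post-backup pass along these lines could be quite short in Python. This is just a sketch of the idea, not something tested against rdiff-backup's increment layout: it walks a mirror directory, hashes file contents, and hard-links files whose size, permissions, and ownership also match (timestamps deliberately ignored, per the above):

```python
import hashlib
import os


def fingerprint(path, chunk_size=1 << 16):
    """SHA-256 of a file's contents, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def hardlink_duplicates(root):
    """Hard-link regular files under root whose contents and metadata
    (size, mode, owner -- but not timestamps) match.

    Returns the number of files replaced by hard links.
    """
    first_seen = {}  # (size, mode, uid, gid, hash) -> first path seen
    linked = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that isn't a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            stat = os.lstat(path)
            key = (stat.st_size, stat.st_mode, stat.st_uid,
                   stat.st_gid, fingerprint(path))
            original = first_seen.setdefault(key, path)
            if original != path and not os.path.samefile(original, path):
                os.remove(path)
                os.link(original, path)
                linked += 1
    return linked
```

In a real pass you'd want to group files by size first and only hash files whose sizes collide, rather than fingerprinting everything as this sketch does.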
This would work for e.g. music/media files as well: if two files are identical, regardless of where they sit on disk, just record a link to the same stored copy. IIUC this is approximately what git does for its object storage, so I suppose you could build something like this on a git back-end. A simple post-backup hook for rdiff-backup (or something that drives an rdiff-backup back-end) doing the fingerprinting/compression would work about as well for me.