Short dupmerge tutorial

This tutorial covers the Freitag version of dupmerge (from dupmerge2 on), version 1.73 or later.
The author successfully tested dupmerge, compiled with gcc, under 32- and 64-bit Linux and with Cygwin under 32- and 64-bit MS Windows (32-bit Windows XP, 64-bit Windows 7).
The principle is to select the files for duplicate merging (or unmerging) with find from GNU Findutils and to let dupmerge process them, sorting them as a side effect. The files are compared directly, not via checksums, which avoids problems like hash collisions.


1. Download and if it's a zip package, unzip with "unzip <package>".

2. Compilation (with preprocessing, linking etc.): gcc -D_GNU_SOURCE -Wall -O3 -o dupmerge dupmerge.c

gcc should report no warnings or errors (return code 0).

3. Local installation: run su, then enter the root password

cp dupmerge /usr/bin/

chmod a+x /usr/bin/dupmerge

4. Root logout: Ctrl+D

Basic usage

Example I

dupmerge -h

This will give the online help of dupmerge, with all available options.

Example output:

dupmerge version 1.74
This program can reclaim disk space by linking identical files together.
It can also expand all hard links and reads the list of files from standard input.
Example usage: find ./ -type f -print0 | dupmerge 2>&1 | tee ../dupmerge_log.txt
In nodo/read-only mode the correct numbers of files which can be merged and (512 Byte) blocks which can
be reclaimed is on the todo list; the actual values are approx. 2 times higher.
-h   Show this help and exit.
-V   Show version number and exit.
-d   delete multiple files and hard links. Default: Preserve the alphabetically first file name.
-q   Quiet mode.
-n   Nodo mode / read-only mode.
-i   Inverse switch: Expand all hard links in normal mode, replace files by their desparsed version if it is bigger.
-s   Flag for soft linking (default is hard linking). This option is beta because for linking of all equal files more than one run of dupmerge is necessary and the inverse (expanding of soft links) is untested.
-S   Flag for Sparse mode: Replace files by their sparse version if it is smaller.
-c   Combo mode: Default mode +sparse mode. With -i it means inverse mode with unlinking and desparsing.

Example II

find ./ -type f -print0 | dupmerge 2>&1 | tee ../dupmerge_log.txt

This will hard link all equal files with size > 0 in this directory and all subdirectories, NOT following symbolic links. The output will be written to standard output and to ../dupmerge_log.txt.
dupmerge prefers to keep the file with less blocks (e. g. a sparse file), or if they have the same number of blocks, the older one, or if they're the same age, the one with more (hard) links.
It's possible to use other linking policies, but they are not implemented and not planned yet, because good backup/syncing programs like rsync (with option -H) can preserve hard links.
File names with spaces, strange characters and even newlines are no problem because of zero termination of the file names.
It's a good idea to check the output to see the statistics and check if everything is ok, because dupmerge does not care e. g. about a corrupt file system or about the hard link count limit of the file or operating system (this is on the todo list).
I've used dupmerge many times on a million files totalling 1 TB.
I also used it for CDs/DVDs and (compressed) backups because deduplication before compression saves a lot of space and generally deletion instead of linking is not a good idea, e. g. for equal driver files which are needed by hundreds of drivers in hundreds of different directories.
In an archive, e. g. a collection of CDs/DVDs copied on HDD, a common user usually has no rights for writing/deletion, so usually a superuser (root/administrator) has to start dupmerge in an archive.
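The robustness against odd file names can be checked quickly; this sketch uses a throwaway directory and counts NUL terminators instead of newlines:

```shell
# Create a throwaway directory containing one file whose name
# includes a newline character.
dir=$(mktemp -d)
printf 'dummy' > "$dir/a
b.txt"

# A newline-separated listing would split this name into two bogus
# entries; with -print0 every name ends in a NUL byte, so counting
# NUL bytes gives the true number of files.
count=$(find "$dir" -type f -print0 | tr -dc '\0' | wc -c)
echo "$count"
```

The count is 1, although a naive newline count (find | wc -l) would report 2 for this tree.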

The dupmerge output often shows interesting results, e. g. for the annual DVDs from the German Linux Magazin: dupmerge finds many hidden, identical files like ._006-010_news_09.pdf, which contain only 82 bytes and seem to be leftovers from creating the similarly named file, e. g. 003-003_editorial_09.pdf, in the same directory.

Example III

find ./ -type f -print0 | dupmerge -d

This will delete all duplicate files (including hard links) with size > 0 in the current directory and all subdirectories, NOT following symbolic links.
This will preserve the alphabetically first file (including the path).
In the output of dupmerge only the deletion of hard links causes the message "freeing 0 blocks".

To delete all files of size 0 you can use

find ./ -type f -size 0 -exec rm -- {} \;

but you should first see the list of these files via

find ./ -type f -size 0 -exec ls -ilaF {} \;

and to delete empty directories recursively you can use

find . -depth -type d -empty -exec rmdir -- {} \;

but you should first see the (not recursive) list of these directories via

find ./ -type d -empty
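Put together, a cautious cleanup run looks like this; the sketch below uses a throwaway directory so it can be tried safely:

```shell
# Throwaway test tree: one empty file, one empty directory, one real file.
dir=$(mktemp -d)
mkdir "$dir/emptydir"
: > "$dir/emptyfile"
printf 'content' > "$dir/keep.txt"

# Inspect first ...
find "$dir" -type f -size 0 -exec ls -ilaF {} \;
find "$dir" -type d -empty

# ... then delete zero-byte files and remove empty directories, deepest first.
find "$dir" -type f -size 0 -exec rm -- {} \;
find "$dir" -depth -type d -empty -exec rmdir -- {} \;
```

Only the empty file and the empty directory are gone afterwards; keep.txt and its parent directory survive.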

Example IV

We have some identical files, with identical timestamps, in exampledir and exampledir2, and we want to delete the files which exist in both directories only in exampledir2. Because dupmerge preserves the alphabetically first file name (including the path), exampledir has to sort behind exampledir2. A workaround would be a soft link that sorts behind exampledir2, but then find would have to be called with the -L option, which makes it follow all symbolic links and report the properties of the link target instead of the link itself. So that's not a good solution. A better solution is to rename exampledir to switch the alphabetic order:

mv -i exampledir z_last
find exampledir2 z_last -type f -print0 | dupmerge -d
mv z_last exampledir

Because this deletes all duplicates, it also deletes duplicates inside exampledir. If you don't want this, you can use a (temporary) working copy or mount exampledir read-only:

mount -r --rbind /tmp/exampledir /tmp/foobar

But this last solution produces error messages, because deleting files on a read-only mount fails.

Advanced Usage

Example 1

find ./ -type f -print0 | dupmerge -S 2>&1 | tee ../dupmerge_log`date +%y%U%T`.txt

This will replace files with size > 0 in the current directory and all subdirectories by their sparse version if the sparse version is smaller.
This is useful for image files with many zeros in it.
The rest of the command line behind -S duplicates the messages from dupmerge to standard output and to the logfile ../dupmerge_log`date +%y%U%T`.txt.
`date +%y%U%T` puts the last two digits of the year, the week number of the year, and the time (hour, minute, second) into the log file name.
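The resulting log file name can be previewed without running dupmerge; note that %T produces colons, which are legal in Linux file names:

```shell
# Build the time-stamped log file name used above:
# %y = two-digit year, %U = week number (00-53), %T = HH:MM:SS.
logname="../dupmerge_log$(date +%y%U%T).txt"
echo "$logname"
```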

Example 2

find ./ -type f -print0 | dupmerge -c

Combo mode (sparse + duplicate merging): this will replace files with size > 0 in the current directory and all subdirectories by their sparse version if the sparse version is smaller.
Additionally all duplicate files of size > 0 get hard linked.
This is the best method of file based deduplication, which can be used on most file systems!

Example 3

find ./ -type f -print0 | dupmerge -i

This is the inverse mode: all files with more than one hard link get expanded to ordinary files, with the same metadata (modification time etc.).

Example 4

find ./ -type f -print0 | dupmerge -s

Instead of hard linking as in the default mode (without parameters), this soft links duplicate files of size > 0 in the current directory and all subdirectories.

Example 5 and further hints

find . \( -iname "*.avi" -o -iname "*.flv" -o -iname "*.mkv" -o -iname "*.vmv" -o -iname "*.wmv" -o -iname "*.mpg" -o -iname "*.mpeg" \) -print0 | dupmerge -d

This example looks for video files by their (case-insensitive) file name suffix, not by MIME type, in the current directory and all subdirectories, and deletes duplicate files.
This is very useful e. g. for a full HDD/SSD after downloading many files, e. g. at a LAN party, because video files are usually the biggest files and allocate a lot of HDD/SSD space.
And for video files the hard or soft linking of duplicate files instead of deletion usually does not make sense, except in an archive, e. g. a collection of CDs/DVDs copied on HDD/SSD.

An alternative is using the MIME type of the files, e. g. via "file -i" or "file -b", but that's not easy: file -i reports videos as video, application/x-shockwave-flash, application/octet-stream, application/vnd.rn-realmedia or other types, and file -b is no better.
But in general, searching by MIME type is much more reliable than by file name. An example is searching for text files which contain the string "example":

find ./ -type f -print0 | xargs -0 file -iNf - | grep text | cut -d: -f1 | xargs -i{} grep -l "example" "{}"

Without the grep option -l it shows the matching lines (with the string "example").
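The difference is easy to see with two throwaway files:

```shell
# One file contains the search string, the other does not.
dir=$(mktemp -d)
printf 'this is an example line\n' > "$dir/hit.txt"
printf 'nothing to see here\n'     > "$dir/miss.txt"

# With -l, grep prints only the names of the matching files.
grep -l "example" "$dir/hit.txt" "$dir/miss.txt"
# Without -l, it would print the matching line itself.
```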

Example 5 does not work with the option -n instead of -d when the duplicate files are already hard linked, because without a mode option (-s for soft linking etc.) dupmerge is in its default mode (hard linking), and there existing hard links (files with more than one link) are ignored. With the options -n -d, example 5 runs in no-action (nodo) deletion mode and covers hard links as well.
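Whether files are already hard linked can be checked directly; this sketch assumes GNU coreutils stat and uses throwaway files:

```shell
# Create a file and a second hard link to the same inode.
dir=$(mktemp -d)
printf 'payload' > "$dir/a"
ln "$dir/a" "$dir/b"

# %h is the hard link count, %i the inode number; both names
# report the same inode and a link count of 2.
stat -c '%h %i' "$dir/a" "$dir/b"
```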

If the example 5 does not help reducing the disk usage enough, it's a good idea to take a look at the 50 biggest files by

find / -type f -printf '%s %p\n' | sort -srn | head -50 | tee 50biggestfiles.txt

and check the file list (50biggestfiles.txt).
BTW: This listing is also good for grabbing video and music files from the browser cache, especially when you can't simply save the files: first empty the cache (in the preferences), disable the memory cache (e. g. Opera has this option), change to the cache directory, e. g. ~/.opera/cache/ under Linux, and run

find ./ -mount -type f -printf '%s %p\n' | sort -srn | head -5

when the file is buffered, e. g. during the last seconds of the stream. Then you should copy that file before the cached copy gets deleted.
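The size-sorted listing can be tried safely on throwaway files (the names and sizes are made up):

```shell
dir=$(mktemp -d)
head -c 10000 /dev/zero > "$dir/big.bin"
head -c 100   /dev/zero > "$dir/small.bin"

# %s prints the size in bytes, %p the path; sort numerically in
# reverse order, so the biggest file comes first.
find "$dir" -type f -printf '%s %p\n' | sort -srn | head -5
```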

Another way is to look at the biggest directories with the function dud, which I define as the alias

alias dud='du --si --max-depth=1 "${@-.}"|sort -rh'

in the ~/.bashrc.
Under OpenSuSE I have seen that this often does not work, because the numbers are printed with commas instead of dots (German locale). In this case you can use

alias dud='du -B 1000 -c --max-depth=1 "${@-.}"|sort -gr'

The Konqueror "File Size View" (a 2D plot of the disk usage) and programs like JDiskReport are another good way to take a look at the disk usage.

If find or dupmerge cause too much I/O load you should start the command line with "ionice -c3 -n7 -p $$; renice +19 -p $$; ".
In a (Bash) script you can insert it as two separate lines:
renice +19 -p $$
ionice -c3 -n7 -p $$

For further lossless data reduction, compression can be used. The transparent way is a deduplicating and compressing file system like ZFS or LessFS. Compression can also be enabled by setting the compressed file attribute (c), e. g. via chattr. Another possibility is copy-on-write, which is likewise controlled by a file attribute (C).
The other way is using a compressing program like zip or zpaq; zip is also an archiver. Example with packing/unpacking:

zip -r <output-file, e. g. mail.zip> <input-file, e. g. ~/mail>

unzip <input-file, e. g. mail.zip>

Zpaq is newer and one of the best compressing programs, but approx. 10 times slower than zip. The level 1 standard was finalized on Mar. 12, 2009. Because zpaq can only compress, an archiver like tar is necessary for packing/unpacking.
Example with maximum compression and packing/unpacking:

tar -Scf <output-file, e. g. mail.tar> <input-file, e. g. ~/mail>
zpaq c/usr/share/doc/zpaq/examples/max.cfg <output-file, e. g. mail.zpaq> <input-file, e. g. mail.tar>

zpaq x <input-file, e. g. mail.zpaq>
tar -Sxf <input-file, e. g. mail.tar>

The file /usr/share/doc/zpaq/examples/max.cfg is part of the zpaq package.
If compression or decompression is too slow, you can use parallel program versions, which use more than one CPU/GPU core, or Splitjob.
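The -S flag used with tar above matters whenever sparse files are involved; a quick check with GNU tar and a throwaway sparse file:

```shell
# A 10 MB file consisting only of a hole (no data blocks).
dir=$(mktemp -d)
truncate -s 10M "$dir/sparse.img"

# With -S, tar detects the hole and stores almost nothing;
# without it, the archive contains 10 MB of zeros.
tar -Scf "$dir/with_S.tar"    -C "$dir" sparse.img
tar -cf  "$dir/without_S.tar" -C "$dir" sparse.img
ls -l "$dir/with_S.tar" "$dir/without_S.tar"
```

The archive written with -S is dramatically smaller than the one written without it.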