This tutorial is about the freitag version of dupmerge (dupmerge2 on sourceforge.net),
version 1.73 or later.
The author successfully tested dupmerge, compiled with gcc, under 32 and 64 bit
Linux and, with Cygwin, under 32 and 64 bit MS Windows (32 bit Windows XP,
64 bit Windows 7).
The principle is to select the files for duplicate merging (or
unmerging) with find from GNU Findutils and to process these files,
which sorts them as a side effect. The files are compared directly,
not via checksums, to avoid problems like hash collisions.
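As a minimal illustration of direct comparison (this is only the idea, not how dupmerge is implemented internally), two files can be compared byte by byte with cmp; the names file1 and file2 are placeholders:
cmp -s file1 file2 && echo "identical" || echo "different"
The option -s suppresses the normal cmp output, so only the exit status is used.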
Installation
1. Download the package and, if it is a zip archive, unpack it with "unzip <package>".
2. Compilation (with preprocessing, linking etc.): gcc -D_GNU_SOURCE -Wall -O3 -o dupmerge dupmerge.c
gcc should give no warning or error (return code 0).
3. Local installation: su, then enter the root password, and run
cp dupmerge /usr/bin/
chmod a+x /usr/bin/dupmerge
4. Root logout: Ctrl+D
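To check that the build and installation worked (assuming /usr/bin is in the PATH), the version and the online help can be requested:
dupmerge -V
dupmerge -h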
Basic usage
Example I
dupmerge -h
This will give the online help of dupmerge, with all available options.
Example output:
dupmerge version 1.74
This program can reclaim disk space by linking identical files together.
It can also expand all hard links and reads the list of files from standard input.
Example usage: find ./ -type f -print0 | dupmerge 2>&1 | tee ../dupmerge_log.txt
In nodo/read-only mode the correct numbers of files which can be merged and (512 Byte) blocks which can
be reclaimed is on the todo list; the actual values are approx. 2 times higher.
Options:
-h Show this help and exit.
-V Show version number and exit.
-d delete multiple files and hard links. Default: Preserve the alphabetically first file name.
-q Quiet mode.
-n Nodo mode / read-only mode.
-i Inverse switch: Expand all hard links in normal mode, replace files by their desparsed version if it is bigger.
-s Flag for soft linking (default is hard linking). This option is beta because for linking of all equal files more than one run of dupmerge is necessary and the inverse (expanding of soft links) is untested.
-S Flag for Sparse mode: Replace files by their sparse version if it is smaller.
-c Combo mode: Default mode +sparse mode. With -i it means inverse mode with unlinking and desparsing.
Example II
find ./ -type f -print0 | dupmerge 2>&1 | tee ../dupmerge_log.txt
This will hard link all equal files with size > 0 in this directory and all subdirectories,
NOT following symbolic links. The output will be written to
standard out and ../dupmerge_log.txt.
dupmerge prefers to keep the file with less blocks
(e. g. a sparse file), or if they have the same number of blocks, the older one,
or if they're the same age, the one with more (hard) links.
It's possible to use other linking policies, but they are not
implemented and not planned yet, because good backup/syncing
programs like rsync (with option -H) can preserve hard links.
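For example, a backup that keeps the hard links created by dupmerge could be made like this (a sketch, the paths are placeholders):
rsync -aH /data/ /backup/data/
The option -a preserves the usual metadata and -H preserves the hard links.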
File names with spaces, strange characters and even newlines are no problem because of zero termination of
the file names.
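A quick way to convince yourself (the file name here is only a demonstration): create a file whose name contains a newline and run a read-only pass over the directory:
touch $'demo\nfile.txt'
find . -maxdepth 1 -type f -print0 | dupmerge -n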
It's a good idea to check the output for the statistics and to see whether everything went ok, because
dupmerge does not yet care e. g. about a corrupt file system or about the hard link count limit of the
file system or operating system (this is on the todo list).
I've used dupmerge many times with a million files of altogether 1 TB.
I also used it for CDs/DVDs and (compressed) backups because deduplication
before compression saves a lot of space and generally deletion instead
of linking is not a good idea, e. g. for equal driver files which
are needed by hundreds of drivers in hundreds of different directories.
In an archive, e. g. a collection of CDs/DVDs copied to HDD, a normal
user usually has no write/delete permission, so a superuser
(root/administrator) usually has to start dupmerge in an archive.
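A possible invocation on such an archive (the path /srv/archive and the log file name are only examples) is:
su -c 'find /srv/archive -type f -print0 | dupmerge 2>&1 | tee /root/dupmerge_archive.log'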
The dupmerge output often shows interesting results, e. g. for the
yearly DVDs from the German Linux Magazin: dupmerge shows many
hidden and equal files like ._006-010_news_09.pdf, which contain
only 82 bytes and seem to be leftovers from the creation of the
similarly named file, e. g. 003-003_editorial_09.pdf, in the same directory.
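Such small hidden files can be listed before merging or deleting, e. g. with (the name pattern and the size limit of 200 bytes are only examples):
find . -type f -name '._*' -size -200c -ls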
Example III
find ./ -type f -print0 | dupmerge -d
This will delete all duplicate files (including hard links) with size
> 0 in the current directory and all subdirectories, NOT
following symbolic links.
This will preserve the alphabetically first
file (including the path).
In the output of dupmerge only the deletion of hard links causes the
message "freeing 0 blocks".
To delete all files of size 0 you can use
find ./ -type f -size 0 -exec rm -- {} \;
but you should first see the list of these files via
find ./ -type f -size 0 -exec ls -ilaF {} \;
and to delete empty directories recursively you can use
find . -depth -type d -empty -exec rmdir -- {} \;
but you should first see the (not recursive) list of these directories
via
find ./ -type d -empty
Example IV
We have some equal files in exampledir and exampledir2, with the
same timestamps.
To delete the files that exist in both exampledir and exampledir2 from
exampledir, so that they remain only in exampledir2, exampledir has to
sort alphabetically after exampledir2, because dupmerge preserves the
alphabetically first file (including the path).
A workaround would be a soft link to exampledir with a name that sorts
alphabetically after exampledir2, but then find has to be called with the
-L option, and this tells find to follow all symbolic links and to show the
properties of the file a link points to, not of the link itself.
So that's not a good solution.
A better solution is to rename exampledir to switch the alphabetic order:
mv -i exampledir z_last
find exampledir2 z_last -type f -print0 | dupmerge -d
mv z_last exampledir
Because this deletes all duplicates, it also deletes duplicates inside
exampledir.
If you don't want this, you can use a (temporary) working copy or
mount exampledir read-only:
mount -r --rbind /tmp/exampledir /tmp/foobar
But the latter solution produces error messages because the
deletion of read-only mounted files fails.
Advanced Usage
Example 1
find ./ -type f -print0 | dupmerge -S 2>&1 | tee ../dupmerge_log`date +%y%U%T`.txt
This will replace files with size > 0 in the current directory and all subdirectories
by their sparse version if the sparse version is smaller.
This is useful for image files which contain many zeros.
The part after -S duplicates the messages from dupmerge to standard output and to the logfile
../dupmerge_log`date +%y%U%T`.txt.
The `date +%y%U%T` puts the last two digits of the current year, the week
number within the year, and the time (hour, minute, second) into the log file name.
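For example, an expansion like 144612:30:05 would mean year 14, week 46, 12:30:05, so the log file would be named ../dupmerge_log144612:30:05.txt. Note that %T contains colons, which is fine for Linux file systems. You can check the expansion beforehand with:
date +%y%U%T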
Example 2
find ./ -type f -print0 | dupmerge -c
Combo mode (sparse + duplicate merging): This will replace files with size
> 0 in the current directory and all subdirectories
by their sparse version if the sparse version is smaller.
Additionally all duplicate files of size > 0 get hard linked.
This is the best method of file based deduplication, which can be used on most
file systems!
Example 3
find ./ -type f -print0 | dupmerge -i
This is the inverse mode: all hard links (files with a link count above one) get expanded to ordinary files,
with the same metadata (modification time, etc.).
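Which files still have more than one hard link (before or after such a run) can be checked with find's link count test, e. g.:
find ./ -type f -links +1 -ls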
Example 4
find ./ -type f -print0 | dupmerge -s
Instead of hard linking in default mode (without parameters) this does soft linking of duplicate
files of size > 0 in the current directory and all subdirectories.
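The soft links created this way can be reviewed afterwards, e. g. with:
find ./ -type l -ls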
Example 5 and further hints
find . \( -iname "*.avi" -o -iname "*.flv" -o -iname "*.mkv" -o -iname
"*.vmv" -o -iname "*.wmv" -o -iname "*.mpg" -o -iname "*.mpeg" \) -print0 | dupmerge -d
This example looks for video files by the file infix and suffix name
(case insensitive), not by MIME type, in the current directory and all
subdirectories, and deletes the duplicate files.
This is very useful e. g. for a full HDD/SSD after downloading many files,
e. g. at a LAN party, because video files are usually the biggest
files and allocate a lot of HDD/SSD space.
For video files, hard or soft linking of duplicates instead of
deletion usually does not make sense, except in an
archive, e. g. a collection of CDs/DVDs copied to HDD/SSD.
An alternative is using the MIME type of the files, e. g. via
"file -i" or "file -b", but that's not easy, because file -i reports the videos
as video, application/x-shockwave-flash, application/octet-stream,
application/vnd.rn-realmedia or maybe others, and file -b is not
better.
But generally, searching by MIME type is much more reliable than by
file name. An example is searching for text files which contain the string
"example":
find ./ -type f -print0 | xargs -0 file -iNf - | grep text | cut -d: -f1 | xargs -i{} grep -l "example" "{}"
Without the grep option -l it shows the matching lines (with the string
"example").
Example 5 with the option -n instead of -d does not count duplicate
files which are already hard linked, because without a mode option
(-s for soft linking etc.) dupmerge runs in default mode (hard linking),
and in that mode -n means that already hard linked files are ignored.
So, with the options -n -d, example 5 runs in deletion mode with no
action (nodo mode) and also takes hard links into account.
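Such a read-only preview of what example 5 would delete looks like this (shown here with a shortened pattern list; no files are changed):
find . \( -iname "*.avi" -o -iname "*.mpg" \) -print0 | dupmerge -n -d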
If example 5 does not reduce the disk usage enough, it's a good
idea to take a look at the 50 biggest files with
find / -type f -printf '%s %p\n' | sort -srn | head -50 | tee 50biggestfiles.txt
and check the file list (50biggestfiles.txt).
BTW: This listing is also good for grabbing video and music files from
the browser cache, especially when you can't simply save the
files: first you have to empty the cache (in the
preferences), disable the memory cache (e. g. Opera has this
option), change to the cache directory, e. g. ~/.opera/cache/
under Linux, and run
find ./ -mount -type f -printf '%s %p\n' | sort -srn | head -5
when the file is buffered, e. g. during the last seconds of the stream.
Then you should copy that file before the cached copy gets deleted.
Another way is to look at the biggest directories with the function
dud, which I define as the alias
alias dud='du --si --max-depth=1 "${@-.}"|sort -rh'
in the ~/.bashrc.
Under openSUSE I noticed that this often does not work because the
numbers are printed with commas instead of dots (German locale). In
this case you can use
alias dud='du -B 1000 -c --max-depth=1 "${@-.}"|sort -gr'
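A usage example (after reloading the configuration, e. g. with "source ~/.bashrc"): change into the directory of interest and run dud without arguments, e. g.
cd /var && dud
which lists /var and its direct subdirectories, sorted by size.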
The Konqueror "File Size View" (2D plot of the disc usage) and programs like
JDiskReport are another good way to take a look at the disk usage.
If find or dupmerge cause too much I/O load you should start the
command line with "ionice -c3 -n7 -p $$; renice +19 -p $$; ".
In a (Bash) script you can insert it as two separate lines:
renice +19 -p $$
ionice -c3 -n7 -p $$
For further lossless data reduction, compression can be used. The transparent way
is using a deduplicating and compressing file system like ZFS or
LessFS; on file systems which support it, compression can also be
requested per file or directory by setting the compressed attribute (c),
e. g. via chattr. Another possibility is copy-on-write, which is
also controlled by a file attribute (C).
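For example, on a file system which supports it (e. g. btrfs; the path is only a placeholder), the compressed attribute can be set on a directory so that newly written files in it get compressed, and checked with lsattr:
chattr +c /path/to/dir
lsattr -d /path/to/dir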
The other way is using a compressing program like zip or zpaq.
zip is also an archiver. Example with packing/unpacking:
zip -r <output file, e. g. mail.zip> <input file, e. g. ~/mail>
unzip <input file, e. g. mail.zip>
Zpaq is newer and one of the best compressing programs, but
approx. 10 times slower than zip. The level 1
standard was finalized on Mar. 12, 2009. Because zpaq can only
compress, an archiver like tar is necessary for
packing/unpacking.
Example with maximum compression and packing/unpacking:
tar -Scf <output file, e. g. mail.tar> <input file, e. g. ~/mail>
zpaq c/usr/share/doc/zpaq/examples/max.cfg <output file, e. g. mail.zpaq> <input file, e. g. mail.tar>
zpaq x <input file, e. g. mail.zpaq>
tar -Sxf <input file, e. g. mail.tar>
The file /usr/share/doc/zpaq/examples/max.cfg is part of the zpaq package.
If compression or decompression is too slow, you can use parallel program versions, which use more than one CPU/GPU core, or
Splitjob.