Perfect Match

Simple usage

The simplest case: pmatch . finds duplicate files in the current directory (and all subdirectories) and prints the commands that would remove them:

% pmatch .
rm ./path/to/file1
rm ./yet/another/duplicate/file
...

pmatch itself will not delete any files, but it will (by default) generate a script to remove duplicates. The script affects all but one file in every set of duplicates, so in theory you should not lose any data. If you trust pmatch, you can pipe its output to bash for immediate execution: pmatch . | bash
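To make the idea concrete, here is a minimal sketch of how duplicates could be detected and turned into a removal script. It is not pmatch's actual code: grouping by MD5 digest is an assumption (suggested only by the --md5-path option), and the names duplicate_groups and removal_script are made up for illustration.

```ruby
require 'digest/md5'
require 'find'
require 'shellwords'

# Group regular files under the given roots by the MD5 digest of their content.
# Returns only the groups that contain more than one file (assumed approach).
def duplicate_groups(*roots)
  by_digest = Hash.new { |hash, key| hash[key] = [] }
  roots.each do |root|
    Find.find(root) do |path|
      # Skip directories and symlinks; only real files are compared.
      next unless File.file?(path) && !File.symlink?(path)
      by_digest[Digest::MD5.file(path).hexdigest] << path
    end
  end
  by_digest.values.select { |paths| paths.size > 1 }
end

# Emit one "rm" line for every file in a group except the first one,
# so exactly one copy of each file survives.
def removal_script(groups)
  groups.flat_map do |paths|
    paths.drop(1).map { |path| "rm #{Shellwords.escape(path)}" }
  end
end
```

Calling removal_script(duplicate_groups('.')) returns the lines of a script similar to the output above. Which file of a group is kept here is arbitrary (the first one found); pmatch has its own rules for that.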


Custom script

If you want to do something other than deleting files, you may find the -c option useful. Say we would like to copy duplicate files to the /tmp directory:

% pmatch -c 'cp #{d} /tmp' .
cp ./path/to/file1 /tmp
cp ./yet/another/duplicate/file /tmp
...

Don't use this particular command on a real system - it does not take care of duplicate file names, so two files with the same name would overwrite each other in /tmp.

After the -c switch, provide a string that will be generated for every (but one!) duplicate file. Do not quote the filename; it will be written out with all the weird characters escaped. Pay attention to #{d} - this fragment will be replaced with the current file name. d is a shortcut for duplicate - it lets you access the currently processed file marked as a duplicate. For every group of one or more duplicates there will be exactly one file marked as original - you can access its filename with #{o}. The rules for deciding which file is the original are described below.

You can use it to generate commands that need both filenames - for example to replace duplicate files with symlinks:

pmatch tmp -c"rm #{d} && ln -s #{o.fullpath} #{d}"

Instead of #{o} I have used #{o.fullpath}, which returns the full path to the file instead of a relative one. By default #{o} (and likewise #{d}) returns a path relative to the path you provided when running pmatch. A relative path could produce broken symlinks - the full path fixes that problem.
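Why the relative path breaks: a relative symlink target is resolved against the directory that contains the link, not against the directory you ran the command from. The sketch below only illustrates the difference; symlink_command and its base_dir parameter are hypothetical names, not part of pmatch.

```ruby
require 'shellwords'

# Build the "replace duplicate with a symlink" command for one pair of files.
# base_dir stands in for the directory pmatch was started from (assumption).
def symlink_command(duplicate, original, base_dir = Dir.pwd)
  # Expanding to an absolute path keeps the link valid wherever it lives.
  target = File.expand_path(original, base_dir)
  dup = Shellwords.escape(duplicate)
  "rm #{dup} && ln -s #{Shellwords.escape(target)} #{dup}"
end

puts symlink_command('tmp/b/copy.ogg', 'tmp/a/song.ogg', '/home/user')
```

With the relative target tmp/a/song.ogg, the link created in tmp/b would point at tmp/b/tmp/a/song.ogg, which does not exist.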


Advanced script

You can put any valid Ruby code between #{ and }. For example, the following command will copy all duplicate files and convert their names to uppercase along the way:

% pmatch -c 'cp #{d} /tmp/#{File.basename(d.to_s).upcase}' .
cp ./path/to/file1 /tmp/FILE1
cp ./yet/another/duplicate/file.png /tmp/FILE.PNG
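
One way such a template could be expanded is to evaluate it as a double-quoted Ruby string with d and o in scope. The following is only a sketch under that assumption: FileRef and expand are hypothetical names, and pmatch's real objects may behave differently.

```ruby
require 'shellwords'

# Hypothetical stand-in for the objects exposed to the template as d and o.
class FileRef
  def initialize(path)
    @path = path
  end

  # Interpolating the object (#{d}) yields the shell-escaped relative path.
  def to_s
    Shellwords.escape(@path)
  end

  # Shell-escaped absolute path, as used by #{o.fullpath}.
  def fullpath
    Shellwords.escape(File.expand_path(@path))
  end
end

# Expand a command template by evaluating it as an interpolated heredoc.
# The parameters d and o are visible to the template through the binding.
def expand(template, d, o)
  source = "<<\"PMATCH_TEMPLATE\"\n#{template}\nPMATCH_TEMPLATE\n"
  eval(source, binding).chomp
end

d = FileRef.new('./yet/another/duplicate/file.png')
o = FileRef.new('./path/to/original.png')
puts expand('cp #{d} /tmp/#{File.basename(d.to_s).upcase}', d, o)
```

Because the template is evaluated as code, it can do anything Ruby can do, so only run templates you wrote yourself.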

Which file to mark as 'original'?

Identical files are grouped together. Then one of them is marked as 'original' and the rest are duplicates. Perfect Match lets you influence which file becomes the original using directory priorities and a set of 'secondary choices'.

You can provide more than one path to pmatch. This causes more directories to be scanned, but it also affects the way pmatch chooses which duplicate should not be marked for deletion. The order of the directories dictates their priority. If you run pmatch dirA dirB and the same file exists in both dirA and dirB, the one from dirA will be marked as original.
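
The priority rule can be sketched as follows. This is an illustration of the rule as described above, not pmatch's implementation; pick_original is a made-up name.

```ruby
# Pick the original from a group of identical files: the file that lives
# under the earliest directory given on the command line wins.
def pick_original(paths, dirs)
  roots = dirs.map { |dir| File.expand_path(dir) }
  paths.min_by do |path|
    full = File.expand_path(path)
    # Rank = position of the first directory that contains the file;
    # files outside every directory sort last.
    roots.index { |root| full == root || full.start_with?(root + File::SEPARATOR) } || roots.size
  end
end

group = ['dirB/song.ogg', 'dirA/song.ogg']
puts pick_original(group, ['dirA', 'dirB'])
```

When several files of a group sit in the same top-priority directory, this rule alone cannot decide between them.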

Let's say you want to clean up your collection of OGGs. You have thousands of them stored in the ~/music directory - and you suspect it's full of duplicates. Part of your collection is nicely sorted in ~/music/sorted and the rest is dropped into ~/music/rest. To remove duplicate files, but only from ~/music/rest, you can simply run:

pmatch music/sorted music/rest

That leads to two problems:

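You can either ignore the problem if you don't really care, in which case a random file will be marked as 'original', or you can fine-tune the choice using secondary options (-s or --secondary-choice). The possible values are: short, long, deep, shallow, dirfull, dirempty, random.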


Cache

Since v0.3 pmatch features a cache mechanism. The next time you run it with the same parameters, and there were no changes in the scanned directories, the cache will be used. Cache files are stored in ~/.pmatch. Every now and then pmatch clears them so they do not eat up your disk space. At the moment cache files are deleted when there are more than 50 of them in ~/.pmatch.

You can change some options and the cache will still be used. For example, you can scan the directories once and then play with --exclude to remove some files from the output. Or maybe you need some changes in --command or --secondary-choice - that's fine.

If you only want to clear the cache, run: pmatch -C clear. The -C (or --cache) option accepts the following values: clear, off, on, force. The default is on.


Excludes

With the -e (or --exclude) option you can tell pmatch to ignore some files. The argument is a Perl Compatible Regular Expression, without slashes at the beginning and the end. The exclude pattern is case-insensitive and is applied to the whole path, not just the file name. You can specify as many -e options as you want.

For example, to exclude all files ending with .bak, use pmatch -e '\.bak$' . (the final dot is the directory to scan). I have used \. for the dot because in PCRE a dot (.) has a special meaning - it matches any character. The dollar sign ($) at the end is another special character - it matches the end of the string. Finally, I have wrapped the regular expression in single quotes so that the shell passes it to pmatch unchanged.

To exclude all files within the dir1 and dir2 directories, you could use pmatch -e '/dir1/' -e '/dir2/' . (again, the final dot is the directory to scan).
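
If you want to try a pattern before handing it to pmatch, a few lines of Ruby are enough. This is a rough sketch: Ruby's regular expressions are close to PCRE but not identical (for instance, $ matches at the end of a line rather than only at the end of the string), and build_excludes and excluded? are hypothetical helper names.

```ruby
# Compile exclude patterns as described above: case-insensitive and
# matched against the whole path, not just the file name.
def build_excludes(patterns)
  patterns.map { |pattern| Regexp.new(pattern, Regexp::IGNORECASE) }
end

# A path is excluded as soon as any one of the patterns matches it.
def excluded?(path, excludes)
  excludes.any? { |regex| regex.match?(path) }
end

excludes = build_excludes(['\.bak$', '/dir1/'])
%w[./notes/todo.BAK ./dir1/photo.png ./mydir1/photo.png].each do |path|
  puts "#{path}: #{excluded?(path, excludes) ? 'excluded' : 'kept'}"
end
```

Note that '/dir1/' does not match ./mydir1/photo.png, because the pattern requires a slash directly before dir1.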


Finally, to see all the options, run pmatch --help:

% pmatch --help
Usage: ./pmatch [options] dir1 dir2 dir3 ...
Or:    ./pmatch -C clear

Specific options:
    -v, --verbose                    Run verbosely
    -q, --quiet                      Run quietly
    -e, --exclude PATTERN            Exclude files matched by regular expressions
    -s, --secondary-choice x,y,z     Which files should I prefer? 
                                     Possible values: short, long, deep, shallow, dirfull, dirempty, random
    -c, --command COMMAND            Command to display for every (but one) non-unique file
    -f, --outfile FILE               File to save generated statements. Will overwrite existing file!
        --md5-path PATH              Path to md5sum utility
    -C, --cache OPTION               Available OPTIONs: clear, off, on, force. Default: on.

