Perfect Match
Simple usage
The simplest case: pmatch . generates a script that removes duplicate files from the current directory (and all its subdirectories):
% pmatch .
rm ./path/to/file1
rm ./yet/another/duplicate/file
...
pmatch itself will not delete any files, but by default it generates a script that removes duplicates. The script affects all but one file in each set of duplicates, so in theory you should not lose any data. If you trust pmatch, you can pipe its output to bash for immediate execution:
pmatch . | bash
Custom script
If you want to do something other than delete files, you may find the -c option useful. Say we would like to copy duplicate files to the /tmp directory:
% pmatch -c 'cp #{d} /tmp' .
cp ./path/to/file1 /tmp
cp ./yet/another/duplicate/file /tmp
...
Don't use this command on a real system: it does not handle duplicates that share a file name, so a later copy would overwrite an earlier one in /tmp.
After the -c switch, provide a string that will be generated for every (but one!) duplicate file. Do not quote the filename; it is written out with all the weird characters escaped. Pay attention to #{d}: this fragment is replaced with the current file name.
d is a shortcut for duplicate: it gives you access to the currently processed file marked as a duplicate. For every group of one or more duplicates there is exactly one file marked as original; you can access its filename with #{o}. The rules for deciding which file is original are described below.
You can use it to generate commands that need both filenames - for example to replace duplicate files with symlinks:
pmatch tmp -c"rm #{d} && ln -s #{o.fullpath} #{d}"
Instead of #{o} I have used #{o.fullpath}, which returns the full path to the file instead of a relative one. By default, #{o} (like #{d}) returns the path relative to the path you provided when running pmatch.
A relative path could produce broken symlinks; the full path fixes that problem.
Advanced script
You can put any valid Ruby code between #{ and }. For example, the following command copies all duplicate files and uppercases their names along the way:
% pmatch -c 'cp #{d} /tmp/#{File.basename(d.to_s).upcase}' .
cp ./path/to/file1 /tmp/FILE1
cp ./yet/another/duplicate/file.png /tmp/FILE.PNG
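To make the template rules above more concrete, here is a minimal sketch of how a -c template could be expanded. It is not pmatch's actual code: FileRef and expand_command are hypothetical names, and I am assuming that the escaping behaves like Ruby's Shellwords.escape and that fullpath resolves against the current directory. The template is evaluated as Ruby, so it should only ever come from a trusted source.

```ruby
require 'shellwords'

# Hypothetical wrapper around a file path (an assumption, not pmatch's real class).
class FileRef
  def initialize(path)
    @path = path
  end

  # Interpolating the object yields the shell-escaped relative path.
  def to_s
    Shellwords.escape(@path)
  end

  # Shell-escaped absolute path, resolved against the current directory.
  def fullpath
    Shellwords.escape(File.expand_path(@path))
  end
end

# Expand a command template for one duplicate (d) and its original (o).
# The template is evaluated as an interpolated heredoc, so any Ruby
# expression inside #{ } has access to d and o.
def expand_command(template, d, o)
  eval("<<PMATCH_TEMPLATE\n#{template}\nPMATCH_TEMPLATE\n", binding).chomp
end
```

For example, expand_command('cp #{d} /tmp', FileRef.new('./my file.txt'), FileRef.new('./orig.txt')) produces a cp command in which the space in the file name is backslash-escaped.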
Which file to mark as 'original'?
Identical files are grouped together. Then one of them is marked as 'original' and the rest are duplicates. Perfect Match lets you influence which file becomes the original using directory priorities and a set of 'secondary choices'.
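The grouping step could look roughly like the sketch below. I am assuming that files count as identical when their MD5 digests match (the --md5-path option hints at md5sum being used); the sketch uses Ruby's built-in Digest::MD5 instead of an external utility, and duplicate_groups is a hypothetical name.

```ruby
require 'digest/md5'

# Group paths by content digest and keep only groups with two or more members.
# Assumption: equal MD5 digests mean identical files.
def duplicate_groups(paths)
  paths.group_by { |path| Digest::MD5.file(path).hexdigest }
       .values
       .select { |group| group.size > 1 }
end
```

Each returned group is a list of paths with the same content; one path per group would then be marked as original.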
You can provide more than one path to pmatch. This causes more directories to be scanned, and it also affects how pmatch chooses which file should not be marked for deletion. The order of the directories dictates their priority. If you run pmatch dirA dirB and the same file exists in both dirA and dirB, the one from dirA is marked as original.
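The directory priority rule can be sketched as follows. This is only an illustration under the assumption that a file belongs to the first listed directory whose path is a prefix of its own; root_priority and pick_original are hypothetical names, not part of pmatch.

```ruby
# Return the index of the first scanned root that contains the path,
# or roots.size when none does (lowest priority).
def root_priority(path, roots)
  roots.index { |root| path == root || path.start_with?(root.chomp('/') + '/') } || roots.size
end

# Pick the original from a group of identical files:
# the file whose root was listed first on the command line.
def pick_original(group, roots)
  group.min_by { |path| root_priority(path, roots) }
end
```

With roots dirA and dirB, a group containing dirB/x.txt and dirA/sub/x.txt yields dirA/sub/x.txt as the original.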
Let's say you want to clean up your collection of OGGs. You have thousands of them stored in the ~/music directory, and you suspect it's full of duplicates. Part of your collection is nicely sorted in ~/music/sorted and the rest is dropped into ~/music/rest. To remove duplicate files, but only from ~/music/rest, you can simply run:
pmatch music/sorted music/rest
That leads to two problems:
- what if all duplicate files are in the lower-priority dir (music/rest)?
- what if there is more than one duplicate in music/sorted?
You can either ignore the problem if you don't really care, in which case a random file is marked as 'original', or you can fine-tune the script using secondary choices. Here are your choices:
- short - pmatch will prefer files with a shorter filename. So given aaa.txt and aaaaa.txt in the same folder, aaa.txt will be marked as original (and aaaaa.txt possibly deleted by your generated script).
- long - like above, but pmatch will prefer files with a longer filename
- deep - prefer files that are deeper in the filesystem hierarchy. E.g. for dir1/dir2/dir3/file1.txt and dir4/file2.txt, pmatch will mark file1.txt as original
- shallow - prefer files that are not 'deeply' located
- dirfull - prefer files that are located in a directory with many other files
- dirempty - prefer files that have the fewest 'siblings'
- random - this one is automatically added after all other given secondary choices, to make sure only one file is marked as original
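Here is a rough sketch of how secondary choices could be applied as tie-breakers, in the order they are given. It covers only the four choices that depend on the path alone (short, long, deep, shallow), and it assumes that "filename" means the base name without directories. secondary_key and pick_by_choices are hypothetical names.

```ruby
# Map a path to a sort key for one secondary choice; smaller keys are preferred.
def secondary_key(choice, path)
  depth = path.split('/').reject { |part| part.empty? || part == '.' }.size
  name_length = File.basename(path).length
  case choice
  when 'short'   then name_length
  when 'long'    then -name_length
  when 'deep'    then -depth
  when 'shallow' then depth
  else raise ArgumentError, "choice not covered by this sketch: #{choice}"
  end
end

# Pick the original by comparing the keys in the order the choices were given;
# a later choice only matters when all earlier ones tie.
def pick_by_choices(group, choices)
  group.min_by { |path| choices.map { |choice| secondary_key(choice, path) } }
end
```

For instance, pick_by_choices(['aaaaa.txt', 'aaa.txt'], ['short']) picks aaa.txt, matching the description of short above.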
Cache
Since v0.3, pmatch features a cache mechanism. The next time you run it with the same parameters, and there were no changes in the scanned directories, the cache will be used. Cache files are stored in ~/.pmatch; every now and then pmatch clears them so they do not eat up your space. At the moment cache files are deleted when there are more than 50 of them in ~/.pmatch.
You can change some options and the cache will still be used. For example, you can scan directories once and then play with --exclude to remove some files from the output. Or maybe you need some changes in --command or --secondary-choice; that's fine.
If you only want to clear the cache, run pmatch -C clear. All cache-related options:
- clear - clear cache directory
- on - default setting - use cache if possible
- off - completely ignore cache
- force - use cache even if there were some changes in scanned directories
Excludes
With the -e (or --exclude) option you can tell pmatch to ignore some files. The argument is a Perl Compatible Regular Expression, without slashes at the beginning and the end. The exclude pattern is case-insensitive and is applied to the whole path, not just the file name. You can specify as many -e options as you want.
For example, to exclude all files ending with .bak, use pmatch -e '\.bak$' .. I have used \. for the dot because in PCRE a dot (.) has a special meaning: it means "any character". The dollar sign ($) at the end is another special character: it matches the end of the string. Finally, I have wrapped the regular expression in quotes so that the shell does not interpret the backslash and the dollar sign.
To exclude all files within dir1 and dir2 directories, you could use pmatch -e '/dir1/' -e '/dir2/' ..
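The exclude rules can be illustrated with a short sketch. It uses Ruby's own Regexp class, which is close to PCRE for simple patterns like these but not identical, and apply_excludes is a hypothetical name rather than pmatch's real code.

```ruby
# Drop every path matched by at least one exclude pattern.
# Patterns are compiled case-insensitively and tested against the whole path.
def apply_excludes(paths, patterns)
  regexps = patterns.map { |pattern| Regexp.new(pattern, Regexp::IGNORECASE) }
  paths.reject { |path| regexps.any? { |regexp| regexp.match?(path) } }
end
```

Called with the patterns '\.bak$' and '/dir1/', this drops ./a.bak, ./b.BAK and ./dir1/x, but keeps ./bak/c.txt because nothing in that path matches either pattern.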
Finally, to see all the options, run pmatch --help:
% pmatch --help
Usage: ./pmatch [options] dir1 dir2 dir3 ...
Or: ./pmatch -C clear
Specific options:
-v, --verbose Run verbosely
-q, --quiet Run quietly
-e, --exclude PATTERN Exclude files matched by regular expressions
-s, --secondary-choice x,y,z Which files should I prefer?
Possible values: short, long, deep, shallow, dirfull, dirempty, random
-c, --command COMMAND Command to display for every (but one) non-unique file
-f, --outfile FILE File to save generated statements. Will overwrite existing file!
--md5-path PATH Path to md5sum utility
-C, --cache OPTION Available OPTIONs: clear, off, on, force. Default: on.