Probabilistic fast file fingerprinting (pfff) is a tool by Konstantin Tretyakov for comparing large files. Currently, MD5 is typically used for this, which requires reading the source and target files in full. Reading full files becomes very slow with large files and large repositories, or when the transport layer is slow, as with a network stack or USB.
Pfff uses the variation within files to quickly assess whether two files are the same or different. It does this by taking samples of the file (the fingerprints) and calculating an identifier from these samples. This works when large files are highly variable. Typical use cases are video, audio, photo, geographical, meteorological, and biological data. In biology, so-called ‘big data’ is gathered through genomic sequencing, transcriptomics, proteomics, etc.
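The sampling idea can be illustrated with a minimal Python sketch. This is not pfff's actual algorithm or interface: the function name `sampled_fingerprint`, its parameters (`samples`, `block_size`, `seed`), and the choice of SHA-256 are all assumptions made for illustration. It hashes the file size together with small blocks read at pseudo-random offsets, so only a tiny fraction of a large file is read.

```python
import hashlib
import os
import random


def sampled_fingerprint(path, samples=64, block_size=16, seed=0):
    """Hypothetical sketch: hash the file size plus small blocks
    read at pseudo-random offsets. Not pfff's real algorithm."""
    size = os.path.getsize(path)
    digest = hashlib.sha256()
    # Include the size, so files of different lengths hash different input.
    digest.update(str(size).encode("ascii") + b":")
    if size == 0:
        return digest.hexdigest()
    # A fixed seed yields the same offsets for every file of the same size.
    rng = random.Random(seed)
    offsets = sorted(rng.randrange(size) for _ in range(samples))
    with open(path, "rb") as f:
        for offset in offsets:
            f.seek(offset)
            # Near the end of the file this may read fewer than block_size bytes.
            digest.update(f.read(block_size))
    return digest.hexdigest()
```

Two files with the same size and the same bytes at the sampled offsets get the same identifier under this sketch, which is why the approach relies on files being highly variable. A change that falls entirely between the sampled blocks would go unnoticed.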
We are using pfff for copying files across networks, and for Cloud and GRID computing. We also use it to find duplicate files in repositories, and to detect files that were truncated during copy. I even use it when copying files to my GPS over USB, a tediously slow process, to check whether the copy has succeeded, and to find duplicates in my music collection.
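Finding duplicates with any fingerprint comes down to grouping paths by identifier. The sketch below assumes a hypothetical helper, `group_duplicates`, which takes a caller-supplied `fingerprint` function mapping a path to an identifier string. How that function obtains the identifier (for example, by running pfff on the file) is left open here.

```python
from collections import defaultdict


def group_duplicates(paths, fingerprint):
    """Hypothetical helper: return lists of paths that share a fingerprint.

    `fingerprint` is any callable that maps a path to an identifier.
    """
    groups = defaultdict(list)
    for path in paths:
        groups[fingerprint(path)].append(path)
    # Only identifiers seen more than once indicate candidate duplicates.
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

Because the fingerprint is probabilistic, the groups are best treated as candidate duplicates. A full comparison of the files in a group can confirm them before anything is deleted.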
The chance of a collision (two different files yielding the same pfff identifier) is of the same order as with MD5. But where MD5 is slow, pfff is fast, particularly with large files and/or a slow transport layer. It is also fast on flash SSD drives(!), because pfff gets faster when seek times are low.
The only time you may not want to use pfff is when validating the full correctness of a file (as in the case of a backup, or when avoiding malware). In that case there is no solution but to read the full file.
In practice, with modern hardware and software techniques, it is very unlikely for a single byte to change in a file (a mutation) without other side effects, such as hardware errors or parser errors. What is likely, however, is truncation, incomplete transfers, and/or partial deletion of files. Pfff will always find these, because fingerprints are guaranteed to change when data shifts.
pfff is free software, written in C++ for Linux, OS X and Windows. It can be downloaded and compiled from [GitHub](https://github.com/pfff/pfff). We expect pfff to become available in the major software distributions soon.