The standard POSIX program uniq can list the repeated lines of its input. Its implementation in GNU Coreutils can also list every occurrence of such lines. How could this be used to list files with identical content located in a given directory?
The solution is to feed uniq, with appropriate options, a file in which each line corresponds to the content of one of the files to be compared, so that equal lines mean equal contents. Since uniq compares only adjacent lines, this file has to be sorted.
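The adjacency requirement is easy to see on a toy input. The following is only a small sketch with made-up lines and variable names, not part of the final pipeline:

```shell
# The two "a" lines are not adjacent, so uniq -d reports nothing.
unsorted_dups=$(printf 'a\nb\na\n' | uniq -d)

# After sorting, the two "a" lines are adjacent and are reported once.
sorted_dups=$(printf 'a\nb\na\n' | sort | uniq -d)

printf 'unsorted: [%s]\nsorted: [%s]\n' "$unsorted_dups" "$sorted_dups"
```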
Therefore a one-to-one function from files to lines should be used. Cryptographic hash functions are treated as such, although they aren’t: they accept any finite byte sequence as input and output a byte sequence of constant size, so different inputs with the same output must exist. Many hashes are now considered too insecure for important systems (i.e. it is easy to obtain reverse mappings or different inputs with the same output), but SHA-512 is currently used in such systems. GNU Coreutils has a program called sha512sum that computes this hash of the files given on the command line, so it can easily be used for this task.
Each output line of this program consists of the 512-bit hash in hexadecimal (i.e. 128 hexadecimal digits) and the file name, separated by two characters (two spaces in the default text mode). Clearly, it would be useful to know the names of the repeated files, not only their SHA-512 hashes, so the whole output of sha512sum will be passed to uniq.
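The shape of an output line can be checked on a throwaway file. This is a sketch that assumes GNU sha512sum in its default mode and a temporary path without backslashes or newlines; the file name sample.txt and the variable names are made up for illustration:

```shell
# Hash a temporary file and take the output line apart.
tmp=$(mktemp -d)
printf 'hello\n' > "$tmp/sample.txt"

line=$(sha512sum "$tmp/sample.txt")
hash=${line%% *}     # everything before the first space
name=${line#*  }     # everything after the first two-space separator

printf 'hash length: %s\nfile name: %s\n' "${#hash}" "$name"
rm -r "$tmp"
```

The hash length is what later allows uniq to compare only a fixed-width prefix of each line.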
The list of files for sha512sum can be generated by find "$dir" -type f, where $dir is any directory. This command outputs each file in this directory or its subdirectories. The test -type f requests it to output only regular files, since they are the only ones with content.
The whole pipe listing names of duplicate files is:
find "$dir" -type f -print0 | xargs -0 sha512sum \
    | sort | uniq -w 128 -d --all-repeated=separate \
    | sed 's/^[0-9a-f]\+ \+//'
The command xargs passes its input as arguments to sha512sum. Since file names may contain spaces, the option -print0 of find and the option -0 of xargs request them to separate the file names with zero bytes, so that a file name with spaces is not treated as the names of two files.
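The difference can be observed by counting the arguments that xargs hands over. This is a sketch with a made-up file name, assuming the temporary directory path itself contains no whitespace:

```shell
# Create one file whose name contains a space.
tmp=$(mktemp -d)
: > "$tmp/a b.txt"

# Whitespace-separated input: xargs splits the single name at the space.
plain=$(find "$tmp" -type f | xargs sh -c 'echo $#' sh)

# Zero-byte-separated input: the name arrives as one argument.
nul=$(find "$tmp" -type f -print0 | xargs -0 sh -c 'echo $#' sh)

printf 'without -print0/-0: %s argument(s)\nwith -print0/-0: %s argument(s)\n' "$plain" "$nul"
rm -r "$tmp"
```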
The options of uniq do what was described before: -w 128 limits the comparison to the first 128 bytes of each line, i.e. to the hash; -d omits unique lines from the output; and --all-repeated=separate outputs all repeated lines, with groups separated by blank lines (of these three options only -d is required by POSIX and supported by the uniq implementations used in BSDs). The final sed expression removes the hash, which is probably not useful, from the output.
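The effect of these options can be tried on a small input. The sketch below assumes GNU uniq and uses made-up four-character stand-ins for the hashes, so the width is 4 instead of 128:

```shell
# Already sorted "hash  name" lines; bbbb occurs only once.
out=$(printf '%s\n' 'aaaa  one' 'aaaa  two' 'bbbb  three' 'cccc  four' 'cccc  five' |
    uniq -w 4 -d --all-repeated=separate)

printf '%s\n' "$out"
# Prints:
# aaaa  one
# aaaa  two
#
# cccc  four
# cccc  five
```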
Here sort is used only because of the way uniq works. It doesn’t look useful to have different files sorted according to their SHA-512 sums. It might be useful to have the files in each duplicate group sorted alphabetically, but this could probably be done faster, since multiple sorts of small sequences are faster than a single sort of their concatenation (here the output will also probably be much smaller than the input: on my system, running the above pipe on /usr/share/man gave only about one third as many lines as find /usr/share/man -type f, including the extra blank lines). Programs like my ununiq find all repeated lines of an unsorted input, but ununiq does not support the options necessary for this task.
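One possible way to avoid the sort is to group the lines in memory with awk instead of uniq. The following is only a sketch of that alternative technique, not how ununiq works; the function name group_by_hash and its optional width argument are made up here, and it assumes the default two-character separator and file names without backslashes or newlines (sha512sum escapes those and changes the line format):

```shell
# Read "hash  name" lines in any order and print the names of files whose
# hash occurs more than once, one group per hash, separated by blank lines.
# $1 is the hash width in characters (default 128, as for SHA-512).
group_by_hash() {
    awk -v w="${1:-128}" '
    {
        hash = substr($0, 1, w)
        name = substr($0, w + 3)          # skip the hash and the separator
        if (++count[hash] == 1)
            order[++n] = hash             # remember first-seen order
        names[hash] = names[hash] name "\n"
    }
    END {
        first = 1
        for (i = 1; i <= n; i++) {
            h = order[i]
            if (count[h] < 2)
                continue                  # unique content, skip
            if (!first)
                print ""                  # blank line between groups
            first = 0
            printf "%s", names[h]
        }
    }'
}
```

It could replace the sort, uniq and sed stages, e.g. find "$dir" -type f -print0 | xargs -0 sha512sum | group_by_hash. The price is that all lines are kept in memory until the end of the input.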
