The common Unix program uniq by defaults outputs its input with repeated lines omitted. It is useful in many situations, but it supports only sorted files.
In many cases it is appropriate to sort the file, by sort | uniq or sort -u. But there are situations where the order of the lines is important. How to get the first occurrence of each line in such case, in the original order?
This trivial Python script shows how uniq might work without any options:
import sys last = None for line in sys.stdin: if line == last: continue else: last = line sys.stdout.write(line)
When tested on the largest man page in my system, perltoc.1, it gave the same output as uniq (although it was slower). This algorithm needs to store only two lines at once and is completely online, i.e. to output a line it does not need following input lines. It is clearly better for infinite texts, but for finite ones it could be made more useful when storing all previous lines.
This script stores all previous lines in a set:
import sys last = set() for line in sys.stdin: if line in last: continue else: last.add(line) sys.stdout.write(line)
For sorted file it produces exactly the same output as uniq or the previous script. For unsorted files it omits repetitions of all previous lines, not only the last one.
This algorithm also is online, since it stores only previous lines, but it has an disadvantage in comparison with other ones – it uses more memory than is used for storing the file (for efficiency reasons the set as a hash table is larger than its contents).
This could be a problem, but sorting also requires storing the whole file in memory (excluding more complicated external sorting). So when the order of lines is not important, this script would use similar amount of memory as sort -u, but it would start its output much more quickly. With optimized implementation in C it would be even faster then sort -u (since all sorting algorithms need more time than linear proportional to the input size, while this could with appropriate hash table implementation usually work in time proportional to input size).
