Implementing an ‘unsorted uniq’ called ‘ununiq’

| No TrackBacks

In this week I described one specific features of the standard Unix program uniq – it counts only adjacent identical lines as duplicates, although it would be technically interesting to implement such program without this restriction while not sorting the input. Then I wrote three implementations of such program and compared their performance on unpublished private data based on my blog’s access log.

Today I decided to make one of these three programs a nearly replacement for sort | uniq for situations where fast online algorithm without changing the order of input lines is better then possibly smaller memory use. For this I implemented support for some of uniq’s command line options and plan to write a man page.

I named the program ununiq, since this name is easy to pronounce and opposite of uniq. The program’s source is available in my Mercurial repository. Like most of my programs it is licensed under the GNU General Public License, version 3 or later.

Choosing programming language

The choice between three implementations of the basic algorithm – one in C with GLib, one in C++ with Qt and one in Python – was obvious for me. I wanted it to be:

  • faster than sort | uniq (all three programs satisfied this on the test data)
  • easy to write and maintain
  • available on all of my computers without installing too much dependencies (so without Qt, my server doesn’t need it).

Performance of my program in C with minimal C++ and Qt was worse than the one in C with GLib, so instead of using more C++ I decided to use the program written in C with GLib.

Command line options of uniq

Before this project I used an option of uniq only once, when a complicated pipe counting the number of secondary schools in each province from a CSV file contained sort | uniq -c. Now I know how to use other options of this program.

Now my program supports input and output with named files, and also skipping several leading characters of each line (the -d option). I implemented also the -d option which outputs only repeated lines, but it writes them on the second occurrence. The program would be more complicated with this option while preserving the original order, so it currently ignores this order.

The options are accepted both as short options and GNU long ones. Their support was very easy to implement using GLib command line option parser. Long options have the same names as in GNU Coreutils.

Future

Several currently not implemented options will require reading whole input before starting output, e.g. -c which counts how many times a line appears in the input. It should still be faster than sort | uniq. A similar observation may be made when sorting the output of ununiq, when operating on the perltoc man page ununiq | sort is faster by half than sort | uniq, although both make the same output.

The program needs also more documentation and testing. It would be nice with automatic testing of all options, which would also be very helpful for porting to other operating system.

No TrackBacks

TrackBack URL: http://blog.mtjm.eu/cgi-bin/mt/mt-tb.cgi/11