Yesterday I wrote about listing unique lines of an unsorted file or pipe. That post contained a trivial Python script which lists unique lines of its input without sorting. Then I wrote that such program could be faster than sort -u, which obtains the same lines but sorts them, and could be more useful in some situations. So I decided to implement it in C.
The Python script looks like this:
import sys last = set() for line in sys.stdin: if line not in last: last.add(line) sys.stdout.write(line)
It is trivial to implement in C, except that it uses a set. Here set is just a hash table with any values and keys which are input lines. Since the only implementation of hash tables in C standard library requires stating its maximal size and does not support resizing (except for copying data to a larger hash table, but having more than one hash table with this is a GNU extension), I decided to use GLib’s hash table implementation. Since GLib is used by GTK+, I decided to wrote the same program also in C++ with Qt, to compare which one would be more elegant and faster.
To test performance of these programs, I used a file containing all IP addresses from this blog’s access logs (obtained by cut -d ' ' -f 1 access_log > ip-addresses). Since this was small, I copied this ten times to make a larger input file. It is nearly real world data and I remember using sort | uniq with such data. IP addresses can be sorted, but their order does not have any significant meaning here. Therefore I will measure the performance of my programs using just commands like time ./guniq < ips > ips-guniq and computed average real time of three runs. To check if results of all these programs all the same, I used diff.
For the Python implementation average time was 0.306 second. For the C one with GLib it was 0.207 second. For program written in C++ with Qt it took 0.267 second.
For the same file sort -u took 1.581 second. This clearly shows that getting unique lines of a file is faster without sorting, at least with this file. Performance of my three programs may result from lack of appropriate optimizations, for the C++ one copying of all input lines look as good reason for its longer average run time. And in case of elegance, in my opinion the Python script is best and the C++ program looks slighly simpler than the C one. Most probably the C++ program could be simpler if it used I/O interfaces typical to C++, but in this case it was simpler to use the same ones as in C.
All these programs are available in my Mercurial repository, they are licensed under GNU General Public License, version 3 or later.
