Recently in system administration Category

Finding duplicate files using GNU uniq

| No Comments | No TrackBacks

The standard POSIX program uniq may list repeated lines from its output. Its implementation in GNU Coreutils supports listing all occurrences of such lines. How could this be used to list files of the same content located in a given directory?

The solution is to input to uniq with appropriate options a file consisting of lines isomorphic to the contents of files to be compared. Since uniq compares only adjacent lines, this file would have to be sorted.

Therefore a one-to-one function from files to lines should be used. Cryptographic hash functions are treated as such, although they aren’t – they accept any finite byte sequence as input and output a constant size byte sequence. There are many hashes which are now considered insecure enough for important systems (e.g. when it is easy to obtain reverse mappings or different inputs with the same output), but SHA-512 is currently used in such systems. GNU Coreutils have a program called sha512sum computing this hash of files given on the command line, so it can be easily used for this task.

Each output line of this program consists of 512 bit hexadecimal hash (i.e. 128 hexadecimal digits) and the file name, separated by several spaces. Clearly, it would be useful to know the names of repeated files, not only their SHA-512 hashes, so the output of sha512sum will be wholly passed to uniq.

The list of files for sha512sum can be generated by find $dir -type f where $dir is any directory. This command will output each file in this directory or its subdirs. The test -type f requests it to output only regular files, since they are the only ones with content.

The whole pipe listing names of duplicate files is:

find $dir -type f -print0 | xargs -0 sha512sum \
    | sort | uniq -w 128 -d --all-repeated=separate \
    | sed 's/^[0-9a-f]\+ \+//'

The command xargs passes its input as arguments to sha512sum. Since file names may contain spaces, the options -print0 of find and -0 of xargs will request them to separate the file names with zero bytes to avoid treating a file name with spaces as names of two files.

The arguments of uniq do things described before – -w 128 limits the comparison to first 128 bytes of each line, i.e. to the hash, -d omits unique lines from the output, and --all-repeated=separate outputs all repeated lines, separated by blank lines (of these three options only -d is required by POSIX and supported by uniqs used in BSDs). The final sed expression omits the hash from the output which is probably not useful.

Here sort is used only due to the way in which uniq works. It doesn’t look useful to have different files sorted according to their SHA-512 sums. It might be useful to have files in each duplicate group sorted alphabetically, but this probably could be done faster, since multiple sorts of small sequences are faster then a single sort of their sum (here also the output will be probably much smaller than the input – on my system running the above pipe on /usr/share/man gave only about one third of lines of the output of find /usr/share/man -type f, including extra blank lines). Programs like my ununiq find all repeated lines of an unsorted input, but it does not support the options necessary for this task.

Implementing an ‘unsorted uniq’ called ‘ununiq’

| No Comments | No TrackBacks

In this week I described one specific features of the standard Unix program uniq – it counts only adjacent identical lines as duplicates, although it would be technically interesting to implement such program without this restriction while not sorting the input. Then I wrote three implementations of such program and compared their performance on unpublished private data based on my blog’s access log.

Today I decided to make one of these three programs a nearly replacement for sort | uniq for situations where fast online algorithm without changing the order of input lines is better then possibly smaller memory use. For this I implemented support for some of uniq’s command line options and plan to write a man page.

I named the program ununiq, since this name is easy to pronounce and opposite of uniq. The program’s source is available in my Mercurial repository. Like most of my programs it is licensed under the GNU General Public License, version 3 or later.

Choosing programming language

The choice between three implementations of the basic algorithm – one in C with GLib, one in C++ with Qt and one in Python – was obvious for me. I wanted it to be:

  • faster than sort | uniq (all three programs satisfied this on the test data)
  • easy to write and maintain
  • available on all of my computers without installing too much dependencies (so without Qt, my server doesn’t need it).

Performance of my program in C with minimal C++ and Qt was worse than the one in C with GLib, so instead of using more C++ I decided to use the program written in C with GLib.

Command line options of uniq

Before this project I used an option of uniq only once, when a complicated pipe counting the number of secondary schools in each province from a CSV file contained sort | uniq -c. Now I know how to use other options of this program.

Now my program supports input and output with named files, and also skipping several leading characters of each line (the -d option). I implemented also the -d option which outputs only repeated lines, but it writes them on the second occurrence. The program would be more complicated with this option while preserving the original order, so it currently ignores this order.

The options are accepted both as short options and GNU long ones. Their support was very easy to implement using GLib command line option parser. Long options have the same names as in GNU Coreutils.

Future

Several currently not implemented options will require reading whole input before starting output, e.g. -c which counts how many times a line appears in the input. It should still be faster than sort | uniq. A similar observation may be made when sorting the output of ununiq, when operating on the perltoc man page ununiq | sort is faster by half than sort | uniq, although both make the same output.

The program needs also more documentation and testing. It would be nice with automatic testing of all options, which would also be very helpful for porting to other operating system.

Installing ReactOS on KVM virtual machine on Gentoo

| No Comments | 2 TrackBacks

It is useful to access more then one operating system when writing a program designed to be used with a different one. This is one of situations where platform virtualization is helpful. This post describes how I used KVM on a Gentoo GNU/Linux host to use ReactOS on a virtual machine.

In Gentoo’s Portage tree KVM userspace part is available as app-emulation/kvm. By default it uses its own modules, but this can be disabled by emerging it with USE flags havekernel -modules when the kernel has appropriate options enabled.

The ebuild states that user running KVM must be in the kvm group. So I added my user to this group using recommended gpasswd -a <USER> kvm To avoid logging out, I used su - <USER> in a terminal emulator, but this did not solve the problem. Then I read that this package also added rules for udev changing the group of /dev/kvm to kvm. Since I hadn’t restarted udev, I used chgrp kvm /dev/kvm as root to have this set before next reboot. Then KVM worked correctly.

KVM looks simple to use without reading the whole manual. The command kvm-img create c.img 4G made a disk image of $ GiB in the file named c.img, then I could use KVM with this by just specifying this file as a disk of the virtual machine. I used kvm -hda c.img -cdrom ReactOS.iso to install this system. It worked correctly and except for nicer keybindings similar to installation of Microsoft Windows.

After starting the newly installed system I noticed a trivial problem, kvm by default uses UTC time, while ReactOS (like Windows) supports only localtime. After checking it in the man page, I added the -localtime option to this command.

Since network access would be useful, I tried to configure it. Fortunately, KVM has not only a man page but also several HOWTOs on its website, one of them shows how to configure networking. I used user networking with additional options -net nic -net user, since it is the simplest to configure. But ReactOS didn’t have a driver for the default virtual NIC. When I had similar problem with Windows XP on real hardware, I changed the NIC to a different one, then another (from newest to oldest one), then to a one with a CD containing drivers. ReactOS website has a list of supported NICs which shows that most of them require downloading non-free drivers. But PCnet has included drivers and is supported by KVM, so I changed -net nic to -net nic,model=pcnet and it worked correctly.

Documentation of KVM shows that much more can be done in this case. It is nice to use working software with man pages encouraging learning about it.

Listing unique lines of an unsorted file or pipe

| No Comments | No TrackBacks

The common Unix program uniq by defaults outputs its input with repeated lines omitted. It is useful in many situations, but it supports only sorted files.

In many cases it is appropriate to sort the file, by sort | uniq or sort -u. But there are situations where the order of the lines is important. How to get the first occurrence of each line in such case, in the original order?

This trivial Python script shows how uniq might work without any options:

import sys

last = None
for line in sys.stdin:
    if line == last:
        continue
    else:
        last = line
        sys.stdout.write(line)

When tested on the largest man page in my system, perltoc.1, it gave the same output as uniq (although it was slower). This algorithm needs to store only two lines at once and is completely online, i.e. to output a line it does not need following input lines. It is clearly better for infinite texts, but for finite ones it could be made more useful when storing all previous lines.

This script stores all previous lines in a set:

import sys

last = set()
for line in sys.stdin:
    if line in last:
        continue
    else:
        last.add(line)
        sys.stdout.write(line)

For sorted file it produces exactly the same output as uniq or the previous script. For unsorted files it omits repetitions of all previous lines, not only the last one.

This algorithm also is online, since it stores only previous lines, but it has an disadvantage in comparison with other ones – it uses more memory than is used for storing the file (for efficiency reasons the set as a hash table is larger than its contents).

This could be a problem, but sorting also requires storing the whole file in memory (excluding more complicated external sorting). So when the order of lines is not important, this script would use similar amount of memory as sort -u, but it would start its output much more quickly. With optimized implementation in C it would be even faster then sort -u (since all sorting algorithms need more time than linear proportional to the input size, while this could with appropriate hash table implementation usually work in time proportional to input size).

Using distcc with Gentoo

| No Comments | No TrackBacks

In home I use a computer with a quad-core AMD Phenom processor. Outside I use a laptop with a slow dual-core AMD Athlon 64 and only 2 GiB of RAM. Both are using Gentoo GNU/Linux. Since I update software on the laptop less often then once per week, it spends much time compiling it. Today I decided to use distcc for the Phenom-based computer to do some of this work instead of the laptop.

I used the official Gentoo Distcc Documentation to configure it. In my case it is simpler then described there. On the ‘slave’ computer I just installed distcc by emerge distcc, then changed the allowed IPs to my IPs in /etc/conf.d/distccd and started the service by /etc/init.d/distccd start (then rc-update add distccd default requested it to be started at each boot).

On the laptop I did the above and also edited the list of hosts used in /etc/distcc/hosts and two parameters for Portage integration in /etc/make.conf. Both required some experimentation to have load average on both computers below their number of cores. In the first one I finally specified only the IP of the Phenom-using machine. In the second one I added distcc to FEATURES and set MAKEOPTS to -j9 which was recommended by the documentation when using 4 cores.

top shows that this works. I should also change this configuration for packages which do not support distcc.