Recently in programming Category

Comparing performance of common Unix shells

| No Comments | No TrackBacks

Probably most programs written by a user of a GNU/Linux operating system are scripts interpreted by programs inspired by the Bourne shell. Although most of their work is either interactive (so most probably faster than a human can see) or done by efficient C programs, it would be interesting to compare how the choice of shell affects the time needed to run some scripts.

Most GNU/Linux systems use Bash as their only shell. This is different on BSD derivatives like FreeBSD using ash for scripting and tcsh (a C shell derivative with largely different syntax than other shells) for interactive use.

Most shell scripts do not use specific features of any shell and need just a mostly-POSIX-compatible shell like dash (an ash derivative) or Bash. Therefore they specify /bin/sh as their interpreter, which is always such shell. In most GNU/Linux distributions Bash is used as /bin/sh, while in BSDs ash is used, and Ubuntu and Debian Squeeze use dash. Therefore many scripts using Bash-specific features declare incorrectly to be used with the default shell and fail on Ubuntu or FreeBSD.

Avoiding the above problem by testing scripts with shells having only the features required by POSIX is not the only reason to use non-Bash shells for scripting. The dash shell is faster then Bash, this is why it was proposed for Debian Lenny release to use dash as the default shell for scripts.

To check how time performance of different shells differs, I wrote several trivial scripts which can be interpreted by the popular POSIX-like shells. Two of the scripts calculate factorials using different recursive algorithms (one is the ‘standard’ definition used in mathematical textbooks, the other one is the tail-recursive one used in functional programming textbooks), another one calculates elements of the Fibonacci sequence using the recursive definition, the fourth one just calls the shell about one hundred times to check how slow is its initialization. I haven’t seen a real shell script doing such things, but the ones which I normally use depend mostly on other program performance or use Bash-specific features. Another script calculates average time spent by each script and shell combination from ten runs (one additional run of each is done before counting, since this needs loading the shell from the disk) and outputs the result in a simple to parse format.

I compared six shells available in Gentoo GNU/Linux ebuilds sys-apps/busybox-1.15.2, app-shells/bash-4.0_p35, app-shells/dash-0.5.5.1.2, app-shells/mksh-39, app-shells/pdksh-5.2.14-r4, app-shells/zsh-4.3.10. The average times in seconds on the machine which I’m using calculated by the script are:

Scriptbbdashbashzshmkshpdksh
tail-recursive factorial0.120.0830.2540.230.1220.117
standard factorial0.1060.0840.2290.2420.120.121
Fibonacci sequence1.0610.8012.1772.0631.0441.301
recursive shell invocation0.3410.2690.5151.910.3780.349

For all above tests dash is the fastest, BusyBox and Korn shell variants have similar performance, while Bash or zsh is the slowest one. Bash was two to three times slower than dash for these tests.

Of course, real scripts are something completely different. Probably everyone who wants to write functional programs knows more appropriate languages than POSIX shells. Also, extensions of many shells probably might make them faster for some scripts using them. The main reason for shell scripting is the ease of writing trivial scripts similar to commands written for daily interactive use. Therefore it is more useful to write a simple script and rewrite it in a better language when needed.

The scripts used for the above calculations are available in my Mercurial repository. The main script is licensed under the GNU General Public License, version 3 or later, while the tested scripts are public domain, since I hope that these are too unoriginal to be copyrightable.

Some limitations of popular free Web log analyzer software

| 2 Comments | No TrackBacks

It is useful for a blogger to know how their site is used. Understanding which information the users are searching for, which sites linked them to it, relationship between post’s popularity and weekday, might help making more useful content. But getting such information should not harm the users, i.e. not increase the amount of useless scripts which they must download and not waste time which the blogger might use to write useful texts or to communicate with others.

Sources of data

Most Web servers store in their access logs some data about each request, like the user’s IP address or the referring page URL. There are many formats of such data, but they all share three important things:

  • no additional work is done client-side
  • only data specified in the HTTP headers is used
  • all accesses are logged, including these from robots.

The problem with such data is that it does not specify some information known only by the user’s browser (called more formally an user agent), like screen resolution, support for JavaScript or some useless plugins. Other information coming from the user are trivially forged, malicious bots happily pretend to be real browsers coming by links from other pages.

A partial solution to this is to use JavaScript code and zero-sized images without caching to get these information from the client. But this requires more requests per page view (especially from different servers, these makes page loading much slower), and it ignores users who disable JavaScript or use browser extensions blocking such code for privacy/performance reasons.

Although there is nothing specific to free software, this situation leads to many problems with programs analyzing Web server logs.

Some uses of the data

These are several possible uses of data stored in the access logs:

  • finding which topics are popular and worth expansion
  • comparing posts with search keywords leading users to them, maybe they could be more useful for common visitors
  • blocking access for bots which do not benefit potential users and waste bandwidth
  • finding other blogs linking to the site, they might have useful information on similar topics
  • comparing effectiveness of different posting schedules
  • finding possible problems, like broken incoming links
  • determining how specific browsers or operating systems are popular among the readers

All of these might be used to make the site more useful. The programs should make it easy, but it is not as simple as it seems.

How spam makes it difficult

For most uses only data about human visitors is helpful. Only to block unfriendly bots or to correct technical problems data about bot visits is needed.

The problem is that only the useful bots want to be identified as bots. The ones which send spam, copy content to spam sites, get mail addresses to send spam, spam etc, do not want to be known – this would make it trivial to disallow their visits. So they pretend to use popular Web browsers and use many IP addresses without any clear pattern.

Many spam bots can be easily identified by using identifications of very old browsers (some of which could not access the site due to changes in the Web protocols), or by strange usage patterns like visiting only a single page referring from the same page and not getting any styles or images. They also go to URLs used by insecure Web applications and pretend to visit from certain sites in hope of getting a link to these sites (it is called referrer spam). This spam is useless in most cases, since the referrer URLs are not published on properly written sites excluding ones like password-protected log analyzer reports (with all links marked to be ignored by search engine crawlers). But it still makes the log analyzers less useful.

Problems of common log analyzers

One of the most visible things which I observed after visiting the Wikipedia list of Web log analyzers is that most of them are very old. Of the ones not using MySQL or PHP one had last release in 2004, another does not try to ignore visits by bots in statistics generated, using another one is the main inspiration for this post.

Clearly, identification of new browsers and operating systems, proper determination of queries from new (or renamed) search engines, and detection of malicious bots requires changes in software. So I believe that projects without new releases in this year do not detect new things and have problems making them less interesting to improve.

Another problem is that URLs are usually not unique for a given content, although they should. This is most common with forum software written in PHP, they use different URLs for each user. Therefore log analyzers treat each visit from a forum thread as a visit from a different page. This makes lists of referring URLs much less friendly to humans who are more interested in pages than their specific URLs.

There are probably no perfect solutions for the spam in statistics, but the programs could vastly decrease its amount by trivial measures.

Solutions

There are two methods of solving these problems – correcting an existing program or writing a new one. Since most of free software log analyzers are written in C, which is better for much different programs, or Perl, which is appropriate for much smaller programs and probably encourages committing some of their possible design mistakes, it would be difficult for me. Maybe it would be an interesting learning experience to write another faulty log analyzer?

Six tips for optimization of homework C programs

| No Comments | No TrackBacks

A nice aspect of formal computer science education is that it requires writing useless programs in a low-level programming language like C which pass automated tests with harsh time and memory limits. The following list shows several methods which I used to optimize such programs and examines their usefulness for ‘real’ programs.

1. Don’t optimize uncompleted programs

Eric S. Raymond explains this by the fact that incomplete programs are not understood completely, so a thing considered not an optimization may make the program slower when it will be completed. In case of small programs for batch processing of data there is another reason – slow but easy to understand program may generate output which will be compared to output of optimized versions.

For real programs there is another advantage – a simple prototype might be used to describe the design and allow much faster development than a fast program written in C. Once I wrote a trivial Python script to explain a limitation of standard Unix program uniq. It was so simple that in next few days I decided to implement a complete uniq-like program in C.

2. Don’t optimize fast enough programs

In case of homeworks the aim is usually to make a program passing some automated tests. If it is possible to test the program many times, it would be not useful to optimize it after all tests are passed in required time.

In real life programs are used with more than several different input files, so this argument does not apply to non-homework programs. But there the Moore’s Law makes most optimization pointless.

3. Measure before any change

It is very difficult for humans to predict performance effect of a change in a program. For example, C programs with a loop consisting of billions of pre- or postincrements of a variable saved to another variable produce on my machine assembly code differing only by the order of instructions in the loop. It looks obvious that these programs would work for the same amount of time, but the one with preincrements was faster by about 25% when using an AMD Phenom processor. In this case the correct solution was to enable compiler optimizations which removed the whole loop, but in many situations it is better to make small changes and compare their effect on the program performance.

In real Unix programs profilers are used for this, but for programs containing only one function it is appropriate to just measure the time spent running the whole program on a large input.

If a typical input takes one microsecond to process, then it might be necessary to write a script making much larger inputs. Since they would be too large to have output correctness verified by human, an unoptimized ‘correct’ implementation of the program being improved is useful to generate output for regression testing.

4. Don’t use dynamic memory allocation

As explained by Joel Spolsky, malloc and similar functions use a slow algorithm to decide which memory is free. This probably could have been improved since 2001, but still is slow. Fortunately many homework problems specify limits on input size which may allow using statically allocated arrays for everything.

In real programs malloc is often avoided by ignoring characters after the 2047 column of a line or by crashing the program when they occur. An advantage of many GNU programs is that they do not limit input length and dynamically allocate input buffers. Therefore this tip is mostly useless for non-homework problems.

5. Don’t copy memory

CPUs are thousands of times faster than memory. Therefore operations on large amounts of memory become the performance bottleneck of programs without significant I/O.

Performance of a program can be improved by using a single buffer for e.g. an input line, instead of copying it many times. In some cases code reading the input might do some calculations on it which will lead to storing much less data in memory. This might be a useful improvement in case of programs performing O(n) algorithms on these data.

6. Write small and simple programs

Since memory is slow, modern CPUs store some of it in faster and more expensive caches. Therefore smaller code might be stored completely in cache and avoid slow memory access, as explained by Raymond.

This is done mainly by removing optimizations designed for older processors. A nice example of this is the removal of many uses of Duff’s device from the XFree86 X11 server. As stated by Theodore Ts’o, ‘by eliminating all instances of Duff's Device from the XFree86 4.0 server, the server shrunk in size by _half_ _a_ _megabyte_ (!!!), and was faster to boot’.

Finding duplicate files using GNU uniq

| No Comments | No TrackBacks

The standard POSIX program uniq may list repeated lines from its output. Its implementation in GNU Coreutils supports listing all occurrences of such lines. How could this be used to list files of the same content located in a given directory?

The solution is to input to uniq with appropriate options a file consisting of lines isomorphic to the contents of files to be compared. Since uniq compares only adjacent lines, this file would have to be sorted.

Therefore a one-to-one function from files to lines should be used. Cryptographic hash functions are treated as such, although they aren’t – they accept any finite byte sequence as input and output a constant size byte sequence. There are many hashes which are now considered insecure enough for important systems (e.g. when it is easy to obtain reverse mappings or different inputs with the same output), but SHA-512 is currently used in such systems. GNU Coreutils have a program called sha512sum computing this hash of files given on the command line, so it can be easily used for this task.

Each output line of this program consists of 512 bit hexadecimal hash (i.e. 128 hexadecimal digits) and the file name, separated by several spaces. Clearly, it would be useful to know the names of repeated files, not only their SHA-512 hashes, so the output of sha512sum will be wholly passed to uniq.

The list of files for sha512sum can be generated by find $dir -type f where $dir is any directory. This command will output each file in this directory or its subdirs. The test -type f requests it to output only regular files, since they are the only ones with content.

The whole pipe listing names of duplicate files is:

find $dir -type f -print0 | xargs -0 sha512sum \
    | sort | uniq -w 128 -d --all-repeated=separate \
    | sed 's/^[0-9a-f]\+ \+//'

The command xargs passes its input as arguments to sha512sum. Since file names may contain spaces, the options -print0 of find and -0 of xargs will request them to separate the file names with zero bytes to avoid treating a file name with spaces as names of two files.

The arguments of uniq do things described before – -w 128 limits the comparison to first 128 bytes of each line, i.e. to the hash, -d omits unique lines from the output, and --all-repeated=separate outputs all repeated lines, separated by blank lines (of these three options only -d is required by POSIX and supported by uniqs used in BSDs). The final sed expression omits the hash from the output which is probably not useful.

Here sort is used only due to the way in which uniq works. It doesn’t look useful to have different files sorted according to their SHA-512 sums. It might be useful to have files in each duplicate group sorted alphabetically, but this probably could be done faster, since multiple sorts of small sequences are faster then a single sort of their sum (here also the output will be probably much smaller than the input – on my system running the above pipe on /usr/share/man gave only about one third of lines of the output of find /usr/share/man -type f, including extra blank lines). Programs like my ununiq find all repeated lines of an unsorted input, but it does not support the options necessary for this task.

Porting ununiq to OpenSolaris

| No Comments | No TrackBacks

In previous week I began writing ununiq – a program listing unique lines of its output, but unlike uniq using a hash table and treating also nonadjacent lines as duplicates. Today I decided to check if it will compile on an operating system not using the GNU C library – OpenSolaris.

Using virtualization software like KVM or Virtual Machine Manager it is easy to access several free operating systems at once, so (after some networking-related voodoo) I installed OpenSolaris 2009.06 on a virtual machine without any problem. After installation I found that I cannot log in to the system with empty password (I do not see any security advantage of a password on such virtual machine), so I installed it again with some trivial passwords. I called this new system Laurelin.

Then on Laurelin I installed GNU Autoconf and Automake in newest available versions, i.e. several releases older than on Gentoo. GLib also was trivial to install. Then I tried to use autoreconf, but it didn’t work, since unlike Gentoo OpenSolaris does not link aclocal-1.10, automake-1.10, etc, to files without version numbers. Most probably there are better solutions, but for this one use it was easier to just call these programs directly.

This showed the first problem – other systems do not have newest Autotools – so packages should support the elder ones or provide files generated by newer versions, like configure, Makefile.in and many others. In this case it was easier and probably better to remove unnecessary Automake options introduced in 1.11 and allow Autoconf 2.61.

Then running ./configure showed another problem – the shell found syntax errors in unexpanded M4 macros. This was caused by missing pkg-config, since my package did not include its macros and it wasn’t required by GLib in the OpenSolaris package system. The solution was to install pkg-config. Such problems make source-based operating system distributions more appropriate for programming, since they need such packages to install GLib.

The linker found the next problem, lack of the getline() function. My configure script did not check for it, since it wouldn’t do anything with this. Although this function is included in the newest POSIX, GNU C library treats it as a GNU extensions and other C libraries do not support it. The solution was to check how bad the portable alternatives are and choose one of them.

Therefore I added to the package an implementation of getline() from Gnulib. Then the package compiled successfully on both OpenSolaris and Gentoo.

A test suite checked if this package works correctly with a small subset of its possible options and only one input file. Porting software to other systems should begin with writing a more complete test suite, but this still was enough to improve some parts of the program.