November 2009 Archives

Comparing performance of common Unix shells


Most programs written by users of GNU/Linux operating systems are probably scripts interpreted by programs inspired by the Bourne shell. Although most of a shell's work is either interactive (so most probably faster than a human can notice) or done by efficient C programs, it is interesting to compare how the choice of shell affects the time needed to run some scripts.

Most GNU/Linux systems use Bash as their only shell. BSD derivatives are different: FreeBSD, for example, uses ash for scripting and tcsh (a C shell derivative whose syntax differs considerably from that of other shells) for interactive use.

Most shell scripts do not use specific features of any shell and need just a mostly-POSIX-compatible shell like dash (an ash derivative) or Bash. Therefore they specify /bin/sh as their interpreter, which is always such a shell. In most GNU/Linux distributions Bash is used as /bin/sh, while the BSDs use ash, and Ubuntu and Debian Squeeze use dash. As a result, many scripts that use Bash-specific features incorrectly declare /bin/sh as their interpreter and fail on Ubuntu or FreeBSD.

Avoiding the above problem by testing scripts with shells that have only the features required by POSIX is not the only reason to use non-Bash shells for scripting. The dash shell is faster than Bash, which is why it was proposed as the default shell for scripts in the Debian Lenny release.

To check how the performance of different shells differs, I wrote several trivial scripts which can be interpreted by the popular POSIX-like shells. Two of the scripts calculate factorials using different recursive algorithms (one is the ‘standard’ definition used in mathematical textbooks, the other is the tail-recursive one used in functional programming textbooks), a third one calculates elements of the Fibonacci sequence using the recursive definition, and the fourth one just calls the shell about one hundred times to check how slow its initialization is. I haven’t seen a real shell script doing such things, but the ones which I normally use depend mostly on the performance of other programs or use Bash-specific features. Another script calculates the average time spent by each script and shell combination over ten runs (one additional run of each is done before counting, since this requires loading the shell from the disk) and outputs the result in an easy-to-parse format.
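The measurement procedure described above (one uncounted warm-up run, then the mean of ten timed runs) can be sketched as follows. This is not the script from the repository; it is a minimal Python illustration, and the names `average_runtime` and `shell_runner` are hypothetical.

```python
import subprocess
import time


def average_runtime(run, runs=10, warmups=1, clock=time.perf_counter):
    """Return the mean wall-clock time of run() over `runs` timed calls.

    The `warmups` untimed calls come first, so that loading the shell
    from the disk is not counted.
    """
    for _ in range(warmups):
        run()
    total = 0.0
    for _ in range(runs):
        start = clock()
        run()
        total += clock() - start
    return total / runs


def shell_runner(shell, script):
    """Build a callable that runs the file `script` with the program `shell`."""
    def run():
        # Discard the script's output; fail loudly if the shell reports an error.
        subprocess.run([shell, script], check=True, stdout=subprocess.DEVNULL)
    return run
```

Calling `average_runtime(shell_runner("dash", "factorial.sh"))` would then give the average for one shell and script combination, where `factorial.sh` stands for any of the tested scripts. Passing the clock as a parameter is only a convenience that makes the function easy to check without running a real shell.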

I compared six shells available in Gentoo GNU/Linux, from the ebuilds sys-apps/busybox-1.15.2, app-shells/bash-4.0_p35, app-shells/dash-0.5.5.1.2, app-shells/mksh-39, app-shells/pdksh-5.2.14-r4 and app-shells/zsh-4.3.10. The average times in seconds, calculated by the script on the machine which I’m using, are:

Script                       bb      dash    bash    zsh     mksh    pdksh
tail-recursive factorial     0.12    0.083   0.254   0.23    0.122   0.117
standard factorial           0.106   0.084   0.229   0.242   0.12    0.121
Fibonacci sequence           1.061   0.801   2.177   2.063   1.044   1.301
recursive shell invocation   0.341   0.269   0.515   1.91    0.378   0.349

In all of the above tests dash is the fastest, BusyBox and the Korn shell variants have similar performance, and either Bash or zsh is the slowest. Bash was roughly two to three times slower than dash in these tests.
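The slowdown relative to dash can be recomputed directly from the table. The sketch below copies the measured averages into a dictionary; the function name `slowdown` is only illustrative.

```python
# Average times in seconds, copied from the table above.
TIMES = {
    "tail-recursive factorial": {
        "bb": 0.12, "dash": 0.083, "bash": 0.254,
        "zsh": 0.23, "mksh": 0.122, "pdksh": 0.117,
    },
    "standard factorial": {
        "bb": 0.106, "dash": 0.084, "bash": 0.229,
        "zsh": 0.242, "mksh": 0.12, "pdksh": 0.121,
    },
    "Fibonacci sequence": {
        "bb": 1.061, "dash": 0.801, "bash": 2.177,
        "zsh": 2.063, "mksh": 1.044, "pdksh": 1.301,
    },
    "recursive shell invocation": {
        "bb": 0.341, "dash": 0.269, "bash": 0.515,
        "zsh": 1.91, "mksh": 0.378, "pdksh": 0.349,
    },
}


def slowdown(times, shell, baseline="dash"):
    """Return, per script, how many times slower `shell` is than `baseline`."""
    return {script: row[shell] / row[baseline] for script, row in times.items()}


for script, ratio in slowdown(TIMES, "bash").items():
    print(f"{script}: {ratio:.2f}")
# tail-recursive factorial: 3.06
# standard factorial: 2.73
# Fibonacci sequence: 2.72
# recursive shell invocation: 1.91
```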

Of course, real scripts are something completely different. Probably everyone who wants to write functional programs knows more appropriate languages than POSIX shells. Also, the extensions offered by many shells might make scripts that use them faster. The main reason for shell scripting is the ease of writing trivial scripts similar to the commands typed in daily interactive use. Therefore it is more useful to write a simple script first and rewrite it in a better language when needed.

The scripts used for the above calculations are available in my Mercurial repository. The main script is licensed under the GNU General Public License, version 3 or later, while the tested scripts are in the public domain, since I hope that they are too unoriginal to be copyrightable.

Some limitations of popular free Web log analyzer software


It is useful for a blogger to know how their site is used. Understanding which information the users are searching for, which sites linked them to it, and how a post’s popularity relates to the weekday might help in making more useful content. But getting such information should not harm the users, i.e. it should not increase the number of useless scripts which they must download, and it should not waste time which the blogger might use to write useful texts or to communicate with others.

Sources of data

Most Web servers store in their access logs some data about each request, like the user’s IP address or the referring page URL. There are many formats of such data, but they all share three important things:

  • no additional work is done client-side
  • only data specified in the HTTP headers is used
  • all accesses are logged, including those from robots.
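As an illustration of the kind of data available, the sketch below parses one line in the Apache "combined" log format. The choice of format is an assumption (the post does not name one), and the sample line, its addresses and the name `parse_line` are made up for the example.

```python
import re

# Combined format: host ident user [time] "request" status size "referrer" "agent"
COMBINED = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# A made-up sample line using documentation-only addresses.
SAMPLE = (
    '192.0.2.1 - - [10/Nov/2009:13:55:36 +0100] "GET /blog/ HTTP/1.1" '
    '200 2326 "http://example.org/" "Mozilla/5.0"'
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None


fields = parse_line(SAMPLE)
print(fields["host"], fields["referrer"], fields["agent"])
```

Every field here comes from the request itself, so the referrer and the user agent are only what the client claims. The simple pattern also does not handle quoted fields that contain escaped quotation marks.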

The problem with such data is that it lacks some information known only to the user’s browser (more formally called a user agent), like screen resolution, support for JavaScript or some useless plugins. The other information coming from the user is trivially forged: malicious bots happily pretend to be real browsers arriving via links from other pages.

A partial solution to this is to use JavaScript code and zero-sized images without caching to get this information from the client. But this requires more requests per page view (especially to different servers, which makes page loading much slower), and it ignores users who disable JavaScript or use browser extensions that block such code for privacy or performance reasons.

Although none of this is specific to free software, this situation leads to many problems with programs analyzing Web server logs.

Some uses of the data

Here are several possible uses of the data stored in the access logs:

  • finding which topics are popular and worth expanding
  • comparing posts with the search keywords that lead users to them; maybe the posts could be made more useful for such visitors
  • blocking access for bots which do not benefit potential users and waste bandwidth
  • finding other blogs linking to the site; they might have useful information on similar topics
  • comparing the effectiveness of different posting schedules
  • finding possible problems, like broken incoming links
  • determining how popular specific browsers or operating systems are among the readers

All of these might be used to make the site more useful. The programs should make it easy, but it is not as simple as it seems.

How spam makes it difficult

For most uses only data about human visitors is helpful. Data about bot visits is needed only to block unfriendly bots or to correct technical problems.

The problem is that only the useful bots want to be identified as bots. The ones which send spam, copy content to spam sites, harvest mail addresses for sending spam, and so on, do not want to be known, since this would make it trivial to disallow their visits. So they pretend to use popular Web browsers and use many IP addresses without any clear pattern.

Many spam bots can be easily identified because they send identifications of very old browsers (some of which could not access the site due to changes in the Web protocols), or because of strange usage patterns, like visiting only a single page, giving that same page as the referrer, and not getting any styles or images. They also request URLs used by insecure Web applications and pretend to arrive from certain sites in the hope of getting a link to these sites (this is called referrer spam). This spam is useless in most cases, since properly written sites do not publish referrer URLs, except in places like password-protected log analyzer reports (with all links marked to be ignored by search engine crawlers). But it still makes the log analyzers less useful.
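One of the usage patterns mentioned above (a client that requests exactly one page and no styles or images) can be checked with a few lines of Python. This is only a rough heuristic sketch: the suffix list and the name `suspicious_hosts` are assumptions, and text-mode browsers or clients with cached assets would match too, so the result is a hint rather than proof.

```python
from collections import Counter

# File suffixes treated as styles, scripts or images (an assumed list).
ASSET_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".ico")


def suspicious_hosts(requests):
    """Return hosts that fetched exactly one page and no assets.

    `requests` is an iterable of (host, path) pairs taken from the log.
    """
    pages = Counter()
    assets = Counter()
    for host, path in requests:
        # Ignore any query string when looking at the file suffix.
        bare_path = path.split("?", 1)[0].lower()
        if bare_path.endswith(ASSET_SUFFIXES):
            assets[host] += 1
        else:
            pages[host] += 1
    return {host for host, count in pages.items()
            if count == 1 and assets[host] == 0}
```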

Problems of common log analyzers

One of the most visible things which I observed after visiting the Wikipedia list of Web log analyzers is that most of them are very old. Of the ones not using MySQL or PHP, one had its last release in 2004, another does not try to exclude visits by bots from the statistics it generates, and using yet another one is the main inspiration for this post.

Clearly, identification of new browsers and operating systems, proper determination of queries from new (or renamed) search engines, and detection of malicious bots require changes in the software. So I believe that projects without new releases this year do not detect new things and have problems which make them less interesting to improve.

Another problem is that URLs are usually not unique for a given content, although they should be. This is most common with forum software written in PHP, which uses different URLs for each user. Therefore log analyzers treat each visit from a forum thread as a visit from a different page. This makes lists of referring URLs much less friendly to humans, who are more interested in pages than in their specific URLs.
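One simple measure against this is to strip per-user parameters from referrer URLs before counting them. The sketch below assumes that the per-user part is a session identifier in the query string; the parameter names in `SESSION_PARAMS` and the function name `normalize_referrer` are hypothetical and would need adjusting for real forum software.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query parameters assumed to carry per-user session identifiers.
SESSION_PARAMS = {"sid", "phpsessid"}


def normalize_referrer(url):
    """Return `url` without session parameters and without a fragment."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(query), "")
    )


print(normalize_referrer("http://forum.example/viewtopic.php?t=42&sid=abc123"))
# http://forum.example/viewtopic.php?t=42
```

With this normalization, visits from the same thread by different forum users are counted as coming from one referring page.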

There are probably no perfect solutions for the spam in statistics, but the programs could vastly decrease its amount by trivial measures.

Solutions

There are two methods of solving these problems: correcting an existing program or writing a new one. Since most free software log analyzers are written either in C, which is better suited to very different programs, or in Perl, which is appropriate for much smaller programs and probably encourages some of their possible design mistakes, correcting one would be difficult for me. Maybe it would be an interesting learning experience to write another faulty log analyzer?
