Automatic cross references in LaTeX and their problems

Most non-fiction texts have numbered sections and many references to them, sometimes also stating which page is referred to. With LaTeX, the numbering of pages and sections and all such references are completely automatic, so these numbers are nearly always correct. However, the method by which this is implemented is not optimal.

To use it, write the \label{id} command in the section to be referred to, where id identifies the section; preferably it is easy to remember and not changed too often. This makes it possible to use the \ref{id} and \pageref{id} commands, which typeset the number of the section and its page number respectively. (References may also lead to the page of any text or to the number of an equation, list item, theorem, etc.; I refer to all of them as ‘sections’ in this post.)
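
A minimal sketch of these commands in use; the identifier sec:intro and the section titles are made up for this example:

```latex
\documentclass{article}
\begin{document}

\section{Introduction}
\label{sec:intro}  % hypothetical identifier, chosen for this sketch
Some introductory text.

\section{Details}
% \ref typesets the section number, \pageref the page number.
As explained in section~\ref{sec:intro} on page~\pageref{sec:intro}, \ldots

\end{document}
```

On the first run both references are typeset as ‘??’; the numbers appear once LaTeX has been run again.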

During the first run of LaTeX the text is completely typeset, using ‘??’ in place of the numbers referred to. Every \label writes its identifier, section number and page number to the .aux file. At the beginning of the second run this file is read, all references are typeset as if the values from the previous run were still correct, and new values are written to the file.
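
As a rough illustration of what ends up in the .aux file: for a hypothetical label sec:intro placed in section 1 on page 1, the classic LaTeX kernel writes a line of roughly the following form. The exact set of fields is an assumption here, since newer kernels and packages such as hyperref append further fields.

```latex
% Written to the .aux file by \label{sec:intro} (the identifier is hypothetical).
% First inner group: value typeset by \ref; second: value typeset by \pageref.
\newlabel{sec:intro}{{1}{1}}
```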

Here an important feature of TeX is seen: typesetting words, breaking paragraphs into lines, joining lines into pages, and outputting pages are done asynchronously. To know the page number of a given \label, LaTeX uses the primitive TeX command \write, which evaluates the appropriate commands only during page output. This makes it impossible to change the text depending on the current page number, so the number from the previous run must be used instead.
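
The difference between a deferred and an immediate \write can be observed with a small experiment. This is only a sketch: the counter word, the stream \demofile and the .dem extension are names invented for it, and it assumes the article class defaults, under which 2000 words do not fit on one page.

```latex
\documentclass{article}
\newcounter{word}          % hypothetical helper counter for the loop below
\newwrite\demofile         % hypothetical output stream, writes \jobname.dem
\immediate\openout\demofile=\jobname.dem
\begin{document}
% One long paragraph spanning several pages.
\loop\ifnum\value{word}<2000 \stepcounter{word}word \repeat
% Expanded now, while the paragraph is still being collected:
\immediate\write\demofile{immediate: page \thepage}%
% Expanded later, when the page containing this point is output:
\write\demofile{deferred: page \thepage}%
\par
\end{document}
```

The immediate line should record the page counter as it was before any page was output, while the deferred line should record the page on which the end of the paragraph actually lands.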

The same method is used for other things, like tables of contents, bibliographic references, indices and the correct placement of margin notes in two-sided documents by the mparhack package.

However, using multiple passes for cross references has several disadvantages. The most visible one is that the time needed to produce correct output is several times larger, although only one output file is needed and error messages are useful for only one pass. (This can be improved by using \batchmode for all runs after the first and pdfTeX’s -draftmode option for all runs before the last.) Even so, most of the work done in the passes before the last is clearly unnecessary, especially when referring only to section numbers and not to pages.

Since the same files are modified in each pass, it is difficult to use make or another generic build system optimally with LaTeX. This leads to longer processing than necessary and to uncertainty about whether the document has outdated references.

It is even possible to write a document whose references are always incorrect. The following document shows this problem:

\documentclass{minimal}

\pagestyle{empty}
\pagenumbering{roman}

\setlength{\textwidth}{8pt}
\setlength{\textheight}{10pt}
\setlength{\parindent}{0pt}

\makeatletter
\@ifundefined{pdfpagewidth}{}{%
  \pdfpagewidth=2in
  \pdfpageheight=2in
}
\makeatother

\begin{document}
\setcounter{page}{9}
\pageref{x}\hspace{0pt}i\label{x}
\end{document}

(The part with \pdfpagewidth makes it easier to see both pages at once in a PDF viewer; I have described it in a previous post.)

From the second pass on, this document oscillates between having one and two pages. When the reference leads to page x, the ‘i’ is on page ix, but when the reference leads to page ix, the ‘i’ is on page x. (Leslie Lamport states in LaTeX: A Document Preparation System that using Roman numerals may lead to this problem; I did not know of any specific example of such a document before writing the one above.)

So despite being very useful, automatic cross references in LaTeX have some disadvantages. Usually a good enough solution is to run LaTeX on a document a few more times than may be necessary, and to change the text to avoid infinite loops in this process. Could it be improved? I’ll write about some other ways to avoid these problems in a separate post.

Making PDFs of correct paper size with LaTeX

One of the nicest features of LaTeX is having a good layout by default. For many English uses the standard document classes are appropriate, and for some other languages there are good adaptations to local typographic conventions (like mwcls for Polish texts). As is necessary for international use, they support many paper sizes. Unfortunately, if a PDF is generated from such a document, viewers treat it as a Letter or A4 page even if its size is completely different.

The UK TeX FAQ explains this problem and suggests several solutions, all of which are packages or document classes using their own page layouts. Since the document classes which I use already have an appropriate layout, I decided to solve this problem differently.

As stated in the FAQ and the pdfTeX manual (page 20; the texdoc pdftex command in TeXLive shows this documentation), the \pdfpagewidth and \pdfpageheight dimensions are used to set the PDF page size. LaTeX stores the same data in \paperwidth and \paperheight (classes.pdf, page 3, as available in TeXLive). The manuals of XeTeX and LuaTeX show that these TeX implementations also support these commands, so with any of them the following code will set the appropriate page size:

\pdfpagewidth=\paperwidth
\pdfpageheight=\paperheight

This code won’t work with TeX implementations that lack these commands: where they are undefined, it produces error messages and typesets some stray text. The simplest solution is not to use these commands when they are not available, for example with the following code:

\makeatletter
\@ifundefined{pdfpagewidth}{}{%
  \pdfpagewidth=\paperwidth
  \pdfpageheight=\paperheight
}
\makeatother
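
Placed in the preamble of a complete document, the guarded version might look like the sketch below. The a5paper option and the sample sentence are chosen only for illustration; any paper size supported by the class should work the same way.

```latex
\documentclass[a5paper]{article}

\makeatletter
% Copy LaTeX's paper size to the PDF page size, if the engine supports it.
\@ifundefined{pdfpagewidth}{}{%
  \pdfpagewidth=\paperwidth
  \pdfpageheight=\paperheight
}
\makeatother

\begin{document}
A PDF viewer should report this page as A5.
\end{document}
```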

The UK TeX FAQ suggests using the ifpdf package to detect PDF support, which would be too specific here: supporting XeTeX would require another package, and it is not a problem to use the above code when making a DVI file. It would not produce DVIs with the correct page size, but setting that depends on the DVI driver, and PDF-producing TeX variants have other useful advantages anyway.

Formatting dates in LaTeX with Babel

A nice thing about LaTeX is its support for automatic date formatting on an article’s title page using the \today macro. It even supports many languages (using Babel or completely language-specific packages). Unfortunately, as the name suggests, it typesets only the current date, while all other dates must be written ‘by hand’.

This would not be an appropriate way of writing e.g. certificates of attendance for a Polish IB school, where each page contains many dates chosen by the school, written in both British English and Polish. I wanted to write each date only once, in a simple format like 2010/02/24.

The first solution which I used was a complicated package parsing this date format into separate macros for year, month and day (it also removed leading zeros), and then typesetting them using separate macros for different languages, with 12 macros per language for month names. It had at least the following problems:

  • difficult to understand or maintain macros
  • many per-language macros
  • the output was very similar to the \today macro.

Therefore I have rewritten this package today, using the \today macro after changing the ‘current’ date.

TeX stores the current date in three parameters: \year, \month and \day (The TeXbook, page 273). This trivial macro sets them from a date in the 2010/02/24/ format (note the trailing slash):

\def\@setcurrentdate#1/#2/#3/{%
  \year=#1%
  \month=#2%
  \day=#3%
  \relax
}

The interface of this macro is strange, but it looks like a simple way of separating the parts of the input date. The date-formatting macro expands its second argument and adds the trailing slash (so that the whole day is used, instead of only its first digit) with \expandafter\@setcurrentdate#2/.

Babel provides at least two macros for changing languages: \foreignlanguage and \selectlanguage. The first one has a more appropriate interface, but does not change the date format (as stated on page 6 of the Babel manual), so I used the second one.

The complete date-formatting macro takes two arguments: the language name as used by Babel and the date in the slash-separated format. The macro is:

\newcommand{\Date}[2]{{%
    \selectlanguage{#1}%
    \expandafter\@setcurrentdate#2/%
    \today
  }}

(The name is capitalized, since I use it only in some document classes.) I’m not sure if specifying the language for each date is useful (dates are usually in the same language as the surrounding text), but this could be trivially removed in more generally useful packages.
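
For completeness, here is a sketch of both macros used in a full document. The choice of the british and polish Babel options, the T1 font encoding and the sample sentences are assumptions made for this example; in a package or class file the \makeatletter and \makeatother lines are not needed.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[polish,british]{babel}

\makeatletter
% Set TeX's date parameters from a date written as YYYY/MM/DD/.
\def\@setcurrentdate#1/#2/#3/{%
  \year=#1%
  \month=#2%
  \day=#3%
  \relax
}
% #1: Babel language name, #2: date written as YYYY/MM/DD.
% The extra group keeps the changed date and language local.
\newcommand{\Date}[2]{{%
    \selectlanguage{#1}%
    \expandafter\@setcurrentdate#2/%
    \today
  }}
\makeatother

\begin{document}
Issued on \Date{british}{2010/02/24} (\Date{polish}{2010/02/24}).

Today is still \today.
\end{document}
```

Because the assignments to \year, \month and \day are local to the group inside \Date, the last line should still show the real current date.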

I haven’t seen this way of extending LaTeX macros in other code, but it looks like a useful advantage of having mutable parameters for ‘constants’ like the current time.

Using a blog software without server-side scripting

The software previously used on this blog (Zine) keeps the text in a relational database (in this case PostgreSQL) and on each request formats a page using this data and some Python code. Most popular blog software uses exactly the same paradigm, although most of it is not written in Python and much of it supports only MySQL as the database.

The problem with this solution is that the same pages are generated many times, and much simpler software could be used for this. On a typical blog, which is updated much less often than it is viewed, exactly identical pages are generated multiple times, using many times more resources than necessary. Therefore hosting of typical blogs could be much simpler and cheaper than is possible with such technology.

Therefore very large sites (or sites using very large software) use separate caching servers, like Varnish. Such programs get a page from the original server and keep it for some time, serving it much faster for subsequent requests without regenerating the page. This solution still does not work well for sites whose pages differ for every user (so the cache is usually skipped for 'non-anonymous' users), and it is difficult to avoid serving outdated pages from the cache (at least while the cache is used for unchanged pages). (Another problem is that one more daemon must be running on the server, although this allows friendly 503 HTTP errors when one of the daemons serving the page does not work.)

All problems of such caching servers could be avoided by correctly generating static files for each page whenever something is changed. If the files are generated on a different machine than the server, then the generator could be written in completely different, maybe better, ways than software that has to run on the server. The pages would be uploaded to the server, and a simpler HTTP server would send them much faster than with any other solution.

This solution would clearly require all uploaded files to be generated by a single user with access to the whole site. So there won't be any multiuser features, there will be no search, and the pages will not depend on the current time. (The current time is used for relative, friendly dates in texts like 'written five hours ago'; this could easily be done client-side using JavaScript.) Still, these features look uncommon on a typical blog written by a single user.

The problem is that the pages of a simple blog usually depend not only on their own content and other posts but also on comments posted by users. A server-side script is necessary to receive the comments, but this isn't a problem, since this is exactly what such scripts are for. There are two possible things to do with the comments received by the script: adding them to the post page, or putting them in a private place from which the user would move them to be published on the next update. The first solution requires making the page 'less static', but the server-side code would still be much simpler than usual. The second solution is also useful because of the useless spam (ignored by readers, search engines and writers alike) sent as comments by malevolent bots.

Therefore such 'offline' blog software would work well enough for small sites. It would also be able to do things which are too slow to be done by advanced server-side scripts, for example checking whether the example source code in a post about the C programming language can be compiled (and maybe even run, generating the output shown in the post). Such software improvements could make writing higher quality posts easier. Also, as a locally used program, it could be more user-friendly than Web-based solutions (it could e.g. allow using standard Unix tools to correct a typo in many posts, or using more helpful editors than those available in a browser).

Maybe it is worth writing such a program (or finding and using an existing one). Certainly, converting an existing blog to use such software would not be trivial (e.g. to avoid duplicate entries in feed readers and to import all useful data like comments), but the benefits might outweigh that.

Comparing performance of common Unix shells

Probably most programs written by a user of a GNU/Linux operating system are scripts interpreted by programs inspired by the Bourne shell. Although most of their work is either interactive (so most probably faster than a human can see) or done by efficient C programs, it would be interesting to compare how the choice of shell affects the time needed to run some scripts.

Most GNU/Linux systems use Bash as their only shell. This is different on BSD derivatives like FreeBSD, which uses ash for scripting and tcsh (a C shell derivative with largely different syntax than other shells) for interactive use.

Most shell scripts do not use specific features of any shell and need just a mostly-POSIX-compatible shell like dash (an ash derivative) or Bash. Therefore they specify /bin/sh as their interpreter, which is always such a shell. In most GNU/Linux distributions Bash is used as /bin/sh, while in BSDs ash is used, and Ubuntu and Debian Squeeze use dash. As a result, many scripts using Bash-specific features incorrectly declare that they work with the default shell and fail on Ubuntu or FreeBSD.

Avoiding the above problem by testing scripts with shells having only the features required by POSIX is not the only reason to use non-Bash shells for scripting. The dash shell is faster than Bash, which is why it was proposed to use dash as the default shell for scripts in the Debian Lenny release.

To check how the time performance of different shells differs, I wrote several trivial scripts which can be interpreted by the popular POSIX-like shells. Two of the scripts calculate factorials using different recursive algorithms (one is the ‘standard’ definition used in mathematical textbooks, the other is the tail-recursive one used in functional programming textbooks), another calculates elements of the Fibonacci sequence using the recursive definition, and the fourth just calls the shell about one hundred times to check how slow its initialization is. I haven’t seen a real shell script doing such things, but the ones which I normally use depend mostly on the performance of other programs or use Bash-specific features. Another script calculates the average time spent by each script and shell combination over ten runs (one additional run of each is done before counting, since the first run needs to load the shell from the disk) and outputs the result in a simple-to-parse format.

I compared six shells available in Gentoo GNU/Linux from the ebuilds sys-apps/busybox-1.15.2, app-shells/bash-4.0_p35, app-shells/dash-0.5.5.1.2, app-shells/mksh-39, app-shells/pdksh-5.2.14-r4 and app-shells/zsh-4.3.10. The average times in seconds, as calculated by the script on the machine which I’m using, are:

Script                      busybox  dash   bash   zsh    mksh   pdksh
tail-recursive factorial    0.12     0.083  0.254  0.23   0.122  0.117
standard factorial          0.106    0.084  0.229  0.242  0.12   0.121
Fibonacci sequence          1.061    0.801  2.177  2.063  1.044  1.301
recursive shell invocation  0.341    0.269  0.515  1.91   0.378  0.349

In all of the above tests dash is the fastest, BusyBox and the Korn shell variants have similar performance, and either Bash or zsh is the slowest. Bash was about two to three times slower than dash in these tests.

Of course, real scripts are something completely different. Probably everyone who wants to write functional programs knows more appropriate languages than POSIX shells. Also, the extensions of many shells might make them faster for scripts that use them. The main reason for shell scripting is the ease of writing trivial scripts similar to the commands written for daily interactive use. Therefore it is more useful to write a simple script first and rewrite it in a better language when needed.

The scripts used for the above calculations are available in my Mercurial repository. The main script is licensed under the GNU General Public License, version 3 or later, while the tested scripts are public domain, since I hope that these are too unoriginal to be copyrightable.