Recently in LaTeX Category

Automatic cross references in LaTeX and their problems

Most non-fiction texts have numbered sections and many references to them, sometimes also stating the page referred to. With LaTeX, the numbering of pages and sections and all such references is completely automatic, so these numbers are nearly always correct. However, the method by which this is implemented is not optimal.

To use it, write the \label{id} command in the section to be referred to, where id identifies the section; preferably it is easy to remember and does not change too often. This makes it possible to use the \ref{id} and \pageref{id} commands, which typeset the number of the section or its page number. (References may lead to the page of any text or to the number of an equation, list item, theorem, etc.; I refer to all of them as ‘sections’ in this post.)
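As a minimal sketch of these three commands working together (the identifier sec:intro and the section titles are only illustrative), a document could look like this:

```latex
\documentclass{article}
\begin{document}

\section{Introduction}\label{sec:intro}  % the label records section and page
Some introductory text.

\section{Details}
% \ref typesets the section number, \pageref the page number
See Section~\ref{sec:intro} on page~\pageref{sec:intro}.

\end{document}
```

After the first run both references appear as ‘??’; a second run fills in the numbers.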

During the first run of LaTeX the text is completely typeset, using ‘??’ in place of the numbers referred to. Every \label writes its identifier, section number and page number to the .aux file. At the beginning of the second run this file is read, all references are typeset with the values that were correct on the previous run, and new values are written to the file.
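For illustration, assume a document whose first section, on page 1, contains \label{sec:intro} (a hypothetical identifier). With the standard kernel and no packages such as hyperref, the .aux file would then typically contain a line like the following; the exact number of fields varies with the LaTeX version and the loaded packages, but the first two groups hold the section number and the page number:

```latex
\relax
\newlabel{sec:intro}{{1}{1}}
```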

Here an important feature of TeX is seen – typesetting words, breaking paragraphs into lines, joining lines into pages, and outputting pages are done asynchronously. To know the page number of a given \label, LaTeX uses the primitive TeX command \write, which expands its argument only when the page is output. Because of this, text cannot be changed depending on the current page number, so the number from the previous run must be used instead.
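The difference between an immediate and a deferred \write can be tried with a small sketch like the one below. The stream name \pagelog and the file extension .pages are assumptions made for this example, not anything LaTeX itself uses:

```latex
\documentclass{article}
\newwrite\pagelog                          % hypothetical output stream
\immediate\openout\pagelog=\jobname.pages  % hypothetical log file
\begin{document}
Some text.%
\immediate\write\pagelog{immediate: \thepage}% expanded while reading the text
\write\pagelog{deferred: \thepage}%            expanded when the page is output
\end{document}
```

In such a short document both lines record the same page. They can differ only when the text ends up on a later page than the one being built while it was read, which is exactly the case that \label has to handle.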

The same method is used for other things, such as tables of contents, bibliographic references, indices, and the correct placement of margin notes in two-sided documents by the mparhack package.

However, using multiple passes for cross references has several disadvantages. The most visible one is that producing correct output takes several times longer, although only one output file is needed and error messages are useful for only one pass (this can be improved by using \batchmode for all runs after the first and pdfTeX’s -draftmode option for all runs before the last). Even so, most of the work done in the passes before the last one is clearly unnecessary (especially when referring only to section numbers, not pages).

Since the same files are modified in each pass, it is difficult to use make or another generic build system optimally with LaTeX. This leads to longer processing than necessary and to uncertainty about whether the document has outdated references.

It is even possible to make a document whose references are always incorrect. The following document shows this problem:

\documentclass{minimal}

\pagestyle{empty}
\pagenumbering{roman}

\setlength{\textwidth}{8pt}
\setlength{\textheight}{10pt}
\setlength{\parindent}{0pt}

\makeatletter
\@ifundefined{pdfpagewidth}{}{%
  \pdfpagewidth=2in
  \pdfpageheight=2in
}
\makeatother

\begin{document}
\setcounter{page}{9}
\pageref{x}\hspace{0pt}i\label{x}
\end{document}

(The part with \pdfpagewidth makes it easier to see both pages at once in a PDF viewer; I have described it in a previous post.)

From the second pass onwards, this document oscillates between having one and two pages. When the reference reads ‘x’, the ‘i’ is on page ix, but when the reference reads ‘ix’, the ‘i’ is on page x. (Leslie Lamport states in LaTeX: A Document Preparation System that using Roman numerals may lead to this problem; I did not know any specific example of such a document before writing the one above.)

So despite being very useful, automatic cross references in LaTeX have some disadvantages. Usually a good enough solution is to run LaTeX on a document a few more times than may be necessary, and to change the text to avoid infinite loops in this process. Could it be improved? I’ll write about some other ways to avoid these problems in a separate post.

Making PDFs of correct paper size with LaTeX

One of the nicest features of LaTeX is having a good layout by default. For many English uses the standard document classes are appropriate, and for some other languages there are good adaptations to local typographic conventions (like mwcls for Polish texts). As necessary for international use, they support many paper sizes. Unfortunately, if a PDF is generated from such a document, viewers treat it as a Letter or A4 page even if its size is completely different.

The UK TeX FAQ explains this problem and suggests several solutions – all of which are packages or document classes using their own page layouts. Since the document classes which I use already have appropriate layout, I decided to solve this problem differently.

As stated in the FAQ and the pdfTeX manual (page 20, the texdoc pdftex command in TeXLive shows this documentation), the \pdfpagewidth and \pdfpageheight dimensions are used to set the PDF page size. LaTeX stores the same data in \paperwidth and \paperheight (classes.pdf page 3 as available in TeXLive). Manuals of XeTeX and LuaTeX show that these TeX implementations also support these commands, so using any of them the following code will set the appropriate page size:

\pdfpagewidth=\paperwidth
\pdfpageheight=\paperheight

This code won’t work with TeX implementations without these commands. When these commands are undefined it will produce some error messages and output some text. The simplest solution is to not use these commands when they are not available, for example with the following code:

\makeatletter
\@ifundefined{pdfpagewidth}{}{%
  \pdfpagewidth=\paperwidth
  \pdfpageheight=\paperheight
}
\makeatother

The UK TeX FAQ suggests using the ifpdf package to detect PDF support, which would be too specific here: support for XeTeX would require another package, and it is not a problem to use the above code when making a DVI file. It would not make DVIs with the correct page size, but setting that would depend on the DVI driver, and PDF-producing TeX variants have other useful advantages.
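Put together, a complete test document might look like the sketch below. The a5paper option and the body text are only illustrative, and the sketch assumes an engine that defines \pdfpagewidth (such as pdfTeX); on other engines the guard simply skips the assignments:

```latex
\documentclass[a5paper]{article}  % any class option that sets \paperwidth

\makeatletter
% Copy the LaTeX paper size to the PDF page size, if the engine allows it.
\@ifundefined{pdfpagewidth}{}{%
  \pdfpagewidth=\paperwidth
  \pdfpageheight=\paperheight
}
\makeatother

\begin{document}
A PDF viewer should now report this page as A5.
\end{document}
```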

Formatting dates in LaTeX with Babel

A nice thing about LaTeX is its support for automatic date formatting on articles’ title pages using the \today macro. It even supports many languages (using Babel or completely language-specific packages). Unfortunately, as the name suggests, it typesets only the current date, while all other dates must be written ‘by hand’.

This would not be an appropriate way of writing e.g. certificates of attendance for a Polish IB school, where each page contains many dates chosen by the school, written in both British English and Polish. I wanted to write each date only once, in a simple format like 2010/02/24.

The first solution which I used was a complicated package parsing this date format into separate macros for year, month and day (it also removed leading zeros), and then typesetting them using separate macros for different languages, with 12 macros per language for month names. It had at least the following problems:

  • difficult to understand or maintain macros
  • many per-language macros
  • the output was very similar to the \today macro.

Therefore I have rewritten this package today, using the \today macro after changing the ‘current’ date.

TeX stores the current date in three parameters: \year, \month and \day (The TeXbook, page 273). This trivial macro sets them from a date formatted as 2010/02/24/:

\def\@setcurrentdate#1/#2/#3/{%
  \year=#1%
  \month=#2%
  \day=#3%
  \relax
}

The interface of this macro is strange, but it is a simple way of separating the parts of the input date. The date-formatting macro expands its second argument and adds the trailing slash (so that the whole day is used, instead of only its first digit) by calling \expandafter\@setcurrentdate#2/.

Babel provides at least two macros for changing languages: \foreignlanguage and \selectlanguage. The first one has a more appropriate interface, but does not change the date format (as stated on page 6 of the Babel manual), so I used the second one.

The whole macro formatting the date has two arguments – language name as used by Babel and the date in slash-separated format. The macro is:

\newcommand{\Date}[2]{{%
    \selectlanguage{#1}%
    \expandafter\@setcurrentdate#2/%
    \today
  }}

(The name is capitalized, since I use it only in some document classes.) I’m not sure if specifying the language for each date is useful (dates are usually in the same language as the surrounding text), but this could be trivially removed in a package meant for general use.
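A complete document using both macros could look like the following sketch. The choice of the british and polish Babel options and of T1 font encoding is an assumption made for this example; the two macro definitions are the ones given above:

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[polish,british]{babel}

\makeatletter
% Set TeX's date parameters from a year/month/day/ string.
\def\@setcurrentdate#1/#2/#3/{%
  \year=#1%
  \month=#2%
  \day=#3%
  \relax
}
% #1: Babel language name, #2: date as year/month/day.
% The extra braces keep the language and date changes local.
\newcommand{\Date}[2]{{%
    \selectlanguage{#1}%
    \expandafter\@setcurrentdate#2/%
    \today
  }}
\makeatother

\begin{document}
Issued on \Date{british}{2010/02/24} (\Date{polish}{2010/02/24}).

Compiled on \today. % the real current date is untouched
\end{document}
```

Because the assignments happen inside a group, \today outside \Date still typesets the real current date.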

I haven’t seen this way of extending LaTeX macros in other code, but it looks like a useful advantage of having mutable parameters for ‘constants’ like the current time.

Difficulties of typesetting quote marks in LaTeX

Quote marks are probably the most complicated punctuation marks to typeset in English. Although they should be used for short and simple quotations and other simple fragments of text, they are designed for more arcane uses. This, combined with the influence of typewriters, makes typesetting them difficult.

Quote marks are used exactly like parentheses – they delimit a fragment of a sentence. But unlike all other such characters, inner quotation marks are different symbols from the outer ones (unless the larger outer delimiters in mathematical formulas count as different symbols; they are the most ‘mainstream’ use of parentheses within parentheses). Another difference is that ‘((’ is easily interpreted correctly, while ‘“ needs additional spacing (‘ “).

Another problem is that each language has different quote marks. American English uses double outer quotes and single inner quotes, while British English uses them the other way round. Polish has a low double opening quote and an English double closing quote; its inner quotes are the French ones, although they are rarely used correctly. American English also moves following commas and periods inside the quotes, which obviously leads to problems in programming-related texts.

LaTeX does not solve these problems, but allows direct specification of the appropriate symbols. The English quote marks are represented as ``, '', ` and ', since these are the nearest equivalents on a typical keyboard. The " character is not used, and the space between adjacent quotes must be specified as \, (it could be specified in the font, but this would require separate sets of fonts for American and British texts, and I’m not sure if it could support third-level nested quotes in either of these dialects).
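As a small illustration of this input convention (the quoted sentence is made up), nested quotes in the two English styles can be typed as follows, with \, separating adjacent closing marks:

```latex
\documentclass{article}
\begin{document}

% American style: double outer quotes, single inner quotes.
``She said, `Hello.'\,''

% British style: single outer quotes, double inner quotes.
`She said, ``Hello''\,'.

\end{document}
```

Without the \, the three consecutive ' characters would be read as a double quote followed by a single one, whichever order was intended.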

It would be interesting to use just the " character and let the software decide which quotes are opening and which are closing. But even without support for nested quotations this would be difficult (if at all possible) to do correctly in all cases. A naïve algorithm would just begin with an opening quote and then alternate between closing and opening ones. But this won’t correctly interpret quotes in multi-paragraph dialogue, where each paragraph begins with an opening quote (in Polish, dashes are used instead of quotes for dialogue, and humans cannot interpret multi-paragraph dialogue correctly without backtracking). A common mistake in delimiting block quotations with quote marks may result in a paragraph containing only a closing quote mark, so this algorithm cannot be improved by just resetting to an opening quote at each new paragraph.

Emacs uses a different algorithm in the TeX-insert-quote function. It puts opening quotes after whitespace or an opening parenthesis. This method could not be implemented in LaTeX, but it can be done in language-specific fonts. However, the algorithm fails when quoting spaces or parentheses, like ‘(’, which is common in programming-related texts.

The only problem which can be easily solved is which quotes to use. I have written a LaTeX package for this, named quoted (available in my Mercurial repository), but it does not support spaces between quote marks of different levels or moving punctuation to the quotation. There are probably many better packages for this, but this will not make a useful document ‘portable’ between e.g. British and American dialects of English, so such packages aren’t very useful.

Since parentheses are similar to quotes, but simpler, maybe a single character in source files could be used for them. In times of typewriters a slash was sometimes used instead of parentheses, since it looks similar. Is it possible to implement a LaTeX macro or virtual font replacing / by a slash or appropriate parenthesis depending on context?

Making dashes from hyphens in LaTeX

Probably many users of LaTeX (including me) learned from texts about LaTeX that dashes and hyphens look different. Many people, encouraged by keyboards limited to ASCII plus some national and unused characters, write only hyphens, with various spacing around them, instead of dashes. Could a LaTeX user just include such text in their document and have hyphens and dashes correctly distinguished in the output? This post describes an attempt in this direction.

LaTeX already uses the ASCII hyphen character for hyphens, minus signs and dashes. In math mode it is a minus sign. Otherwise, - becomes a hyphen, -- an en dash and --- an em dash. The difference between en dashes and em dashes lies only in their appearance; different languages require different ones with different spacing. This is the reason why I wrote the onedash package, which provides a single command, \dash, for typesetting the correct dash for the language and style of the document.
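The standard input conventions described above can be seen side by side in a short example (the sentence itself is made up):

```latex
\documentclass{article}
\begin{document}
A well-known fact (hyphen), pages 69--105 (en dash),
an aside---like this one---(em dash), and $-1$ (minus sign).
\end{document}
```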

My new package, hyphdash, makes a dash or a hyphen, with correct spacing, from a single hyphen. It, onedash and quoted (an equivalent of onedash for quotes) are all available in my Mercurial repository. They are licensed under the GNU General Public License, version 3 or later.

Hyphens have two uses – they appear in compound words and in words divided across lines. Fortunately, the second use is done automatically by LaTeX and does not affect writing the package. Compound words do not have any spaces before the hyphen, but in lists like ‘mono- and polycrystals’ they may be followed by a space.

Dashes are sometimes surrounded by equal spaces – as in the British style used on this blog – or set without spaces—as in the American style used in this sentence—or surrounded by unequal spaces. Usually the left space is unbreakable. This package assumes that dashes are surrounded by normal spaces, i.e. input characters interpreted as spaces by TeX. (Unbreakable spaces appear more arcane than dashes or even inner quote marks, so the package does not support them in input, but the output will have them.) The TeXnical reason for this will be stated later.

My package does nearly all of its work in the macro that the hyphen, made an active character, is defined as (see the package’s README file for information about using this package). This macro expands to \relax followed by a normal hyphen in math mode (the \relax is probably useful in tables). The same result is obtained in horizontal mode if the current font has a nonpositive space stretch parameter, which probably occurs only for typewriter fonts.

In vertical mode a special dash is used, useful for representing dialogue in Polish texts (English uses quote marks for this, making it easier to determine where a multi-paragraph speech ends, and making inner quotes common). Probably no word begins with a hyphen, so a hyphen starting a paragraph can safely be treated this way. Unix and GNU programs have command-line options beginning with a hyphen, but they are typeset in typewriter type (for which there is a special case).

The complex part lies in horizontal mode. Hyphens do not have leading spaces, so a hyphen is made if \lastskip does not contain a positive value. Thus the common incorrect form of dash, alpha- beta, will be kept as a hyphen. In other cases a dash is made. This ignores the possibility of numeric ranges with en dashes, like ‘69–105’. Instead, 69-105 will use a hyphen and 69 - 105 will use a much more incorrect (but more likely to appear in the input?) dash. Detecting digits before a hyphen is impossible without making all characters active (this could work only for verbatim typesetting of files, which does not need dashes), and detecting them after the hyphen won’t distinguish such cases as ‘69–105’ and ‘2-chloro-3-methylpentane’.
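The decision logic described in the last few paragraphs can be sketched roughly as below. This is not the code of hyphdash: the macro names \sketch@hyphen and \sketch@dash are hypothetical, the choice of an em dash for dialogue and an en dash elsewhere is an assumption, and the hyphen is made active only inside a group so that the rest of the document is unaffected:

```latex
\documentclass{article}
\makeatletter
% A hyphen that does not depend on the catcode of `-` (hypothetical name).
\chardef\sketch@hyphen=`\-
% What the active hyphen expands to (hypothetical name).
\newcommand{\sketch@dash}{%
  \ifmmode
    \relax\sketch@hyphen            % math mode: becomes a minus sign
  \else\ifvmode
    \leavevmode\textemdash          % paragraph start: assumed dialogue dash
  \else\ifdim\fontdimen3\font>\z@   % spaces in this font can stretch
    \ifdim\lastskip>\z@             % a space precedes: make a dash
      \unskip\nobreakspace\textendash
    \else
      \sketch@hyphen                % no space before: keep the hyphen
    \fi
  \else
    \sketch@hyphen                  % typewriter-like font: keep the hyphen
  \fi\fi\fi
}
% Define the active hyphen without changing the catcode globally.
\begingroup
  \catcode`\-=\active
  \gdef-{\sketch@dash}
\endgroup
\makeatother

\begin{document}
\begingroup
\catcode`\-=\active
alpha - beta, mono- and polycrystals, a well-known fact, $a - b$.
\endgroup
\end{document}
```

The space typed after the hyphen is left in the input, so it still produces an ordinary breakable space after the dash, while the space before it is replaced by an unbreakable one.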

There is also another problem – dashes represented by multiple hyphens. The first one will be made a dash, but the following ones (since there is no preceding space) will become hyphens. The standard ligatures for dashes will not be used. Maybe the macro could detect following hyphens, but ‘simple’ solutions like the \@ifnextchar used for optional parameters will not work, since the hyphen is a complicated macro instead of a character. Changing catcodes (e.g. redefining the hyphen as a character) will not work with texts that change catcodes. The first hyphen could set a conditional to ignore the following ones, but it would be difficult to reset it so that hyphens in the next dash are not ignored. The rest is probably more difficult than these solutions. Therefore only a single hyphen may be used as a dash with this package. The macros \textendash and \textemdash may be used instead of multiple hyphens to make a dash character.

Another problem is hyphens used as minus signs in numbers interpreted by TeX. For example, \hspace{-1em} will produce strange results and two error messages. In my opinion the only solutions are either not to use a hyphen in arguments of such commands (e.g. by using the \hyphen macro or by putting all such things into the preamble), or to redefine each primitive TeX command to change the macro making dashes and hyphens (and it is probably impossible to detect where such changes should be made). It is obvious why the first solution is used in the package.

This package also uses an example document as an automatic test to find regressions in future versions (I’ve written previously about this in connection with other packages for dashes and quotes). From the example I learned about most of the limitations of this package described here.