
Are publications about TeX from 1980 still relevant?


Most creative works made today won’t be used five years from now. The most successful ones, like Mickey Mouse, will make the less successful ones disappear from our memories, along with the material on which they were made, due to copyright. Probably everyone who has used a computer for longer than ten years has seen another reason why older works become nearly unknown – new, better works are made and replace the older ones in all their uses. I once wrote about this process in the history of programming languages; the same should be observable for every useful piece of software. For every program, newer, better ones will be designed and replace it. But does this conjecture apply to TeX, the typesetting system designed and implemented in 1977–1990? Are the ideas presented in old articles about TeX still valid and relevant to current users?

This post examines only a part of the first issue of TUGboat, the journal of the TeX Users Group, published in October 1980. Since then TeX and METAFONT have been rewritten in WEB (which, as Donald Knuth wrote, was designed in September 1981), and the LaTeX, ConTeXt and Texinfo macro packages have been developed. Printing has also changed completely, with PostScript introduced in 1984 and desktop publishing based on it.

The editor’s comments contain a nice description of how software helps writers – ‘TeX, AMS-TeX and METAFONT are software tools which will make the processing of scientific documents less painful, less expensive and more rapid. Authors working at editing terminals will find correcting easier and they will be spared much of the pain of proofreading’. Today ‘scientific documents’ and ‘editing terminals’ have changed, so this description also applies to WYSIWYG word processors. I don’t know how difficult it would be to prepare a mathematical paper using hot-metal technology, but I’m sure that, given appropriate knowledge of both, it is simpler with TeX than with a WYSIWYG word processor. It is nice to read an article from the 1980s which assumed that users may read a manual in order to use a program.

Then Richard Palais described several problems related to implementing TeX in Pascal. Although there were no better alternatives to Pascal as a language for portable programs, the Pascal compilers were too incompatible for this. For this reason the system-dependent code in TeX was separated from the code working on all systems. Clearly portability is less important now – a few systems have replaced all others in nearly all uses. Also, system-dependent code is now usually shared by many programs. A program written in a high-level programming language (there are more of them than words in a typical blog post) can be run unmodified on every typical computer with every typical operating system. So the problem of writing portable programs may be considered completely solved.

After this Palais describes how to use TeX. It is nice to compare it with modern word processors. For a user of TeX it is easier to use styles designed by experts for the specific book or journal than to design their own. This was clearly reversed by word processors which, in combination with inferior defaults, led to a large decrease in typesetting quality. Then some performance numbers follow, showing how slow TeX was in 1980. Improvements in hardware and compilers changed this, but word processors are still too slow to use, unlike TeX, which was useful even when producing a page took several seconds.

Then several details of how TeX works are explained. Except for enlarged font limits, these did not change in TeX82. It is stated that the output of TeX is device independent. Clearly, this is not true for some popular word processors or web browsers (Mozilla Firefox still prints text with glyph positioning optimized for the screen). The following description of METAFONT does not state its advantages over more recently developed font formats like PostScript Type 1; it states only that it allows fonts to be used on different devices.

Palais also describes an important advantage of having papers typeset by their authors – half of the costs of journals result from the ‘activity of adding errors and then removing them’ which is done when typesetting the paper. Another cost related to a journal is storing it on paper and photocopying useful articles. Palais wrote that in the future journals will be stored in electronic form and printed only when needed. For efficiency reasons, the articles would be stored in a device-independent format. It’s a very useful idea; I have read about it in the electronic form of a journal article (screens are now good enough to avoid printing such articles). Clearly, electronic journals are implemented by the World Wide Web, thirteen years after the first issue of TUGboat was written. Although HTML, on which the Web is based, does not support ‘real’ mathematics, storing PDF files with articles (typeset by TeX) is a direct implementation of the idea.

These two texts describe both issues which have not changed and some which have changed completely. TeX has improved and newer alternatives have been made, but these articles are still useful. The next posts will examine more articles from this issue.

My previous post described the elements of a horizontal list in TeX. This one will describe the elements which are broken into pages and some improvements which should now be more feasible than in the 1980s, when TeX was implemented.

Chapter 15 of The TeXbook by Donald E. Knuth explains the page breaking algorithms of TeX and how they may be used to produce beautiful books. The first paragraph (page 109) states that page breaking is much more difficult than line breaking, since ‘pages often have much less flexibility than lines do’. Unlike line breaking, which uses the total-fit algorithm enabling optimal breaks of whole paragraphs, page breaking uses a first-fit algorithm, so only the current page is ‘seen’ by TeX when selecting appropriate breaks. As Knuth explains (page 110), this design difference is based on the unavailability of enough high-speed memory to store several pages. This was certainly true in the 1980s, but now many complete books fit in the modern equivalents of the high-speed memory of those days.
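
The difference between the two strategies can be sketched in a few lines of Python. This is only an illustration under strong assumptions, not TeX's actual algorithm: items have fixed integer sizes with no stretch or shrink, the cost of a chunk is the square of its unused space, and the last chunk is penalized like any other. The function names `first_fit`, `total_fit` and `cost` are made up for this sketch.

```python
from typing import List


def first_fit(sizes: List[int], capacity: int) -> List[int]:
    """Greedy breaking: fill each chunk as far as possible, never look ahead.

    Returns the list of break positions (index just past each chunk).
    """
    breaks, used = [], 0
    for i, size in enumerate(sizes):
        if used > 0 and used + size > capacity:
            breaks.append(i)  # close the current chunk before item i
            used = 0
        used += size
    breaks.append(len(sizes))
    return breaks


def cost(used: int, capacity: int) -> int:
    """Assumed cost of a chunk: squared unused space."""
    return (capacity - used) ** 2


def total_fit(sizes: List[int], capacity: int) -> List[int]:
    """Dynamic programming over all feasible breaks, minimizing total cost.

    Assumes every single item fits into one chunk.
    """
    n = len(sizes)
    best = [float("inf")] * (n + 1)  # best[j]: minimal cost of breaking sizes[:j]
    prev = [0] * (n + 1)             # prev[j]: start of the last chunk in that solution
    best[0] = 0
    for j in range(1, n + 1):
        used = 0
        for i in range(j - 1, -1, -1):  # try the chunk sizes[i:j]
            used += sizes[i]
            if used > capacity:
                break
            candidate = best[i] + cost(used, capacity)
            if candidate < best[j]:
                best[j], prev[j] = candidate, i
    breaks, j = [], n
    while j > 0:  # walk back through the chosen chunk starts
        breaks.append(j)
        j = prev[j]
    return breaks[::-1]


sizes = [4, 3, 3, 3, 3]
print(first_fit(sizes, 10))  # [3, 5]: a full chunk followed by a loose one
print(total_fit(sizes, 10))  # [2, 5]: two evenly filled chunks
```

The total-fit version has to keep the whole list in memory, which is exactly the requirement that was unrealistic for pages in the 1980s.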

Both vertical and horizontal lists contain boxes, glue, kerns and penalties. I’ve described them previously; there are no interesting differences here except for the direction of typesetting. Whatsits and marks were explained in that post, since they are passed from horizontal lists to vertical ones.

There are two types of material occurring in only one of these two modes – discretionary breaks exist only in horizontal mode (in vertical mode, output routines do special tricks instead), while insertions are used to put some material in special places on pages (most commonly footnotes, floating tables and figures). Discretionary breaks in vertical lists would probably simplify some things that now require complicated output routines, for example typesetting indices with the entry text repeated on pages beginning with subentries (a solution using marks is explained in The TeXbook, pages 261–263).

The output routine is one of the new ideas in TeX. It allows nearly arbitrary modifications of the page produced from the vertical list, to a box which is shipped out to an output file. Output routines allow things like multicolumn typesetting, special headers and footers, footnotes and correctly floating figures.

The output routine is very useful in vertical mode, so would something similar be useful in horizontal mode? Lines are just boxes of a certain width and shift (chosen by e.g. \parshape), with special glue on both sides (to allow e.g. ragged-right typesetting) and content determined by the total-fit algorithm (pdfTeX also adds margin kerning). It would be interesting to have an arbitrary TeX token list produce such boxes. It would probably make things like line counting or a repeated opening quote mark simpler. It could also determine how nice the line is and possibly change it according to the number of previous lines. Is there a nice TeX solution for typesetting the first line of a paragraph in small caps? According to a TUG interview with Werner Lemberg it is simple in Troff. The ‘line routine’ would make it simple in a TeX-like system.
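
A rough sketch of what such a ‘line routine’ could look like, written in Python rather than as a TeX token list. Everything here is hypothetical: the function `break_with_line_routine`, its `width_of(line_index, word)` callback, and the use of a greedy breaker instead of total-fit. The only point is that the routine knows which line it is building, so the first line can be measured (and later set) in a different font.

```python
from typing import Callable, List


def break_with_line_routine(
    words: List[str],
    width_of: Callable[[int, str], float],
    capacity: float,
    space: float = 1.0,
) -> List[List[str]]:
    """Greedy line breaking where word widths depend on the line number.

    width_of(line_index, word) stands in for a line routine that picks
    the font for each line; a real system would also build the box.
    """
    lines: List[List[str]] = []
    current: List[str] = []
    used = 0.0
    for word in words:
        width = width_of(len(lines), word)
        needed = width if not current else used + space + width
        if current and needed > capacity:
            lines.append(current)
            current = [word]
            # The word moves to the next line, so measure it again there.
            used = width_of(len(lines), word)
        else:
            current.append(word)
            used = needed
    if current:
        lines.append(current)
    return lines


def first_line_wide(line_index: int, word: str) -> float:
    """Assumed metrics: glyphs on the first line are twice as wide."""
    return len(word) * (2.0 if line_index == 0 else 1.0)


words = ["first", "line", "in", "caps", "then", "normal"]
print(break_with_line_routine(words, first_line_wide, capacity=20))
# [['first', 'line'], ['in', 'caps', 'then', 'normal']]
```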

The line routine would determine the badness of a line; clearly ragged-right text has different optimal breaks than justified text (compare the broken LaTeX ragged text commands with normal justified text; use the ragged2e package instead). In vertical mode the badness of a page break is determined before calling the output routine, but the routine may decide to change the break. Wouldn’t this be simpler with an output routine called for each feasible break to determine the badness of this break?

There are two possible solutions to the problems of the current page breaking algorithm. One is a total-fit page breaking which would also make a typesetting system simpler (the same total-fit algorithm could be used for both lines and pages). The other one is a better cooperation between line breaking and page breaking (proposed at least once for the NTS, the project which led to e-TeX). Maybe if badness was calculated for a chapter as a whole, things like adjusting \looseness by hand to prevent bad page breaks would be automated in a way not possible with TeX?

Horizontal lists in TeX


One of the most important typesetting ideas on which TeX is based is the box/glue/penalty model. It is used both to break paragraphs into lines, and to break lines into pages. Since these processes are similar, lines and pages have similar representations. The aim of this post is to describe how material of a paragraph is represented.

The TeXbook by Donald E. Knuth lists the elements of a horizontal list (the material which is broken into lines and put in horizontal boxes) in Chapter 14, page 94:

  • boxes
  • discretionary breaks
  • whatsits
  • vertical material
  • glue
  • kerns
  • penalties
  • math-on and math-off

Boxes do not need any explanation; they are the visible elements of texts, usually glyphs, rules or their combinations (e.g. a table is usually a box made from simpler boxes). Glue and kerns make whitespace between them. Discretionary breaks allow breaking lines in more complicated ways than just removing whitespace. Penalties control how bad the breaks are. These elements have a clear use in the line breaking algorithm. They are the only elements of a horizontal list that I’ve directly met in LaTeX.

Math-on and math-off are the additional whitespace made by \mathsurround. They differ from kerns by not allowing breaking on glue or kerns inside math formulas. So in a new typesetting system they probably could be replaced by a kern and infinite penalties at appropriate places inside the formula.

Glue and kerns look similar (on paper they are the same, white areas between glyphs), but they have two main differences – glue is stretchable and separates words (for automatic hyphenation), while kerns do not change their size and make words unhyphenable. There are two types of kerns – explicit ones, which are put directly by the \kern primitive, and implicit ones, which are completely automatic and do not affect hyphenation.

In all of the above differences between glue and kerns, explicit kerns look similar to empty boxes. But there are two important differences – boxes also have vertical dimensions (useful for making proper vertical spacing in tables) and they are not discardable, so a box cannot be removed at a line break while a kern is removed there (imagine a justified paragraph with empty boxes at the beginnings or endings of lines; it would be ragged). This is a nice example of how different the elements of a horizontal list are – every one of them is useful, and none can be completely replaced by any other.
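
The distinction between discardable and non-discardable items can be made concrete with a small Python model. This is a simplified sketch, not TeX's data structures: the class names mirror the list elements discussed above, math-on and math-off are left out, and `strip_after_break` is a hypothetical helper showing only what happens to the items at the start of a new line.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Box:
    width: float
    height: float = 0.0  # boxes also have vertical dimensions
    depth: float = 0.0


@dataclass
class Glue:
    width: float
    stretch: float = 0.0
    shrink: float = 0.0


@dataclass
class Kern:
    width: float  # fixed size, no vertical dimensions


@dataclass
class Penalty:
    value: int


Item = Union[Box, Glue, Kern, Penalty]

# Items that vanish when they directly follow a break.
DISCARDABLE = (Glue, Kern, Penalty)


def strip_after_break(items: List[Item]) -> List[Item]:
    """Drop discardable items up to the first non-discardable one."""
    for index, item in enumerate(items):
        if not isinstance(item, DISCARDABLE):
            return items[index:]
    return []


# A kern at the start of a line disappears, an empty box of the same width stays.
print(strip_after_break([Kern(5.0), Glue(3.0), Box(10.0)]))
print(strip_after_break([Box(0.0), Kern(5.0), Box(10.0)]))
```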

Vertical mode material is put in a horizontal list to be placed between lines produced from the list. This may be used e.g. to put a page break after the current line when it is not known where the line ends. It is used also for marks which are token lists put in the page, the output routine (more on this later) will access some of them. Similarly, whatsits are used when a page is produced, but after the output routine. They are used to write page numbers to files (necessary to make an index), to make right to left text in e-TeX, and to give DVI drivers special commands, e.g. to change colour of text or to make a hyperlink.

Ideas for an implementation of a typesetting system


I’ve written several posts about TeX as a program and things which could be simpler with a typesetting system based on a different design. The basic ideas are that it will be incompatible with TeX and that it will be an easily extendable package written in the Python programming language, with the full power of the language available for typesetting. Then I wrote about font selection, a problem at first glance unrelated to the previous ones.

I haven’t written any useful code for a typesetting system, but if it is ever written it will be based on these ideas. Some of them might be useful even without this program, so I’ve written them down here.

The program will be called Tim, since it is an easy to pronounce, short name, which can be positively associated with the name and could mean ‘TeX Inspired Modules’. The name clearly shows that art is not important in this project, quality and efficiency may suffer without any useful reason. It is also easier to write about a named program than an unnamed one.

The whole code of Tim will be written in Python and published under the GNU General Public License version 3 or later. Therefore every program using this package will use the same license. This encourages sharing and making documents which are not derivative works of Tim, i.e. writing them in a format which can be treated as data, not a program.

The need to separate code and data leads to another idea. Everything except the core model of horizontal and vertical lists (based on the one used in TeX) should be trivial to replace. Many algorithms are needed for some tasks, e.g. different hyphenation algorithms are used for different languages, different font formats are used, etc. So every useful algorithm will be implemented in a plugin, i.e. a function or object used by code independent of this implementation. A special module will determine which plugins are used for which tasks; it will probably store this data in files (one per system, one per user and one per document).
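
A minimal sketch of how such a plugin registry might look. Nothing here exists yet: the names `register` and `resolve`, the task name `"hyphenation"` and the two toy plugins are all assumptions made for illustration, and the three configuration levels are plain dictionaries instead of files.

```python
from typing import Callable, Dict, List

# task name -> plugin name -> implementation
_registry: Dict[str, Dict[str, Callable]] = {}


def register(task: str, name: str) -> Callable[[Callable], Callable]:
    """Decorator registering a function as a named plugin for a task."""
    def decorator(func: Callable) -> Callable:
        _registry.setdefault(task, {})[name] = func
        return func
    return decorator


def resolve(task: str, *configs: Dict[str, str]) -> Callable:
    """Pick the plugin for a task; later configs override earlier ones.

    The intended order is system, user, document.
    """
    chosen = None
    for config in configs:
        chosen = config.get(task, chosen)
    if chosen is None:
        raise LookupError(f"no plugin configured for task {task!r}")
    return _registry[task][chosen]


@register("hyphenation", "none")
def no_hyphenation(word: str) -> List[str]:
    """Never break a word."""
    return [word]


@register("hyphenation", "explicit")
def explicit_hyphenation(word: str) -> List[str]:
    """Break only at hyphens already present in the input."""
    return word.split("-")


system_config = {"hyphenation": "none"}
user_config: Dict[str, str] = {}
document_config = {"hyphenation": "explicit"}

hyphenate = resolve("hyphenation", system_config, user_config, document_config)
print(hyphenate("inter-disciplinary"))  # ['inter', 'disciplinary']
```

The code calling `hyphenate` does not know which algorithm it gets, which is the property the paragraph above asks for.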

There will be different plugins for selecting fonts (determining which font is nearest to the requested one), for getting metric data from different font formats, for interpreting input text (e.g. to replace spaces by glue and other characters by glyphs, possibly with some transliteration schemes), for breaking paragraphs into lines, for hyphenation, for making lines, for making pages, etc.

The similarity of pages and lines is an interesting one. In TeX they are different – a line has some glue on both sides (pdfTeX also supports kerning with the margins), while a page is produced by an entirely customizable output routine. Also, total-fit is used to make optimal line breaks while first-fit makes page breaks. This results from the large amount of memory used for pages and the large number of lines being made. But is it possible to e.g. typeset each line of a paragraph in a different font? This would be simple if the code making a line box from a part of a horizontal list could easily be replaced by code specific to this job (marginal kerning could also be implemented in this way). This would be like an output routine, but for a line.

In this model the line breaking algorithm (I don’t call it justification, since this name looks more appropriate for making a box from a line) just makes a linear list of line hboxes from a horizontal list. The page breaking algorithm makes a linear list of page vboxes from a vertical list. Elements like glue and penalties are used in both in the same way. So it would be simpler to use exactly the same code for both things. It would also improve page breaks, unless a first-fit plugin is implemented for this.
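
The reuse can be sketched as one generic function applied twice. This is a toy under stated assumptions: `break_list` is a hypothetical name, the breaker is a simple greedy one (so it plays the role of the first-fit plugin, a total-fit one would have the same interface), word width is just the character count and every line has height 1.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def break_list(
    items: Sequence[T], size: Callable[[T], float], capacity: float
) -> List[List[T]]:
    """Split items into consecutive chunks whose total size fits capacity."""
    chunks: List[List[T]] = []
    current: List[T] = []
    used = 0.0
    for item in items:
        item_size = size(item)
        if current and used + item_size > capacity:
            chunks.append(current)
            current, used = [], 0.0
        current.append(item)
        used += item_size
    if current:
        chunks.append(current)
    return chunks


words = ["TeX", "breaks", "lines", "and", "pages", "alike"]

# Horizontal list -> lines: the size of a word is its number of characters.
lines = break_list(words, size=len, capacity=10)

# Vertical list -> pages: the same code, each line counts as height 1.
pages = break_list(lines, size=lambda line: 1, capacity=2)

print(lines)  # [['TeX', 'breaks'], ['lines', 'and'], ['pages', 'alike']]
print(pages)  # two lines on the first page, one on the second
```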

There are also more complicated things in a typesetting system. Input languages, such as one very similar to that of TeX and possibly XSL, could be nearly separate from the typesetting code. Similarly, code for typesetting mathematical expressions with TeX-like quality (i.e. code for conversion of math lists to horizontal lists) could be completely separate from text typesetting code; it would just be used by the interpreter of the input language.

Maybe this will be a useful project or some new ideas for future TeX extensions.

After a break of several days I still plan to learn more about digital typography by writing a typesetting system inspired by TeX. Previously I wrote why I believe that this system should not be compatible with TeX and how using a general purpose programming language for typesetting will help. This post is about a different issue – how fonts would be represented in this system. For simplicity, I will not write about problems specific to math typesetting; text has enough problems for a post.

As stated by Vulis, the plain TeX model of fonts is inadequate for e.g. academic publications. There each font is a set of 256 characters which for TeX are just boxes of specific size, combined with other characters by ligatures and kerning. Although a specific font may be scaled, this model does not provide any support for using different styles and sizes of fonts. Therefore macros used for books written by Donald Knuth (e.g. in Appendix E of The TeXbook), GNU Texinfo (texinfo.tex) and LaTeX 2.09 use static tables of different font definitions in several styles for some sizes. This approach clearly makes using different font families difficult.

Therefore LaTeX2e uses a different model, called the New Font Selection Scheme. There a font has the following attributes (from LaTeX2e font selection, the file fntguide.pdf in a TeX distribution):

encoding
the mapping of character commands to 8-bit character numbers in TeX fonts; font encodings define also ligatures used
family
this is commonly known as a typeface or font
series
e.g. medium or bold
shape
e.g. italic, roman, slanted, caps and small caps
size
the size of one em
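
These five attributes map naturally onto a small record type. The sketch below is not part of LaTeX or of any existing package; `FontSpec` is a hypothetical Python class, and only the attribute values used (`T1`, `cmr`, `m`, `bx`, `n`, `sc`) are ordinary NFSS codes.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FontSpec:
    encoding: str  # e.g. "OT1" or "T1"
    family: str    # e.g. "cmr" for Computer Modern Roman
    series: str    # e.g. "m" (medium) or "bx" (bold extended)
    shape: str     # e.g. "n", "it", "sl" or "sc"
    size: float    # in points


body = FontSpec(encoding="T1", family="cmr", series="m", shape="n", size=10.0)

# Changing one attribute leaves the others alone, as with \bfseries or \scshape.
bold = replace(body, series="bx")
small_caps = replace(body, shape="sc")

# The shape attribute holds a single value, so "slanted" and "small caps"
# cannot both be requested at once without inventing a new shape value.
print(bold)
print(small_caps)
```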

This is clearly appropriate for the original Computer Modern fonts (the default fonts in LaTeX, the only ones known to be available in every TeX distribution since the 1980s), but now it has at least the following problems:

  • font encodings are a useless waste of time and hindrance for multilingual typesetting; I believe that Unicode will be enough for everybody (imagine that a list of all its characters would not fit in a typical book)
  • slanted (or italic) small capitals cannot be easily represented in this scheme; the package slantsc allows their use as a different shape, exactly what the scheme was designed to avoid
  • usually font size is artificially limited to avoid scaling them (this was a problem before scalable fonts or automatic generation of bitmap fonts by dvi drivers)

This is different with OpenType as used with e.g. XeTeX. There a font family has equivalents of series and only roman and italic shapes (making both italic and slanted roman fonts is too difficult without METAFONT). Things like small capitals or unusual ligatures are enabled by features within the same font file. Clearly, this model does not have the problems listed above.

The CSS3 fonts module working draft describes another set of font attributes. It has a ‘correct’ style attribute, one for width (a single font family for Antykwa Toruńska and Antykwa Toruńska Condensed would be nice), a separate attribute for small caps, and much nicer support for relative font sizing than LaTeX.

But this is not everything that can be done with a font. For TeX only the metrics are important, but still other things cannot be easily expressed there. For example, coloured or underlined hyphenated text is very difficult to obtain in TeX. Colour clearly does not affect boxes (I’m not sure how underlining affects the depth of a box), so it could be determined after breaking the paragraph into lines. Currently systems like XeTeX have specific support for such things, but in my opinion a generic method for all changes to the fonts after a page is produced is possible. So in my system I would add one new font attribute – a Python function processing the text when a page is shipped to the output file. It would add things like colour, outlines or underlining to the text (letterspacing, although solved similarly to underlining in the soul package, would need a completely different solution, but it will be trivial in a system with complete access to hyphenation and boxes). This would be similar to whatsits in TeX boxes, used for writing to files when a box is shipped and for putting special instructions for dvi drivers (e.g. for coloured text or for boxes).
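
A small sketch of this extra attribute, under loud assumptions: `Glyph`, `ship_out`, `underline` and the textual output commands are all invented for illustration and do not correspond to DVI, PDF or any real driver. The point is only that the decoration function runs after line breaking, once the glyph positions are final.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Glyph:
    char: str
    x: float      # horizontal position, known only after line breaking
    width: float
    # Hypothetical font attribute: called when the page is shipped out.
    decorate: Optional[Callable[["Glyph"], List[str]]] = None


def underline(glyph: Glyph) -> List[str]:
    """Emit a rule under the glyph; it does not change the glyph's box."""
    return [f"rule {glyph.x} {glyph.x + glyph.width}"]


def ship_out(line: List[Glyph]) -> List[str]:
    """Turn a finished line into output commands (a made-up text format)."""
    commands: List[str] = []
    for glyph in line:
        if glyph.decorate is not None:
            commands.extend(glyph.decorate(glyph))
        commands.append(f"glyph {glyph.char!r} at {glyph.x}")
    return commands


line = [Glyph("a", 0.0, 5.0, decorate=underline), Glyph("b", 5.0, 5.0)]
for command in ship_out(line):
    print(command)
# rule 0.0 5.0
# glyph 'a' at 0.0
# glyph 'b' at 5.0
```

Because the decoration is attached to each glyph and not to a box around a word, hyphenating an underlined word needs no special treatment in this model.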

This also leads to another interesting problem – how should ‘interdisciplinary’ be hyphenated? And what should be done when the font change has no obvious correlation with parts of words? In my opinion the font should be treated as a property of a character that is ignored for hyphenation (like ligatures and kerning).