Difficulties of typesetting quote marks in LaTeX

Quote marks are probably the most complicated punctuation marks to typeset in English. Although they should be used for short, simple quotations and other simple fragments of text, they are designed for more arcane uses. This, combined with the influence of typewriters, makes typesetting them difficult.

Quote marks are used exactly like parentheses – they delimit a fragment of a sentence. But unlike all other such characters, inner quotation marks are different symbols from the outer ones (unless the larger outer delimiters in mathematical formulas count as different symbols; they are the most ‘mainstream’ use of parentheses within parentheses). Another difference is that ‘((’ is easily interpreted in the correct way, while ‘“ needs additional spacing (‘ “).

Another problem is that each language has different quote marks. American English uses double quotes as outer marks and single quotes as inner marks, while British English uses single quotes as outer marks and double quotes as inner marks. Polish has a low double opening quote and the English double closing quote; its inner quotes are the French ones, although they are rarely used correctly. American English also moves following commas and periods inside the quotes, which obviously leads to problems in programming-related texts.

LaTeX does not solve these problems, but allows direct specification of appropriate symbols. The English quote marks are represented as ``, '', ` and ', since these are the nearest equivalents on a typical keyboard. The " character is not used, and the space between adjacent quotes must be specified as \, (it could be specified in the font, but this would require separate sets of fonts for American and British texts, and I’m not sure whether it could support third-level nested quotes in either of these dialects).

It would be interesting to use just the " character and let the software decide which quotes are opening and which are closing. But even without support for nested quotations this would be difficult (if possible at all) to do correctly in all cases. A naïve algorithm would just begin with an opening quote and then alternate between closing and opening ones. But this won’t correctly interpret quotes in multi-paragraph dialogue, where each paragraph begins with an opening quote (in Polish, dashes are used instead of quotes for dialogue, and humans cannot interpret multi-paragraph dialogue correctly without backtracking). A common mistake in delimiting block quotations with quote marks may result in a paragraph containing only a closing quote mark, so this algorithm cannot be improved by just resetting to an opening quote at each new paragraph.
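The naïve alternating algorithm can be sketched in a few lines of Python. This is only an illustration: the function name naive_quotes and the choice of the Unicode characters “ and ” as output are my assumptions, not part of any existing package.

```python
def naive_quotes(text, opening="\u201c", closing="\u201d"):
    """Replace each straight double quote, alternating opening and closing marks."""
    result = []
    is_open = False  # assumption: no quotation is open at the start of the text
    for ch in text:
        if ch == '"':
            result.append(closing if is_open else opening)
            is_open = not is_open
        else:
            result.append(ch)
    return "".join(result)


# A simple sentence is handled correctly.
simple = naive_quotes('He said "yes" and "no".')

# Multi-paragraph dialogue: both paragraphs start with an opening quote,
# but only the last one is closed, so the alternation goes out of step.
dialogue = naive_quotes('"First paragraph.\n\n"Second paragraph."')
```

In the dialogue example the second paragraph starts with a closing mark and ends with an opening one, which is exactly the failure described above.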

Emacs uses a different algorithm in the TeX-insert-quote function: it puts opening quotes after whitespace or an opening parenthesis. This method could not be implemented in LaTeX, but it can be done in language-specific fonts. However, the algorithm fails when quoting spaces or parentheses, like ‘(’, which is commonly done in programming-related texts.
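The context-based rule can be sketched in Python as follows. This is not the Emacs code, only a minimal model of the rule as described here; the name contextual_quotes is hypothetical, and treating the start of the text like whitespace is my assumption.

```python
OPENING_CONTEXT = set(" \t\n(")  # whitespace or an opening parenthesis


def contextual_quotes(text, opening="``", closing="''"):
    """Choose the quote mark by looking only at the preceding character."""
    result = []
    previous = None  # assumption: the start of the text counts as whitespace
    for ch in text:
        if ch == '"':
            if previous is None or previous in OPENING_CONTEXT:
                result.append(opening)
            else:
                result.append(closing)
        else:
            result.append(ch)
        previous = ch
    return "".join(result)


# Ordinary prose works, including a quotation inside parentheses.
prose = contextual_quotes('He said "yes" ("no").')

# Quoting a parenthesis fails: the closing quote follows '(' and so is
# wrongly typeset as a second opening quote.
quoted_paren = contextual_quotes('Type "(" to begin.')
```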

The only problem which can be easily solved is which quotes to use. I have written a LaTeX package for this, named quoted (available in my Mercurial repository), but it does not support spaces between quote marks of different levels or moving punctuation into the quotation. There are probably many better packages for this, but none of them will make a useful document ‘portable’ between e.g. British and American dialects of English, so such packages aren’t very useful.

Since parentheses are similar to quotes, but simpler, maybe a single character in source files could be used for them. In the days of typewriters a slash was sometimes used instead of parentheses, since it looks similar. Is it possible to implement a LaTeX macro or virtual font replacing / by a slash or the appropriate parenthesis depending on context?

Typesetting acronyms in LaTeX

This post describes some problems related to using acronyms in typeset text and some solutions to them in the form of LaTeX packages. It does not explain acronyms related to typesetting.

Problems

I’ve noticed three main problems with acronyms:
their meanings are difficult to remember
This is easily solved by appropriate text explaining the meaning, by margin notes with the meaning, by lists of acronyms, etc. For me it’s not a problem, since the acronym itself can be the meaning; in my opinion e.g. ‘DVI’ and ‘horse’ have the same rights as words.
frequent use of capital letters makes text less readable
This is easily solved by using acronyms less frequently or by using slightly smaller type for them.
different ideas are represented by the same acronym
Usually the context determines the meaning; for example, ‘LSD’ as a substance and ‘LSD’ as a digit are rarely used in a single work (except as an example of an acronym, where the meaning is not significant). This may be a problem when the acronym is associated with certain emotions; for example, one DRM is an evil limitation of freedom, while another DRM is a part of software enabling efficient use of GPUs in free operating systems.

Solutions

These problems can be solved in the following ways:

  • rewriting text to use fewer acronyms – the best way, although beyond the scope of this text
  • using a smaller font for acronyms
  • putting the definition of the acronym at its first use, e.g. in a margin note or a tooltip (as is usual on this blog)
  • including a list of all acronyms used in the work with their definitions

LaTeX packages available on CTAN

I’ve found four related packages on CTAN:

acromake
This package supports defining commands for acronyms. Each will result in the full name and the acronym on first use. On the second use a reference to the definition will be made and next uses will put only the acronym.
acronym
This also provides commands allowing precise selection of where full or short names should be used. A list of acronyms is also made. The manual of this package explains how acronyms may use a smaller font.
glossaries
This package allows preparation of many glossaries and can be used for lists of acronyms.
glosstex
Another package for typesetting acronyms. It differs by using a program written in C.
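The core behaviour shared by these packages, expanding an acronym on first use and printing only the short form later, can be modelled in a few lines of Python. This is a simplified sketch, not the logic of any of the packages above; the class Acronyms and its methods are hypothetical names, and the extra back-reference that acromake makes on the second use is left out.

```python
class Acronyms:
    """Track which acronyms were already used in a document."""

    def __init__(self):
        self.definitions = {}  # short form -> full name
        self.used = set()      # short forms that have already appeared

    def define(self, short, full):
        self.definitions[short] = full

    def use(self, short):
        """Return the text to typeset for this occurrence of the acronym."""
        full = self.definitions[short]
        if short in self.used:
            return short
        self.used.add(short)
        return f"{full} ({short})"

    def listing(self):
        """Return the list of all defined acronyms, sorted by short form."""
        return [f"{short}: {full}" for short, full in sorted(self.definitions.items())]


acronyms = Acronyms()
acronyms.define("GPU", "graphics processing unit")
acronyms.define("DRM", "Direct Rendering Manager")
first = acronyms.use("GPU")   # full name with the acronym
second = acronyms.use("GPU")  # only the acronym
```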

My package

Since there are many packages for typesetting acronyms, and I don’t use most of their features, I wrote a new package for this. It is called acronyms and is available from my Mercurial repository. It differs by having evolved from very simple macros for typesetting just the acronym with appropriate spacing and smaller type, adding macros for specific acronyms and support for lists of acronyms. Then I wrote general macros for acronym definitions and added more incomplete features. Now it has optional support for acronym lists using the glossaries package, indexing chosen acronyms, and making margin notes with definitions on first use of each acronym.

Ideas for an implementation of a typesetting system

I’ve written several posts about TeX as a program and things which could be simpler with a typesetting system based on a different design. The basic ideas are that it will be incompatible with TeX and that it will be an easily extensible package written in the Python programming language, with the full power of the language available for typesetting. Then I wrote about font selection, a problem at first glance unrelated to the previous ones.

I haven’t written any useful code for a typesetting system, but if it is ever written, it will be based on these ideas. Some of them might be useful without this program, so I’ve written them down here.

The program will be called Tim, since it is an easy-to-pronounce, short name which can be positively associated with the name and could mean ‘TeX Inspired Modules’. The name clearly shows that art is not important in this project; quality and efficiency may suffer without any useful reason. It is also easier to write about a named program than an unnamed one.

The whole code of Tim will be written in Python and published under the GNU General Public License version 3 or later. Therefore every program using this package will use the same license. This encourages sharing and making documents which are not derivative works of Tim, i.e. writing them in a format which can be treated as data, not a program.

The need to separate code and data leads to another idea. Everything except the core model of horizontal and vertical lists (based on the one used in TeX) should be trivial to replace. Some tasks need many algorithms, e.g. hyphenation uses different ones for different languages, different font formats are in use, etc. So every useful algorithm will be implemented in a plugin, i.e. a function or object used by code independent of this implementation. A special module will determine which plugins are used for which tasks; it will probably store this data in files (one per system, one per user and one per document).

There will be different plugins for selection of fonts (determining which font is nearest to the requested one), for getting metric data from different font formats, for interpreting input text (e.g. to replace spaces by glue and other characters by glyphs, possibly with some transliteration schemes), breaking paragraphs into lines, hyphenation, making lines, making pages, etc.
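A minimal sketch of such a plugin mechanism in Python might look like this. Nothing here exists yet: PluginRegistry, the task name "hyphenation" and the plugin names are all hypothetical, and plain dictionaries stand in for the system, user and document configuration files.

```python
class PluginRegistry:
    """Map (task, plugin name) pairs to the functions implementing them."""

    def __init__(self):
        self._plugins = {}

    def register(self, task, name):
        """Decorator registering a function as a plugin for a task."""
        def decorator(func):
            self._plugins[(task, name)] = func
            return func
        return decorator

    def resolve(self, task, *configs):
        """Pick the plugin for a task; later (more specific) configs win."""
        name = None
        for config in configs:
            name = config.get(task, name)
        if name is None:
            raise LookupError(f"no plugin configured for task {task!r}")
        return self._plugins[(task, name)]


registry = PluginRegistry()

# Toy exception table; a real plugin would use language-specific patterns.
EXCEPTIONS = {"diploma": ["dip", "lo", "ma"]}


@registry.register("hyphenation", "none")
def no_hyphenation(word):
    return [word]


@registry.register("hyphenation", "exceptions")
def exception_hyphenation(word):
    return EXCEPTIONS.get(word, [word])


# Stand-ins for the per-system, per-user and per-document files.
system_config = {"hyphenation": "none"}
user_config = {}
document_config = {"hyphenation": "exceptions"}

hyphenate = registry.resolve("hyphenation", system_config, user_config, document_config)
parts = hyphenate("diploma")
```

The code calling hyphenate does not know which implementation it got, which is the point of the design: replacing an algorithm means changing one line of configuration.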

The similarity of pages and lines is an interesting one. In TeX they are different – a line has some glue on both sides, and pdfTeX also supports kerning with the margins, while a page is produced by an entirely customizable output routine. Also, total-fit is used to make optimal line breaks while first-fit makes page breaks. This results from the large amount of memory used for pages and the large number of lines being made. But is it possible to e.g. typeset each line of a paragraph in a different font? This would be simple if the code making a line box from a part of a horizontal list could easily be replaced by one specific to this job (in this way marginal kerning could also be implemented). This would be like an output routine, but for a line.

In this model the line breaking algorithm (I don’t call it justification, since this name looks more appropriate for making a box from a line) just makes a linear list of line hboxes from a horizontal list. The page breaking algorithm makes a linear list of page vboxes from a vertical list. Elements like glue and penalties are used in both in the same way. So it would be simpler to use exactly the same code for both things. It would also improve page breaks, unless a first-fit plugin is implemented for this.
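To show that one piece of code can serve both purposes, here is a much simplified total-fit breaker in Python. It is only a sketch under strong assumptions: items have fixed sizes, a fixed gap stands in for glue, there are no penalties, and the cost of a chunk is the square of its unused space (the last chunk is free). The name break_list is hypothetical.

```python
import math


def break_list(sizes, gap, target):
    """Split sizes into chunks no larger than target, minimising the total cost."""
    n = len(sizes)
    best = [math.inf] * (n + 1)  # best[i]: minimal cost of breaking sizes[:i]
    start_of = [0] * (n + 1)     # start_of[i]: start of the last chunk in that solution
    best[0] = 0
    for end in range(1, n + 1):
        total = -gap
        for start in range(end - 1, -1, -1):
            total += sizes[start] + gap
            if total > target:
                break  # the chunk sizes[start:end] no longer fits
            slack = target - total
            cost = 0 if end == n else slack ** 2
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                start_of[end] = start
    if math.isinf(best[n]):
        raise ValueError("some item is larger than the target size")
    chunks = []
    end = n
    while end > 0:
        start = start_of[end]
        chunks.append(sizes[start:end])
        end = start
    chunks.reverse()
    return chunks


# Horizontal list: word widths broken into lines of width 6.
lines = break_list([3, 2, 2, 5], gap=1, target=6)

# Vertical list: line heights broken into pages of height 40.
pages = break_list([12, 12, 12, 12, 12], gap=2, target=40)
```

A first-fit algorithm would put the first two words on the first line and leave the third word alone on the second; the total-fit version accepts a looser first line to get a better second one. The same function then breaks the vertical list into pages without any changes.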

There are also more complicated things in a typesetting system. Input languages, such as one very similar to that of TeX and possibly XSL, could be nearly separate from the typesetting code. Similarly, code for typesetting mathematical expressions with TeX-like quality (i.e. code for conversion of math lists to horizontal lists) could be completely separate from the text typesetting code; it would just be used by the interpreter of the input language.

Maybe this will be a useful project or some new ideas for future TeX extensions.

After a break of several days I still plan to learn more about digital typography by writing a typesetting system inspired by TeX. Previously I wrote why I believe that this system should not be compatible with TeX and how using a general-purpose programming language for typesetting will help. This post is about a different issue – how fonts would be represented in this system. For simplicity, I will not write about problems specific to math typesetting; text has enough problems for a post.

As stated by Vulis, the plain TeX model of fonts is inadequate for e.g. academic publications. There each font is a set of 256 characters, which for TeX are just boxes of specific sizes, combined with other characters by ligatures and kerning. Although a specific font may be scaled, this model does not provide any support for using different styles and sizes of fonts. Therefore the macros used for books written by Donald Knuth (e.g. in Appendix E of The TeXbook), GNU Texinfo (texinfo.tex) and LaTeX 2.09 use static tables of font definitions in several styles for some sizes. This approach clearly makes using different font families difficult.

Therefore LaTeX2e uses a different model, called the New Font Selection Scheme. There a font has the following attributes (from LaTeX2e font selection, the file fntguide.pdf in a TeX distribution):

encoding
the mapping of character commands to 8-bit character numbers in TeX fonts; font encodings define also ligatures used
family
this is commonly known as a typeface or font
series
e.g. medium or bold
shape
e.g. italic, roman, slanted, caps and small caps
size
the size of one em

This is clearly appropriate for the original Computer Modern fonts (the default fonts in LaTeX, the only ones known to be available in every TeX distribution since the 1980s), but now it has at least the following problems:

  • font encodings are a useless waste of time and a hindrance for multilingual typesetting; I believe that Unicode will be enough for everybody (imagine that a list of all its characters would not fit in a typical book)
  • slanted (or italic) small capitals cannot be easily represented in this scheme; the package slantsc allows their use as a different shape, exactly what the scheme was designed to avoid
  • font sizes are usually artificially limited to avoid scaling the fonts (this was a problem before scalable fonts or automatic generation of bitmap fonts by dvi drivers)

This is different with OpenType as used with e.g. XeTeX. There a font family has equivalents of series and only roman and italic shapes (making both italic and slanted roman fonts is too difficult without METAFONT). Things like small capitals or strange ligatures are enabled by features within the same font file. Clearly, this model does not have the problems listed above.

The CSS3 fonts module working draft describes another set of font attributes. It has a ‘correct’ style attribute, one for width (one font family for Antykwa Toruńska and Antykwa Toruńska Condensed would be nice), a separate attribute for small caps, and much nicer support for relative font sizing than LaTeX.

But this is not everything that can be done with a font. For TeX only the metrics are important, but still other things cannot be easily expressed there. For example, coloured or underlined hyphenated text is very difficult to obtain in TeX. Colour clearly does not affect boxes (I’m not sure how underlining affects the depth of a box), so it could be determined after breaking the paragraph into lines. Currently systems like XeTeX have specific support for such things, but in my opinion a generic method for all changes to the fonts after a page is produced is possible. So in my system I would add one new font attribute – a Python function processing the text when a page is shipped to the output file. It would add things like colour, outlines or underlining to the text (letterspacing, although solved similarly to underlining in the soul package, would need a completely different solution, but it will be trivial in a system with complete access to hyphenation and boxes). This would be similar to whatsits in TeX boxes, used for writing to files when a box is shipped and for putting special instructions for dvi drivers (e.g. for coloured text or for boxes).
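A font with such an attribute could be sketched in Python like this. All names are hypothetical (Font, on_shipout, ship_out), the attribute list loosely follows the NFSS one without the encoding, and the angle-bracket markup merely stands in for whatever a real output driver would write.

```python
from dataclasses import dataclass
from typing import Callable


def identity(text):
    """Default shipout function: leave the text unchanged."""
    return text


def underline(text):
    """Hypothetical effect applied only when the page is shipped out."""
    return f"<u>{text}</u>"


@dataclass(frozen=True)
class Font:
    family: str
    series: str = "medium"
    shape: str = "roman"
    size: float = 10.0
    on_shipout: Callable[[str], str] = identity  # the proposed new attribute


def ship_out(page):
    """page: list of (text, font) runs, already broken into lines."""
    return "".join(font.on_shipout(text) for text, font in page)


body = Font("Computer Modern")
marked = Font("Computer Modern", on_shipout=underline)

# The underlined word was hyphenated during line breaking; the effect is
# applied afterwards to each fragment, so it never interferes with breaking.
page = [("plain ", body), ("impor-", marked), ("tant", marked)]
output = ship_out(page)
```

Since the function runs only at shipout, it sees the final fragments of the text and the line breaking code does not need to know anything about underlining or colour.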

This also leads to another interesting problem – how should ‘interdisciplinary’ be hyphenated when the font changes inside the word? And what to do when the font change has no obvious correlation with parts of words? In my opinion a font should be treated as a property of a character that is ignored for hyphenation (like ligatures and kerning).
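Treating the font as a per-character property ignored by hyphenation can be sketched in Python as follows. The hyphenation table is a toy with illustrative break positions, not real patterns, and the function names are hypothetical.

```python
# Toy table; the break positions are illustrative only.
HYPHENATION = {"interdisciplinary": "in-ter-dis-ci-pli-nary"}


def break_points(word):
    """Return the character indices before which the word may be broken."""
    pattern = HYPHENATION.get(word, word)
    points = []
    index = 0
    for ch in pattern:
        if ch == "-":
            points.append(index)
        else:
            index += 1
    return points


def hyphenate_styled(chars):
    """chars: list of (character, font) pairs forming one word."""
    word = "".join(ch for ch, _ in chars)  # fonts are ignored here
    points = set(break_points(word))
    pieces, current = [], []
    for i, pair in enumerate(chars):
        if i in points and current:
            pieces.append(current)
            current = []
        current.append(pair)  # each character keeps its own font
    pieces.append(current)
    return pieces


# The font changes after 'interdisc', i.e. in the middle of a syllable.
styled = [(ch, "italic") for ch in "interdisc"] + [(ch, "roman") for ch in "iplinary"]
pieces = hyphenate_styled(styled)
syllables = ["".join(ch for ch, _ in piece) for piece in pieces]
```

The word is hyphenated as if it were set in one font, and the fragment ‘ci’ simply contains characters in two fonts.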

Some improvements in an old flyer typeset in LaTeX

In September and October 2007 I designed and typeset a flyer for my school using LaTeX. Today I corrected some small typographic problems in this flyer and noticed how much better it could be with the improvements in TeX distributions and in my knowledge of LaTeX.

The flyer was mostly designed interactively at school, consulting every change with the coordinators of the international programmes described in the flyer. That is, I used an SSH client (most probably OpenSSH in Cygwin with X11; the school uses a difficult operating system) to connect to my FreeBSD server at home with Emacs 21 and LaTeX from teTeX 3. After each change to the source the resulting PDF file was downloaded using HTTP to a local PDF viewer. All of this was done using a 512 kbps connection; we waited several minutes for each transfer of the PDF file of about 12 megabytes.

Things have changed in the last two years. Now I use a laptop with newer (i.e. less tested) software, and the viewed PDF may be updated in several seconds. Emacs 23 pre-releases have support for nicely antialiased fonts and Unicode, making it much more comfortable to use than xterm. But most important for the flyer are the changes in TeX Live and in how I use it.

The flyer has three typographic improvements – correct hyphenation of ‘diploma’ (by adding \hyphenation{dip-lo-ma} to the preamble), much better line breaking by font expansion (scalable Concrete Roman fonts in T1 encoding are now available; without them it would be very difficult), and unindented first paragraphs of each section (each such text begins with a macro to whose definition I added \noindent).

The contact data also changed. Maybe it is an appropriate reason for improvements in typeset texts for which new experience may be used?