August 2009 Archives

PDF metadata in LaTeX with hyperref


Although printed papers have many advantages, electronic ones have three important advantages of their own: they are more tree-friendly, they are easily shared, and they are searchable. This post describes how to add some metadata to documents typeset using LaTeX and shared as PDF files.

One of the most useful packages for PDFs shared on the Internet is hyperref. Unless special options are used, it makes all references to pages and sections clickable hyperlinks (very useful when reading technical texts on screen, especially if the printed page numbers differ from those in the file). Links are the main use of this package, but it also supports specifying useful metadata about the document.

The following options of hyperref specify PDF metadata (Section 3.6 of the hyperref manual; the explanations are based on PDF Reference 1.7, page 844):

pdftitle
pdfauthor
title and author of the document in the common meaning of these words
pdfsubject
pdfkeywords
just subject and keywords of the document
pdfcreator
the program which made the original document, the default is appropriate
pdfproducer
the program which converted the document to PDF, also has appropriate default
pdflang
the language of the document in the format of RFC 3066 (according to the PDF Reference, different languages may be specified for parts of a document; I’m not sure if any babel-like package supports this)

When these cannot be specified as package options, pass them to the \hypersetup macro in the same format.

The values of all of the above options are plain text, which may contain Unicode (in my UTF-8 encoded document Unicode works in the argument of \hypersetup, but not in the package options). The command \texorpdfstring{TeX string}{PDF string} may be useful in macros that are used both in this metadata and in typeset text.
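The \hypersetup form described above can be tried with a small test document. The following is only a sketch: it is wrapped in a Bourne shell here-document so that it can be run from a command line, and the file name and all metadata values are placeholders. It compiles the file only if pdflatex is installed, and prints the stored fields only if pdfinfo (from poppler) is available.

```shell
# Write a minimal LaTeX file that sets the metadata through \hypersetup.
# The file name, title, author, subject and keywords are all placeholders.
workdir=$(mktemp -d)
cat > "$workdir/metadata-demo.tex" <<'EOF'
\documentclass{article}
\usepackage{hyperref}
\hypersetup{
  pdftitle={An example title},
  pdfauthor={An Example Author},
  pdfsubject={PDF metadata},
  pdfkeywords={LaTeX, hyperref, metadata},
  pdflang={en-US}
}
\begin{document}
% \texorpdfstring gives the bookmark (a PDF string) a plain-text variant.
\section{\texorpdfstring{$E = mc^2$}{E = mc2} in a heading}
Text.
\end{document}
EOF

# Compile only if pdflatex is installed; pdfinfo, if present, then prints
# the metadata fields stored in the resulting PDF.
if command -v pdflatex >/dev/null 2>&1; then
    (cd "$workdir" &&
        pdflatex -interaction=nonstopmode metadata-demo.tex >/dev/null) &&
        command -v pdfinfo >/dev/null 2>&1 &&
        pdfinfo "$workdir/metadata-demo.pdf"
fi
```

The same keys could instead be passed as package options; \hypersetup is used here because it is the form in which Unicode worked for me.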

Common internationalization problems


Some time ago I wrote about localization of software. This post describes some problems of using a program in a language other than American English, apart from the two trivial ones: the lack of a single language used by everyone, and programs without any localization. It is based on my experience of using free software localized to Polish, but it should apply to some other inflected European languages. Some ‘localization’ mistakes can be easily observed even in English.

In these situations translations are often incorrect:

sentence/title construction
‘Remove icon’ is clearly correct, and in English ‘Remove Icon’ would probably also be accepted. But in Polish ‘Usuń Ikona’ is incorrect. There are two problems here: lack of inflection and incorrect capitalization. In this case the problem is caused by combining the normal name of the object with a general removal text. It would be solved by each object having a separate ‘Remove X’ text, e.g. ‘Remove icon’ translated into ‘Usuń ikonę’ (although this won’t stop translators from using incorrect capitalization in their texts). The GNU Coding Standards show a different example of this.
using a single text for counted objects
‘N comments’ is a good example of this. Even in English I have found programs using the form ‘1 comments’ or ‘N comment(s)’. In Polish it is more difficult, with three plural forms, as stated by the GNU Coding Standards. Fortunately, for positive numbers the problem is completely solved by e.g. GNU Gettext, although having a different form for zero objects would still be better (e.g. ‘no comments’).
ignoring the grammatical gender
This may occur when constructing text about objects such as icons or floppy disks, but it is commonly found on the Web in texts about users. In English ‘he’ or ‘she’ are rarely used in messages about the user, but in many Indo-European languages nearly everything depends on gender. Fortunately, some software, like MediaWiki, is beginning to support specifying the grammatical gender of its user. (It is interesting that many roguelikes require the user to specify their gender, although they support only English.)
non-ASCII punctuation
Again, this problem can be easily shown in English. A common web browser separates its name from the page title by a hyphen where a dash should be used. Correct English typography also uses different apostrophes and quotation marks than the typewriters of our ancestors offered. For Polish it is more difficult, since even in print inner quotation marks are usually put in the incorrect order.
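The three Polish forms mentioned under counted objects can be illustrated with a small sketch. The selection rule below is the Polish plural expression given in the GNU gettext manual, rewritten as a Bourne shell function; the function names are made up for this example, and real software would let gettext choose the form instead of hard-coding it.

```shell
# Print the index (0, 1 or 2) of the Polish plural form for the number $1.
# Rule: n==1 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2
polish_plural_index() {
    n=$1
    if [ "$n" -eq 1 ]; then
        echo 0
    elif [ $((n % 10)) -ge 2 ] && [ $((n % 10)) -le 4 ] &&
         { [ $((n % 100)) -lt 10 ] || [ $((n % 100)) -ge 20 ]; }; then
        echo 1
    else
        echo 2
    fi
}

# Hypothetical helper: the label for a comment counter.
comments_label() {
    case $(polish_plural_index "$1") in
        0) echo "$1 komentarz" ;;
        1) echo "$1 komentarze" ;;
        2) echo "$1 komentarzy" ;;
    esac
}

for count in 1 2 5 12 22; do
    comments_label "$count"
done
```

Zero falls into the third form here (‘0 komentarzy’), which is exactly the case where a separate ‘no comments’ text would read better.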

There is one simple solution – write a program which uses completely correct English and let translators correct it until it is correct in other languages.

Using distcc with Gentoo


At home I use a computer with a quad-core AMD Phenom processor. Elsewhere I use a laptop with a slow dual-core AMD Athlon 64 and only 2 GiB of RAM. Both run Gentoo GNU/Linux. Since I update software on the laptop less often than once per week, it spends a lot of time compiling. Today I decided to use distcc so that the Phenom-based computer does some of this work instead of the laptop.

I used the official Gentoo Distcc Documentation to configure it. In my case it is simpler than described there. On the ‘slave’ computer I just installed distcc by emerge distcc, then changed the allowed IPs to my IPs in /etc/conf.d/distccd and started the service by /etc/init.d/distccd start (rc-update add distccd default then arranged for it to be started at each boot).

On the laptop I did the above and also edited the list of hosts in /etc/distcc/hosts and two parameters for Portage integration in /etc/make.conf. Both required some experimentation to keep the load average on each computer below its number of cores. In the first file I finally specified only the IP of the Phenom-based machine. In the second one I added distcc to FEATURES and set MAKEOPTS to -j9, which was recommended by the documentation when using 4 cores.
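The laptop-side settings described above would look roughly like the fragment below. It is a sketch, not a copy of my files: the address is a placeholder, and a real /etc/make.conf may already list other FEATURES, to which distcc would be added.

```shell
# Fragment of /etc/make.conf on the laptop (the file uses shell variable syntax).
FEATURES="distcc"   # enable the distcc integration of Portage
MAKEOPTS="-j9"      # the value named in the text for a 4-core helper

# /etc/distcc/hosts then contains a single line with the address of the
# helper machine, for example (placeholder address):
#   192.168.1.10
```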

The output of top shows that this works. I still need to adjust this configuration for packages which do not support distcc.

There are many websites accepting users’ photos, and many of them want very small images. They always specify a maximum file size, but many programs designed to help with this support only choosing the image’s dimensions. So I wrote a Bourne shell script which could help with this.

There is a simple reason why we can’t simply predict appropriate image dimensions – usually the files are compressed. I’ve written previously about using PNG or JPEG on the Web and that files in these formats usually can be made smaller without any visible changes. Therefore, a program should determine appropriate dimensions, knowing how large the output file might be.

My shell script uses binary search to determine the dimensions of the image. It tries to scale an image to an integer percentage of the original size, starting with the original size, then with half of it, and then recursing into the half in which the optimal size lies, i.e. the one containing the largest dimensions with a file size less than or equal to the specified value. For an image scaled to 17% of its original size it tried 9 scales, whereas without binary search it would test up to 100 cases (it would be nice if post offices used this method to find letters).
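The search itself can be sketched in a few lines of Bourne shell. This is not the script from my repository, only an illustration of the idea: the function and variable names are invented for the example, the measuring step is passed in as a command name, and the ImageMagick-based scaled_size helper assumes that an INPUT variable names the source image and that JPEG output is wanted.

```shell
# find_largest_scale MEASURE MAX_BYTES
# MEASURE is the name of a command or function which prints the output file
# size in bytes for an integer percentage given as its only argument.
# Prints the largest percentage (1-100) whose size fits, or fails if none does.
# Assumes that the size does not decrease when the percentage grows.
find_largest_scale() {
    fls_measure=$1
    fls_max=$2
    # Try the unscaled image first.
    if [ "$("$fls_measure" 100)" -le "$fls_max" ]; then
        echo 100
        return 0
    fi
    fls_lo=1
    fls_hi=99
    fls_best=0
    while [ "$fls_lo" -le "$fls_hi" ]; do
        fls_mid=$(( (fls_lo + fls_hi) / 2 ))
        if [ "$("$fls_measure" "$fls_mid")" -le "$fls_max" ]; then
            fls_best=$fls_mid            # fits: look for something larger
            fls_lo=$((fls_mid + 1))
        else
            fls_hi=$((fls_mid - 1))      # too big: look for something smaller
        fi
    done
    [ "$fls_best" -gt 0 ] || return 1
    echo "$fls_best"
}

# Hypothetical measuring function using ImageMagick; INPUT must be set.
scaled_size() {
    ss_tmp=$(mktemp) || return 1
    convert "$INPUT" -resize "$1%" "jpg:$ss_tmp" && wc -c < "$ss_tmp" | tr -d ' '
    rm -f "$ss_tmp"
}
```

With an image at hand it could be called as INPUT=photo.jpg; find_largest_scale scaled_size 100000, where both the file name and the limit are placeholders. The variables are prefixed instead of declared local, because local is not part of POSIX sh.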

Like many other shell scripts, this one was designed to be simple. It uses ImageMagick to convert the images between any file formats it supports and to scale them. If the output file name suggests that it is a PNG or JPEG, then an appropriate program is used to make the file smaller without changing its appearance. For JPEGs, comments and EXIF data are removed, since in my opinion they are not used by services requesting such small files.

This simplicity also leads to some possible problems with the script. Since it does not detect whether the input and output files have the same format, even unscaled files are passed through ImageMagick. For JPEGs this may decrease quality. For one of KDE’s wallpapers it even increased the file size by 10 KiB (not a real problem, since I asked it to make a file smaller than a much larger size). The binary search algorithm assumes that the file size increases with the dimensions; I’m not sure if this is always true when compression is used.

The script is licensed under the GNU General Public License, version 3 or later. It is available in my Mercurial repository. I tested it on Gentoo GNU/Linux with the dash shell.

Why I don’t use passwords to log in to my computers


I once wrote about the problems of authentication to web services. This post promotes one of the methods used to authenticate users logging in to their remote computers by SSH: public key authentication.

Several years ago I logged in to my server using a typical password. The password consisted of about thirty characters, easy for me to remember, probably difficult for others to guess. I entered the password each time I logged in.

Now I use public key authentication. My local computers store a large, random RSA private key, encrypted using a passphrase, while the remote ones store the public key. SSH clients access the private key and authenticate without sharing it.

There were two problems with the password authentication – usability and security. I don’t like to type long, difficult to guess strings of ASCII characters often, so I use ssh-agent to store the unencrypted private key in temporary memory, entering the passphrase only once per reboot.
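The setup described above can be sketched with the standard OpenSSH tools. The example is deliberately harmless: it creates a throw-away key in a temporary directory with a demo passphrase, and it assumes that ssh-keygen is installed. The agent steps need a terminal, so they are shown only as comments, with a placeholder host name.

```shell
# Sketch only: a throw-away key in a temporary directory with a demo passphrase.
# A real key would normally live in ~/.ssh and use a passphrase typed at the prompt.
demo_dir=$(mktemp -d)
demo_key="$demo_dir/id_rsa"

if command -v ssh-keygen >/dev/null 2>&1; then
    # Generate a 2048-bit RSA key pair; -N sets the passphrase which
    # encrypts the private key on disk.
    ssh-keygen -q -t rsa -b 2048 -N 'demo passphrase' -f "$demo_key"
    # Show the fingerprint of the public half, the part which would be
    # appended to ~/.ssh/authorized_keys on the remote computer.
    ssh-keygen -l -f "$demo_key.pub"
fi

# With a real key the agent steps would be (interactive, so left as comments):
#   eval "$(ssh-agent -s)"     # start an agent for this session
#   ssh-add ~/.ssh/id_rsa      # asks for the passphrase once
#   ssh user@server.example    # placeholder host; no further prompts
```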

Passwords have many security problems. A list of four such problems by Jeff Atwood shows that attackers usually guess passwords, possibly with a large number of tries in a dictionary attack, or get them when they are transferred. The first case is clearly improbable with a large random key (in OpenSSH by default 2048 bits); the second one is impossible, since the private key is not transferred.

Public key authentication is another example showing that one change may make a system both more secure and more friendly to the user.

Separate bibliographies per chapter in LaTeX with BibTeX


It may be uncomfortable to access bibliographical data several hundred pages from the reference. In many books there are other reasons to have a separate bibliography per chapter, for example in collections of papers. This post describes my attempts to do this for a typical text typeset in LaTeX with bibliographies formatted by BibTeX.

As with most other LaTeX-related problems, the first resource where I look for information is the UK TeX FAQ. Its page about this problem lists two packages – chapterbib and bibunits. So I began reading their documentation and adjusting the document for them.

chapterbib is mainly used to produce a separate bibliography for each \included file. It has two problems in my case. First, I don’t use \include, since it’s not needed with modern computers, although this would require only trivial changes (my text doesn’t need parts of two chapters on the same page, unlike some journals). Second, \bibliographystyle and \bibliography must be specified in each chapter. Since I prefer to specify such data only once (the bibliography style only in a document class), I could modify \include to add these commands, but this doesn’t look elegant enough and did not work when I tried. The package can also be used without \include, but that did not work in my case either.

Then I tried bibunits. I added the following code to the preamble of the document:

\usepackage[sectionbib]{bibunits}

\defaultbibliographystyle{plainurl}
\defaultbibliography{example-bibliography}

\newcommand{\bibinput}[1]{%
  \begin{bibunit}
    \input{#1}
    \putbib
  \end{bibunit}}

and replaced \input by \bibinput to input the chapters. To call BibTeX, instead of the C shell scripts recommended by the package documentation, I used the following Bourne shell command:

for i in `seq 1 9`; do bibtex bu$i; done

where 9 is the number of \bibinputs or any larger integer.
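A variant which avoids hard-coding the count is sketched below. It assumes that the auxiliary files written by bibunits are the bu1.aux, bu2.aux, ... files in the current directory, as in the command above; the function name and the BIBTEX override are inventions of this example.

```shell
# Run BibTeX once for every bibunits auxiliary file in the current directory.
# BIBTEX may be set to use a different program (an assumption of this sketch).
bibtex_all_units() {
    for unit_aux in bu*.aux; do
        [ -e "$unit_aux" ] || continue      # the glob matched nothing
        "${BIBTEX:-bibtex}" "${unit_aux%.aux}" || return 1
    done
}

bibtex_all_units
```

As with the loop above, LaTeX has to be run again afterwards so that the generated bibliographies are included.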

This produced appropriate output, although with the header of the bibliography on the following page. I corrected it by redefining \@mkboth in the environment making the bibliography:

\makeatletter
\let\oldtb=\thebibliography
\renewcommand{\thebibliography}{%
  \renewcommand{\@mkboth}[2]{}%
    \oldtb}
\makeatother

With this change the document looks correct when each chapter has its own bibliography.

Comparing publishing with TeX in 1980 and in 2009


Several days ago I decided to read some old articles about TeX and check if they are still useful. This post continues this by comparing facts described in another article from the first TUGboat issue with my experiences with modern TeX software.

Ellen Swanson described the general way in which books and journals are prepared, from the ideas of their authors to binding, and how TeX makes it cheaper. The article states that typesetting one page took about 1.5 hours before using TeX; in the same TUGboat issue Richard Palais wrote that this takes several seconds for TeX (now it is clearly faster). When the author typesets the paper using TeX, much greater savings can be made.

These arguments still look valid now, but several improvements have been made to the TeX-based workflow and new non-TeX systems have been designed. Today savings with TeX should be even greater, since:

  • TeX82 is better than TeX78
  • new macro packages (e.g. LaTeX or ConTeXt) make writing papers easier and support more logical formatting, allowing journal-specific styles to be used in a general way
  • MakeIndex and BibTeX (or their replacements) completely automate the non-creative part of making indexes and bibliographies, and journals may make their own styles for these without any changes to the source of the paper
  • instead of on a ‘magnetic tape’, the paper source can be sent to the publisher over the Internet
  • diff(1) and patch(1) or a version control system can be used to transfer changes made during e.g. editing
  • there are many more materials for learning TeX (or LaTeX/ConTeXt) than in 1980

Now the biggest competition to this method of publishing is WYSIWYG word processing programs. In my opinion the following disadvantages make their use more costly than the above method:

  • many of them are non-free and require expensive licenses
  • they are WYSIWYG and cannot be automated to the same extent as TeX, requiring more work by humans
  • their data formats make it more difficult to compare different versions of a paper
  • they either do not support logical markup or their users rarely know how to use it, making journal-specific styling more difficult
  • their output depends on the computer used, so both the author and the publisher end up optimizing line and page breaks
  • producing high quality output may be more difficult or impossible

So journals typeset with TeX should be cheaper, allowing better funding for really useful work like editing or high quality printing. It is best when all papers are typeset with TeX by the authors, but it is also beneficial if only some are and the rest is written with word processors and typeset in TeX by the publisher.

Are publications about TeX from 1980 still relevant?


Most of the creative works made today won’t be used after the next five years. The most successful ones, like Mickey Mouse, will make the less successful ones disappear from our memories together with the material on which they were made, due to copyright. Probably everyone who has used a computer for longer than ten years has seen another reason why older works become nearly unknown – new, better works are made and replace the older ones in all their uses. I once wrote about this process in the history of programming languages; it should be observable for every useful piece of software. For every program, newer, better ones will be designed and replace it. But does this conjecture apply to TeX, the typesetting system designed and implemented in 1977–1990? Are the ideas presented in old articles about TeX still valid and relevant to current users?

This post examines only a part of the first issue of TUGboat, the journal of the TeX Users Group, published in October 1980. Since then TeX and METAFONT have been rewritten in WEB (which, as Donald Knuth wrote, was designed in September 1981), and the LaTeX, ConTeXt and Texinfo macro packages have been developed. Printing has also changed completely, with PostScript introduced in 1984 and desktop publishing based on it.

The editor’s comments contain a nice description of how software helps writers – ‘TeX, AMS-TeX and METAFONT are software tools which will make the processing of scientific documents less painful, less expensive and more rapid. Authors working at editing terminals will find correcting easier and they will be spared much of the pain of proofreading’. Today the meanings of ‘scientific documents’ and ‘editing terminals’ have changed, so this description also applies to WYSIWYG word processors. I don’t know how difficult it would be to prepare a mathematical paper using hot-metal technology, but I’m sure that, given appropriate knowledge of both, it is simpler with TeX than with a WYSIWYG word processor. It is nice to read an article from 1980 which assumed that users may read a manual in order to use a program.

Then Richard Palais described several problems related to implementing TeX in Pascal. Although there were no better alternatives to Pascal as a language for portable programs, the Pascal compilers were too incompatible for this. For this reason system-dependent code in TeX was separated from the code working on all systems. Clearly portability is less important now – a few systems have replaced all others in nearly all uses. Also, system-dependent code is now usually shared by many programs. A program written in a high-level programming language (there are more of them than words in a typical blog post) can be run unmodified on every typical computer with every typical operating system. So the problem of writing portable programs may be considered completely solved.

After this Palais describes how to use TeX. It is nice to compare it with modern word processors. For a user of TeX it is easier to use styles designed by experts for the specific book or journal than to design their own. This was clearly reversed by word processors, which, in combination with inferior defaults, led to a large decrease in typesetting quality. Then some performance numbers follow, showing how slow TeX was in 1980. Improvements in hardware and compilers changed this, but word processors are still too slow to use, unlike TeX, which was useful even when producing a page took several seconds.

Then several details of how TeX works are explained. Except for enlarged font limits, these did not change in TeX82. It is stated that the output of TeX is device independent. Clearly, this is not true for some popular word processors or web browsers (Mozilla Firefox still prints text with glyph positioning optimized for the screen). The following description of METAFONT does not state its advantages over more recently developed font formats like PostScript Type 1; it states only that it allows fonts to be used on different devices.

Palais also describes an important advantage of papers being typeset by their authors – half of the costs of journals result from the ‘activity of adding errors and then removing them’ which is done when typesetting the paper. Another cost related to a journal is storing it on paper and photocopying useful articles. Palais wrote that in the future journals will be stored in electronic form and printed only when needed. For efficiency reasons, the articles would be stored in a device-independent format. It’s a very useful idea; I have read about it in the electronic form of a journal article (screens are now good enough to avoid printing such articles). Clearly, electronic journals are implemented by the World Wide Web, thirteen years after the first issue of TUGboat was written. Although HTML, on which the Web is based, does not support ‘real’ mathematics, storing PDF files with articles (typeset by TeX) is a direct implementation of the idea.

These two texts describe both issues which have not changed and some which have changed completely. TeX has improved and newer alternatives have been made, but these articles are still useful. Later posts will examine more articles from this issue.

Most electronic mail clients support storing mail in folders; all of those which I have used support filtering new mail into appropriate folders. When mail is divided into several folders, e.g. for some mailing lists to which I’m subscribed or for some automatically sent messages, it is easier to manage. I prefer to read more important mail before the less important (or less important first?) and these folders make it nicer.

It’s fine with one computer running a mail client, but the situation becomes more complicated with two computers. Then IMAP is necessary (excluding the use of webmails, but they usually use only a single computer). It is the protocol which stores mail on the server, is fast and allows realtime notifications for new messages (this feature is not currently supported by Kmail, but is planned).

I have used IMAP since I first configured my mail server. Initially this was because most free software webmails require IMAP, but about a year later I began to access my mail from two computers. The filters were run by only one of the mail clients, so they became less useful. I could have configured them on both of my client computers, but that would require sharing the settings.

I have only one mail server, so I decided to filter the mail on the server. I quickly found the well-known Procmail program, which is used for that. Since I use the Gentoo GNU/Linux operating system, I installed the mail-filter/procmail package. To my Postfix SMTP server configuration file /etc/postfix/main.cf I added

mailbox_command = /usr/bin/procmail

just as the comment in the file shows (as stated by the comment, I forward root’s mail to my normal user). Then the command /etc/init.d/postfix reload requested Postfix to reload the configuration file.

In my home directory I put a file .procmailrc instructing Procmail what to do with my mail. It begins with

MAILDIR=$HOME/.maildir/
LOGFILE=$HOME/.procmaillog
LOGABSTRACT=no
VERBOSE=off

which contains generic parameters used by Procmail. I use maildirs with the default Gentoo path of $HOME/.maildir. For all maildirs the trailing slash should be used in Procmail configuration, since it specifies the file format. I learned it by finding an mbox file when the slash was omitted.

Then specific recipes are stated. One of the ones which I use is

:0
* ^From: *forum-mods@gentoo\.org$
$MAILDIR/.gentoo-forums/

The first line begins with :0 and may contain specific flags which I don’t use in this recipe. The lines beginning with an asterisk contain an extended regular expression which is by default matched in the mail headers (in this case the sender address). If all conditions are satisfied, then the action specified in the last line is performed. Here it moves the mail to the gentoo-forums folder in my mailbox.

I don’t use more complicated recipes in my .procmailrc file. See the procmailrc(5) man page for the complete syntax of these files. The features explicitly stated in the man page look similar to the filtering features of the Kmail mail client, but much more is possible.
