September 2009 Archives

Difficulties of typesetting quote marks in LaTeX

| No Comments | No TrackBacks

Probably the most complicated to typeset punctuation marks used in English are the quote marks. Although they should be used for short and simple quotations and other simple fragments of text, they are designed for more arcane uses. This combined with the influence of typewriters makes typesetting them difficult.

Quote marks are used exactly like parentheses – they delimit a fragment of a sentence. But unlike all other such characters, inner quotation marks are different symbols than the outer ones (unless larger outer delimiters in mathematical formulas count as different symbols (they are the most ‘mainstream’ use of parentheses in parentheses)). Another difference is that ‘((’ is easily interpreted in correct way, while ‘“ needs additional spacing (‘ “).

Another problem is that each language has different quote marks. American English uses double outer quotes and inner single quotes, British English uses them as inner and outer, Polish has low double opening quote and English double closing quote, the inner quotes are the French ones, although they are rarely used correctly. American English also includes following commas and periods in the quotes, obviously this would lead to problems in programming-related texts.

LaTeX does not solve these problems, but allows direct specification of appropriate symbols. The English quote marks are represented as ``, '', ` and ', since these are the nearest equivalents on a typical keyboard. The " character is not used and the space between quotes must be specified as \, (it could be specified in the font, but this would require separate sets of fonts for American and British texts, I’m not sure if it could support third level nested quotes in any of these dialects).

It would be interesting to use just the " character and let the software decide which quotes are opening and which are closing. But even without support for nested quotations this would be difficult (if possible) to do correctly in all cases. A naïve algorithm would just begin with an opening quote and then cycle between closing and opening ones. But this won’t interpret correctly quotes in multi-paragraph dialogue, where each paragraph begins with an opening quote (in Polish dashes are used instead of quotes for dialogue and there is no possibility for humans to interpret multi-paragraph dialogue correctly without backtracking). A common mistake in delimiting block quotations with quote marks may result in a paragraph containing only a closing quote mark, so this algorithm cannot be improved by just resetting to opening quote at each new paragraph.

Emacs uses a different algorithm in the TeX-insert-quote function. It puts opening quotes after whitespace or opening parenthesis. This method could not be implemented in LaTeX, but it can be done in language-specific fonts. But this algorithm fails when quoting spaces or parentheses, like ‘(’, which is commonly done in programming-related texts.

The only problem which can be easily solved is which quotes to use. I have written a LaTeX package for this, named quoted (available in my Mercurial repository), but it does not support spaces between quote marks of different levels or moving punctuation to the quotation. There are probably many better packages for this, but this will not make a useful document ‘portable’ between e.g. British and American dialects of English, so such packages aren’t very useful.

Since parentheses are similar to quotes, but simpler, maybe a single character in source files could be used for them. In times of typewriters a slash was sometimes used instead of parentheses, since it looks similar. Is it possible to implement a LaTeX macro or virtual font replacing / by a slash or appropriate parenthesis depending on context?

Three HTML elements improving document usability

| No Comments | No TrackBacks

One of the main advantages of the Web is that nearly everyone can use it. The same document may be rendered in very different ways on different devices. This is the reason why HTML, the markup language used for most text on WWW, specifies semantics of documents instead of their appearance. Therefore many tags and attributes in HTML are not visually significant, but they can it easier to get useful information from the text. This post lists three common things which can be improved with such elements.

acronyms

Many text use large numbers of acronyms and abbreviations, but their meaning is not always remembered by the readers. Many acronyms also have more than one meaning, e.g. technical texts about AMD GPU support on GNU/Linux use the DRM acronym in two meanings – one very useful and one very harmful.

The solution is to specify the meaning using the title attribute of the acronym element, in the GPU example it would be:

<acronym title="Direct Rendering Manager">DRM</acronym>
allows more optimal use of modern hardware.

I use this element for first use of each acronym in all posts on my blog. The HTML 4.01 specification describes also the abbr element used for abbreviations. I’m not sure which one of them should be used in which situation.

link titles

It is nice to know where a hyperlink leads. Therefore it should be appropriately described, by text and additional information provided using the title attribute. Some sites have readable URIs, but they should not be the only information allowing a user to decide if the page linked to is used. Using the same example as previous, link with a title may be written in HTML as

<a href="http://en.wikipedia.org/wiki/Direct_Rendering_Manager"
   title="Wikipedia: Direct Rendering Manager">very useful</a>

Many guidelines for using link titles are specified in Alertbox by Jakob Nielsen. The simplest rules to follow when using link titles is to not duplicate nearby information in them and to provide name of the resource linked to (very useful when the context does not specify this and the URI is numeric).

definition lists

In this list it is probably useful to quickly scan names of its elements and read the more useful ones. Definition lists are formatted for such use. In HTML they are specified using three elements – dl contains the whole list, the elements of which are dt containing the defined term and dd containing the definition (each term may have many definition and each definition may have many terms).

Unlike the previous elements, definition lists are equally useful in print. This might make them popular and easy to use correctly, although commonly itemized lists with term and definition separated by a dash are used instead. In my opinion the lack of support for definition lists in a popular word processing package contributes to this (fortunately LaTeX and wikis have equally good support for this as for the other types of lists).

They clearly should be used for definitions, but the HTML specification suggests using them for dialogue, although there are arguments against it.

These elements have also a disadvantage – sites without them are probably even less usable for people who correctly specify acronyms, link titles and definition lists.

Making dashes from hyphens in LaTeX

| No Comments | No TrackBacks

Probably many users of LaTeX (including me) learned that dashes and hyphens look differently from texts about LaTeX. Many people, supported by keyboards limited to ASCII with some national and unused characters, write only hyphens, with various spacing around them, instead of dashes. Could a LaTeX user just include such text in their document and have correctly distinguished hyphens and dashes in the output? This post describes an attempt in this direction.

LaTeX already uses the ASCII hyphen character for both hyphens, minuses and dashes. If it is used in math mode, then it is a minus. Otherwise, - becomes a hyphen, -- an endash and --- becomes an emdash. The difference between endashes and emdashes lays only in their appearance, different languages require different ones with different spacing. This is the reason why I wrote the onedash package providing a single command, \dash, for typesetting the correct dash in the language and style of the document.

My new package, hyphdash, makes a dash or hyphen from a single hyphen with correct spacing. Both it, onedash and quoted (an equivalent of onedash for quotes) are available in my Mercurial repository. They are licensed under the GNU General Public License, version 3 or later.

Hyphens have two uses – they appear in compound words and in words divided across lines. Fortunately, the second use is done automatically by LaTeX and does not affect writing the package. Compound words do not have any spaces before the hyphen, but in lists like ‘mono- and polycrystals’ they may be followed by a space.

Dashes are sometimes surrounded by equal spaces – like in the British style used on this blog – or without spaces—like in the American style used in this sentence—or by unequal spaces. Usually the left space is unbreakable. This package assumes that dashes are surrounded by any normal spaces, i.e. input characters interpreted as spaces by TeX. (Unbreakable spaces appear more arcane than dashes or even inner quote marks, so they are unsupported in input by this package, but the output will have them.) TeXnical reason for this will be stated later.

My package does nearly all of its work in a macro which the hyphen made active character is defined to (see the packages’ README file of information about using this package). This expands to \relax followed by a normal hyphen if math mode is used (the \relax is probably useful in tables). The same result is obtained in horizontal mode if the current font have nonpositive space stretch parameter which probably occurs only for typewriter fonts.

In vertical mode a special dash is used, useful for representing dialogue in Polish texts (English use quote marks for this, making it easier to determine where a multiparagraph speech ends, and making inner quotes common). Probably no word begins with a hyphen, so this is used. Unix and GNU programs have commandline options beginning with a hyphen, but they are typeset in typewriter type (so there is a special case for it).

The complex part lays in the horizontal mode. Hyphens do not have leading spaces, so the are made if \lastskip does not contain positive value. So the common incorrect form of dash, alpha- beta will be kept as a hyphen. In other cases a dash is made. This ignores the possibility of having numeric ranges with endashes, like ‘69–105’. Instead, 69-105 will use a hyphen and 69 - 105 will use a much more incorrect (but more probable to be included in the input?) dash. Detecting digits before a hyphen is impossible without making all characters active (this could work only for verbatim typesetting of files, but this does not need dashes), and detecting them after the hyphen won’t distinguish such cases as ‘69–105’ and ‘2-chloro-3-methylpentane’.

There is also another problem – dashes represented by multiple hyphens. The first one will be made a dash, but the following ones (since there is no preceding space) will become hyphens. The standard ligatures for dashes will not be used. Maybe the macro could detect following hyphens, but ‘simple’ solutions like \@ifnextchar used for optional parameters will not work, since the hyphen is a complicated macro instead of a character. Changing catcodes (e.g. redefining the hyphen to a character) will not work with texts changing catcodes. The first hyphen could set a conditional to ignore following ones, but it would be difficult to change it to not ignore hyphens in the next dash. The rest is probably more difficult that these solutions. Therefore only a single hyphen may be used with this package as a dash. The macros \textendash and \textemdash may be used instead of multiple hyphens to make a dash character.

Another problem are hyphens used as minuses for numbers interpreted by TeX. For example, \hspace{-1em} will produce strange results and two error messages. In my opinion the only solutions to this are to not use a hyphen in arguments of such commands (e.g. by using the \hyphen macro or by putting all such things into the preamble), or to redefine each primitive TeX command to change the macro making dashes and hyphens (probably it is impossible to detect where such changes should be made). It is obvious why the first solution is used in the package.

This package uses also an example document as an automatic test to find regressions in future versions (I’ve written previously about this in other packages for dashes and quotes). From the example I learned about most of the limitations of this package described here.

Finding duplicate files using GNU uniq

| No Comments | No TrackBacks

The standard POSIX program uniq may list repeated lines from its output. Its implementation in GNU Coreutils supports listing all occurrences of such lines. How could this be used to list files of the same content located in a given directory?

The solution is to input to uniq with appropriate options a file consisting of lines isomorphic to the contents of files to be compared. Since uniq compares only adjacent lines, this file would have to be sorted.

Therefore a one-to-one function from files to lines should be used. Cryptographic hash functions are treated as such, although they aren’t – they accept any finite byte sequence as input and output a constant size byte sequence. There are many hashes which are now considered insecure enough for important systems (e.g. when it is easy to obtain reverse mappings or different inputs with the same output), but SHA-512 is currently used in such systems. GNU Coreutils have a program called sha512sum computing this hash of files given on the command line, so it can be easily used for this task.

Each output line of this program consists of 512 bit hexadecimal hash (i.e. 128 hexadecimal digits) and the file name, separated by several spaces. Clearly, it would be useful to know the names of repeated files, not only their SHA-512 hashes, so the output of sha512sum will be wholly passed to uniq.

The list of files for sha512sum can be generated by find $dir -type f where $dir is any directory. This command will output each file in this directory or its subdirs. The test -type f requests it to output only regular files, since they are the only ones with content.

The whole pipe listing names of duplicate files is:

find $dir -type f -print0 | xargs -0 sha512sum \
    | sort | uniq -w 128 -d --all-repeated=separate \
    | sed 's/^[0-9a-f]\+ \+//'

The command xargs passes its input as arguments to sha512sum. Since file names may contain spaces, the options -print0 of find and -0 of xargs will request them to separate the file names with zero bytes to avoid treating a file name with spaces as names of two files.

The arguments of uniq do things described before – -w 128 limits the comparison to first 128 bytes of each line, i.e. to the hash, -d omits unique lines from the output, and --all-repeated=separate outputs all repeated lines, separated by blank lines (of these three options only -d is required by POSIX and supported by uniqs used in BSDs). The final sed expression omits the hash from the output which is probably not useful.

Here sort is used only due to the way in which uniq works. It doesn’t look useful to have different files sorted according to their SHA-512 sums. It might be useful to have files in each duplicate group sorted alphabetically, but this probably could be done faster, since multiple sorts of small sequences are faster then a single sort of their sum (here also the output will be probably much smaller than the input – on my system running the above pipe on /usr/share/man gave only about one third of lines of the output of find /usr/share/man -type f, including extra blank lines). Programs like my ununiq find all repeated lines of an unsorted input, but it does not support the options necessary for this task.

Porting ununiq to OpenSolaris

| No Comments | No TrackBacks

In previous week I began writing ununiq – a program listing unique lines of its output, but unlike uniq using a hash table and treating also nonadjacent lines as duplicates. Today I decided to check if it will compile on an operating system not using the GNU C library – OpenSolaris.

Using virtualization software like KVM or Virtual Machine Manager it is easy to access several free operating systems at once, so (after some networking-related voodoo) I installed OpenSolaris 2009.06 on a virtual machine without any problem. After installation I found that I cannot log in to the system with empty password (I do not see any security advantage of a password on such virtual machine), so I installed it again with some trivial passwords. I called this new system Laurelin.

Then on Laurelin I installed GNU Autoconf and Automake in newest available versions, i.e. several releases older than on Gentoo. GLib also was trivial to install. Then I tried to use autoreconf, but it didn’t work, since unlike Gentoo OpenSolaris does not link aclocal-1.10, automake-1.10, etc, to files without version numbers. Most probably there are better solutions, but for this one use it was easier to just call these programs directly.

This showed the first problem – other systems do not have newest Autotools – so packages should support the elder ones or provide files generated by newer versions, like configure, Makefile.in and many others. In this case it was easier and probably better to remove unnecessary Automake options introduced in 1.11 and allow Autoconf 2.61.

Then running ./configure showed another problem – the shell found syntax errors in unexpanded M4 macros. This was caused by missing pkg-config, since my package did not include its macros and it wasn’t required by GLib in the OpenSolaris package system. The solution was to install pkg-config. Such problems make source-based operating system distributions more appropriate for programming, since they need such packages to install GLib.

The linker found the next problem, lack of the getline() function. My configure script did not check for it, since it wouldn’t do anything with this. Although this function is included in the newest POSIX, GNU C library treats it as a GNU extensions and other C libraries do not support it. The solution was to check how bad the portable alternatives are and choose one of them.

Therefore I added to the package an implementation of getline() from Gnulib. Then the package compiled successfully on both OpenSolaris and Gentoo.

A test suite checked if this package works correctly with a small subset of its possible options and only one input file. Porting software to other systems should begin with writing a more complete test suite, but this still was enough to improve some parts of the program.

Converting LaTeX to HTML

| No Comments | No TrackBacks

Documents typeset with LaTeX are usually shared electronically in PDF files or printed on dead trees. Since on the Web it is clearly better to use HTML instead of PDF, it may be useful to make these documents available in HTML. This post lists the problems associated with such conversion and programs trying to solve these problems.

Limitations

The UK TeX FAQ lists three main problems:

  • exact page formatting cannot be represented in HTML
  • mathematics can be represented only as bitmaps, tables with symbol font, or in MathML; each of these is not supported by every browser and except for MathML they are slow due to the amount of data transferred
  • not all converters support custom macros

The first problem makes some TeX documents impossible to represent in HTML in a useful way. For example, the following document’s text depends on its layout:

\documentclass{article}
\usepackage[a4paper]{geometry}
\usepackage{lipsum}

\begin{document}
\lipsum[1]

\edef\nlines{\the\prevgraf}

The previous paragraph has \nlines\ lines.
\end{document}

The text changes when the paper size is changed to e.g. A5. Clearly, any text in HTML would be different than this or false on at least some systems. Therefore the following programs will be tested just on a typical document, looking slightly like a mathematical book.

Comparing the programs

Of these listed at the UK TeX FAQ four are available in Gentoo GNU/Linux (so probably other operating systems’ package managers also allow them to be installed easily). Since two of them use tables for mathematical expressions, I haven’t used them. The two others are LaTeX2HTML and TeX4HT

As the UK TeX FAQ states, LaTeX2HTML is a Perl script does not use TeX to process the files to convert. Therefore it does not support some packages and completely ignores macros of my acronyms package. I could try to read the LaTeX2HTML manual to check if it supports loading LaTeX packages, or if I should write a special version of it for this. But I thought that it would be simpler to try just TeX4HT.

Although it ignores unknown commands, it passes unknown environments to LaTeX. Therefore all theorems in my document are rendered to bitmaps and references to them do not work correctly. Their layout is also strange. It would be possible to change LaTeX2HTML to support such environments, but it would be simpler to try other alternatives first.

TeX4HT differs from most similar programs by using TeX to interpret LaTeX code correctly. Therefore it understood my acronyms macros correctly and it would be possible to write a package which would typeset them differently for HTML. Theorems and the ‘ł’ in my name were typeset correctly (the input file used UTF-8 encoding, rarely supported by programs interpreting LaTeX input).

Mathematical formulas still were typeset (into bitmaps) incorrectly, with missing symbols and some parts converted to HTML (so they were not aligned on the math axis).

Therefore, if I will try to convert a LaTeX document to HTML, I will probably use TeX4HT and try to determine why the formulas in my sample document are not converted correctly.

Implementing an ‘unsorted uniq’ called ‘ununiq’

| No Comments | No TrackBacks

In this week I described one specific features of the standard Unix program uniq – it counts only adjacent identical lines as duplicates, although it would be technically interesting to implement such program without this restriction while not sorting the input. Then I wrote three implementations of such program and compared their performance on unpublished private data based on my blog’s access log.

Today I decided to make one of these three programs a nearly replacement for sort | uniq for situations where fast online algorithm without changing the order of input lines is better then possibly smaller memory use. For this I implemented support for some of uniq’s command line options and plan to write a man page.

I named the program ununiq, since this name is easy to pronounce and opposite of uniq. The program’s source is available in my Mercurial repository. Like most of my programs it is licensed under the GNU General Public License, version 3 or later.

Choosing programming language

The choice between three implementations of the basic algorithm – one in C with GLib, one in C++ with Qt and one in Python – was obvious for me. I wanted it to be:

  • faster than sort | uniq (all three programs satisfied this on the test data)
  • easy to write and maintain
  • available on all of my computers without installing too much dependencies (so without Qt, my server doesn’t need it).

Performance of my program in C with minimal C++ and Qt was worse than the one in C with GLib, so instead of using more C++ I decided to use the program written in C with GLib.

Command line options of uniq

Before this project I used an option of uniq only once, when a complicated pipe counting the number of secondary schools in each province from a CSV file contained sort | uniq -c. Now I know how to use other options of this program.

Now my program supports input and output with named files, and also skipping several leading characters of each line (the -d option). I implemented also the -d option which outputs only repeated lines, but it writes them on the second occurrence. The program would be more complicated with this option while preserving the original order, so it currently ignores this order.

The options are accepted both as short options and GNU long ones. Their support was very easy to implement using GLib command line option parser. Long options have the same names as in GNU Coreutils.

Future

Several currently not implemented options will require reading whole input before starting output, e.g. -c which counts how many times a line appears in the input. It should still be faster than sort | uniq. A similar observation may be made when sorting the output of ununiq, when operating on the perltoc man page ununiq | sort is faster by half than sort | uniq, although both make the same output.

The program needs also more documentation and testing. It would be nice with automatic testing of all options, which would also be very helpful for porting to other operating system.

Installing ReactOS on KVM virtual machine on Gentoo

| No Comments | 2 TrackBacks

It is useful to access more then one operating system when writing a program designed to be used with a different one. This is one of situations where platform virtualization is helpful. This post describes how I used KVM on a Gentoo GNU/Linux host to use ReactOS on a virtual machine.

In Gentoo’s Portage tree KVM userspace part is available as app-emulation/kvm. By default it uses its own modules, but this can be disabled by emerging it with USE flags havekernel -modules when the kernel has appropriate options enabled.

The ebuild states that user running KVM must be in the kvm group. So I added my user to this group using recommended gpasswd -a <USER> kvm To avoid logging out, I used su - <USER> in a terminal emulator, but this did not solve the problem. Then I read that this package also added rules for udev changing the group of /dev/kvm to kvm. Since I hadn’t restarted udev, I used chgrp kvm /dev/kvm as root to have this set before next reboot. Then KVM worked correctly.

KVM looks simple to use without reading the whole manual. The command kvm-img create c.img 4G made a disk image of $ GiB in the file named c.img, then I could use KVM with this by just specifying this file as a disk of the virtual machine. I used kvm -hda c.img -cdrom ReactOS.iso to install this system. It worked correctly and except for nicer keybindings similar to installation of Microsoft Windows.

After starting the newly installed system I noticed a trivial problem, kvm by default uses UTC time, while ReactOS (like Windows) supports only localtime. After checking it in the man page, I added the -localtime option to this command.

Since network access would be useful, I tried to configure it. Fortunately, KVM has not only a man page but also several HOWTOs on its website, one of them shows how to configure networking. I used user networking with additional options -net nic -net user, since it is the simplest to configure. But ReactOS didn’t have a driver for the default virtual NIC. When I had similar problem with Windows XP on real hardware, I changed the NIC to a different one, then another (from newest to oldest one), then to a one with a CD containing drivers. ReactOS website has a list of supported NICs which shows that most of them require downloading non-free drivers. But PCnet has included drivers and is supported by KVM, so I changed -net nic to -net nic,model=pcnet and it worked correctly.

Documentation of KVM shows that much more can be done in this case. It is nice to use working software with man pages encouraging learning about it.

Yesterday I wrote about listing unique lines of an unsorted file or pipe. That post contained a trivial Python script which lists unique lines of its input without sorting. Then I wrote that such program could be faster than sort -u, which obtains the same lines but sorts them, and could be more useful in some situations. So I decided to implement it in C.

The Python script looks like this:

import sys

last = set()
for line in sys.stdin:
    if line not in last:
        last.add(line)
        sys.stdout.write(line)

It is trivial to implement in C, except that it uses a set. Here set is just a hash table with any values and keys which are input lines. Since the only implementation of hash tables in C standard library requires stating its maximal size and does not support resizing (except for copying data to a larger hash table, but having more than one hash table with this is a GNU extension), I decided to use GLib’s hash table implementation. Since GLib is used by GTK+, I decided to wrote the same program also in C++ with Qt, to compare which one would be more elegant and faster.

To test performance of these programs, I used a file containing all IP addresses from this blog’s access logs (obtained by cut -d ' ' -f 1 access_log > ip-addresses). Since this was small, I copied this ten times to make a larger input file. It is nearly real world data and I remember using sort | uniq with such data. IP addresses can be sorted, but their order does not have any significant meaning here. Therefore I will measure the performance of my programs using just commands like time ./guniq < ips > ips-guniq and computed average real time of three runs. To check if results of all these programs all the same, I used diff.

For the Python implementation average time was 0.306 second. For the C one with GLib it was 0.207 second. For program written in C++ with Qt it took 0.267 second.

For the same file sort -u took 1.581 second. This clearly shows that getting unique lines of a file is faster without sorting, at least with this file. Performance of my three programs may result from lack of appropriate optimizations, for the C++ one copying of all input lines look as good reason for its longer average run time. And in case of elegance, in my opinion the Python script is best and the C++ program looks slighly simpler than the C one. Most probably the C++ program could be simpler if it used I/O interfaces typical to C++, but in this case it was simpler to use the same ones as in C.

All these programs are available in my Mercurial repository, they are licensed under GNU General Public License, version 3 or later.

Listing unique lines of an unsorted file or pipe

| No Comments | No TrackBacks

The common Unix program uniq by defaults outputs its input with repeated lines omitted. It is useful in many situations, but it supports only sorted files.

In many cases it is appropriate to sort the file, by sort | uniq or sort -u. But there are situations where the order of the lines is important. How to get the first occurrence of each line in such case, in the original order?

This trivial Python script shows how uniq might work without any options:

import sys

last = None
for line in sys.stdin:
    if line == last:
        continue
    else:
        last = line
        sys.stdout.write(line)

When tested on the largest man page in my system, perltoc.1, it gave the same output as uniq (although it was slower). This algorithm needs to store only two lines at once and is completely online, i.e. to output a line it does not need following input lines. It is clearly better for infinite texts, but for finite ones it could be made more useful when storing all previous lines.

This script stores all previous lines in a set:

import sys

last = set()
for line in sys.stdin:
    if line in last:
        continue
    else:
        last.add(line)
        sys.stdout.write(line)

For sorted file it produces exactly the same output as uniq or the previous script. For unsorted files it omits repetitions of all previous lines, not only the last one.

This algorithm also is online, since it stores only previous lines, but it has an disadvantage in comparison with other ones – it uses more memory than is used for storing the file (for efficiency reasons the set as a hash table is larger than its contents).

This could be a problem, but sorting also requires storing the whole file in memory (excluding more complicated external sorting). So when the order of lines is not important, this script would use similar amount of memory as sort -u, but it would start its output much more quickly. With optimized implementation in C it would be even faster then sort -u (since all sorting algorithms need more time than linear proportional to the input size, while this could with appropriate hash table implementation usually work in time proportional to input size).

liability-deltoid