Converting LaTeX to HTML


Documents typeset with LaTeX are usually shared electronically as PDF files or printed on dead trees. Since on the Web it is clearly better to use HTML than PDF, it may be useful to make these documents available in HTML as well. This post lists the problems associated with such a conversion and the programs that try to solve them.

Limitations

The UK TeX FAQ lists three main problems:

  • exact page formatting cannot be represented in HTML
  • mathematics can be represented only as bitmaps, as tables using a symbol font, or in MathML; none of these is supported by every browser, and all except MathML are slow because of the amount of data transferred
  • not all converters support custom macros

The first problem makes some TeX documents impossible to represent in HTML in a useful way. For example, the following document’s text depends on its layout:

\documentclass{article}
\usepackage[a4paper]{geometry}
\usepackage{lipsum}

\begin{document}
\lipsum[1]

\edef\nlines{\the\prevgraf}

The previous paragraph has \nlines\ lines.
\end{document}

The text changes when the paper size is changed, e.g. to A5. Clearly, any fixed text in HTML would either differ from this or be false on at least some systems. Therefore the following programs will be tested only on a typical document that looks roughly like a mathematical book.
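To make "a typical document" more concrete, here is a minimal sketch of the kind of input such a test might use. It is a hypothetical stand-in, not the actual sample file: the chapter title, labels and theorem are invented for illustration. It exercises the features discussed below: a theorem environment, cross-references, a displayed formula and a non-ASCII letter in UTF-8 input.

```latex
\documentclass{book}
\usepackage[utf8]{inputenc}  % UTF-8 input, needed here for the letter ł
\usepackage[T1]{fontenc}     % font encoding that contains ł
\usepackage{amsmath,amsthm}

% Theorems numbered within chapters, as in many mathematical books.
\newtheorem{theorem}{Theorem}[chapter]

\begin{document}
\chapter{Sums}

\begin{theorem}\label{thm:gauss}
For every integer $n \ge 1$,
\begin{equation}\label{eq:gauss}
  \sum_{k=1}^{n} k = \frac{n(n+1)}{2}.
\end{equation}
\end{theorem}

\begin{proof}
By induction on $n$.
\end{proof}

% Cross-references and a non-ASCII letter for the converter to handle.
Theorem~\ref{thm:gauss} and equation~\eqref{eq:gauss} test references;
the letter ł tests UTF-8 input.
\end{document}
```

A converter that handles this file well should keep the theorem as text, resolve both references, and show the formula without missing symbols.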

Comparing the programs

Of those listed in the UK TeX FAQ, four are available in Gentoo GNU/Linux (so other operating systems’ package managers probably also make them easy to install). Since two of them use tables for mathematical expressions, I haven’t used them. The other two are LaTeX2HTML and TeX4HT.

As the UK TeX FAQ states, LaTeX2HTML is a Perl script that does not use TeX to process the files it converts. Therefore it does not support some packages, and it completely ignores the macros of my acronyms package. I could read the LaTeX2HTML manual to check whether it supports loading LaTeX packages or whether I would have to write a special version of my package for it, but I thought it would be simpler just to try TeX4HT.

Although LaTeX2HTML ignores unknown commands, it passes unknown environments to LaTeX. Therefore all theorems in my document are rendered as bitmaps, and references to them do not work correctly. Their layout is also strange. It would be possible to change LaTeX2HTML to support such environments, but it seemed simpler to try the alternatives first.

TeX4HT differs from most similar programs by using TeX itself to interpret the LaTeX code. Therefore it understood my acronyms macros correctly, and it would be possible to write a package that typesets them differently for HTML. Theorems and the ‘ł’ in my name were typeset correctly (the input file used UTF-8 encoding, which programs interpreting LaTeX input rarely support).
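As a rough sketch of how such a package could switch its output: TeX4HT defines the macro \HCode, which writes raw markup into the generated file, so a preamble can test whether \HCode exists and choose a definition accordingly. The macro name \myacronym and its two-argument form are hypothetical, not the interface of the acronyms package discussed here, and the <abbr> element is only one possible choice of markup.

```latex
\documentclass{article}

% Hypothetical macro: \myacronym{short form}{long form}
\ifdefined\HCode
  % Compiling under TeX4HT: wrap the short form in an HTML <abbr> element.
  % Assumes #2 is plain text without characters special to TeX or HTML.
  \newcommand{\myacronym}[2]{\HCode{<abbr title="#2">}#1\HCode{</abbr>}}
\else
  % Ordinary LaTeX run (PDF or DVI): print only the short form in small caps.
  \newcommand{\myacronym}[2]{\textsc{#1}}
\fi

\begin{document}
Pages written in \myacronym{html}{HyperText Markup Language} can be read
in any browser.
\end{document}
```

Keeping both definitions behind one macro name means the document body stays the same for PDF and HTML output.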

Mathematical formulas, however, were still typeset incorrectly (as bitmaps): some symbols were missing, and some parts were converted to HTML text and so were not aligned on the math axis.

Therefore, if I try to convert a LaTeX document to HTML, I will probably use TeX4HT and try to determine why the formulas in my sample document are not converted correctly.
