<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>MTJM’s blog</title>
    <link rel="alternate" type="text/html" href="http://blog.mtjm.eu/" />
    <link rel="self" type="application/atom+xml" href="http://blog.mtjm.eu/atom.xml" />
    <id>tag:blog.mtjm.eu,2010-02-13://2</id>
    <updated>2010-07-03T12:36:08Z</updated>
    <subtitle>A blog about GNU/Linux, LaTeX, programming and Web development.</subtitle>
    <generator uri="http://www.sixapart.com/movabletype/">Movable Type 5.01</generator>

<entry>
    <title>Automatic cross references in LaTeX and their problems</title>
    <link rel="alternate" type="text/html" href="http://blog.mtjm.eu/2010/07/automatic-cross-references-in-latex-and-their-problems.html" />
    <id>tag:blog.mtjm.eu,2010://2.70</id>

    <published>2010-07-03T12:26:16Z</published>
    <updated>2010-07-03T12:36:08Z</updated>

    <summary> Most non-fiction texts have numbered sections and many references to them, sometimes stating also which page is referred to. Using LaTeX numbering of pages, sections and all such references is completely automatic, making these numbers nearly always correct. However,...</summary>
    <author>
        <name>Michał Masłowski</name>
        
    </author>
    
        <category term="LaTeX" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="references" label="references" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.mtjm.eu/">
        <![CDATA[     <p>Most non-fiction texts have numbered sections and many
    references to them, sometimes also stating which page is referred
    to.  With LaTeX, the numbering of pages, sections and all such
    references is completely automatic, so these numbers are nearly
    always correct.  However, the method by which it is implemented is
    not optimal.</p>

    <p>It is used by writing the <code>\label{<em>id</em>}</code> command in
    the section to be referred to, where <code><em>id</em></code> identifies
    the section, preferably being easy to remember and not changed too
    often.  This makes it possible to use the
    <code>\ref{<em>id</em>}</code>
    and&nbsp;<code>\pageref{<em>id</em>}</code> commands, which typeset
    the number of the section or its page number.  (References may
    also lead to the page of any text or to the number of an equation,
    list item, theorem, etc.; I refer to all of them as ‘sections’ in
    this post.)</p>
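
    <p>A minimal sketch of these commands in use (the label name
    <code>sec:intro</code> is only an illustrative choice, not something
    LaTeX requires):</p>
    <pre>\documentclass{article}
\begin{document}
% \label records the number of the current section and its page.
\section{Introduction}\label{sec:intro}
Some text.

\section{Details}
% \ref typesets the section number, \pageref the page number.
See section~\ref{sec:intro} on page~\pageref{sec:intro}.
\end{document}</pre>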

    <p>During the first run of LaTeX the text is completely typeset,
    using ‘??’ in place of the numbers being referred to.  All
    <code>\label</code>s write their identifiers, section and page
    numbers to the <code>.aux</code> file.  At the beginning of the
    second run this file is read, then all references are typeset as if
    the values from the previous run were still correct, and new values
    are written to the file.</p>
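
    <p>As a rough illustration, a label named <code>sec:intro</code> (a
    hypothetical name) placed in section&nbsp;1 on page&nbsp;1 would
    traditionally leave a line like the following in the
    <code>.aux</code> file.  Newer LaTeX releases and packages such as
    <code>hyperref</code> store further fields, so the exact form
    varies:</p>
    <pre>\newlabel{sec:intro}{{1}{1}}</pre>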

    <p>This shows an important feature of TeX&nbsp;–
    typesetting words, breaking paragraphs into lines, joining lines
    into pages, and outputting pages are done asynchronously.  To know
    the page number of a given <code>\label</code>, LaTeX uses the
    primitive TeX command <code>\write</code>, which evaluates the
    appropriate commands during page output.  This makes it impossible
    to change the text depending on the current page number, so the
    number from the previous run must be used instead.</p>
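
    <p>The delayed write can be sketched as follows.  This is a
    simplified illustration modeled on the classic kernel definition of
    <code>\label</code>, not the real implementation; the name
    <code>\mylabel</code> and the label <code>sec:example</code> are
    hypothetical:</p>
    <pre>\documentclass{article}
\makeatletter
% \mylabel is a hypothetical, simplified stand-in for \label.
\newcommand{\mylabel}[1]{%
  % \protected@write delays the expansion of \thepage until the page
  % containing this point is shipped out, so the page number is right.
  \protected@write\@auxout{}%
    {\string\newlabel{#1}{{\@currentlabel}{\thepage}}}%
}
\makeatother
\begin{document}
\section{Example}\mylabel{sec:example}
% The .aux file now records the section and page number of this point.
Some text.
\end{document}</pre>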

    <p>The same method is used for other things, like tables of
    contents, bibliographic references, indices and the correct placement
    of margin notes in two-sided documents by the
    <code>mparhack</code> package.</p>

    <p>However, using multiple passes for cross references has several
    disadvantages.  The most visible one is that the time needed to
    produce correct output is several times longer, although only one
    output file is needed and error messages are useful for only one
    pass (this can be improved by using <code>\batchmode</code> for
    all but the first run of LaTeX and pdfTeX’s <code>-draftmode</code>
    option for all but the last run).  Even so, it is clear
    that most of the work done before the last pass is unnecessary
    (especially when referring only to numbers of sections, not
    pages).</p>

    <p>Since the same files are modified in each pass, it is difficult
    to use <code>make</code> or another generic build system optimally
    with LaTeX.  This leads to longer processing than necessary and
    to uncertainty about whether the document has outdated references.</p>

    <p>It is even possible to make a document whose references are
    always incorrect.  This document shows the problem:</p>
    <pre>\documentclass{minimal}

\pagestyle{empty}
\pagenumbering{roman}

\setlength{\textwidth}{8pt}
\setlength{\textheight}{10pt}
\setlength{\parindent}{0pt}

\makeatletter
\@ifundefined{pdfpagewidth}{}{%
  \pdfpagewidth=2in
  \pdfpageheight=2in
}
\makeatother

\begin{document}
\setcounter{page}{9}
\pageref{x}\hspace{0pt}i\label{x}
\end{document}</pre>
    <p>(The part with <code>\pdfpagewidth</code> makes it easier
    to see both pages at once in a <abbr title="Portable Document
    Format">PDF</abbr> viewer; I have described it in <a href="http://blog.mtjm.eu/2010/05/making-pdfs-of-correct-paper-size-with-latex.html" title="Making PDFs of correct paper size with LaTeX">a previous
    post</a>.)</p>

    <p>From the second pass on, this document will oscillate between having one
    and two pages.  When the reference is typeset as page&nbsp;x, the
    ‘i’ is on page&nbsp;ix, but when the reference is typeset as
    page&nbsp;ix, the wider numeral pushes the ‘i’ onto a second line and so onto page&nbsp;x.  (Leslie Lamport states in
    <em>LaTeX: A Document Preparation System</em> that using Roman numerals may lead to this problem; I did
    not know any specific example of such a document before writing the
    above one.)</p>

    <p>So despite being very useful, automatic cross references in
    LaTeX have some disadvantages.  Usually a good enough solution is to run LaTeX on a document a few more times than may be necessary, and to change the text to avoid infinite loops in this process.  Could it be improved?
    I’ll write about some other ways to avoid these problems in a separate post.</p>
]]>
        
    </content>
</entry>

<entry>
    <title>Making PDFs of correct paper size with LaTeX</title>
    <link rel="alternate" type="text/html" href="http://blog.mtjm.eu/2010/05/making-pdfs-of-correct-paper-size-with-latex.html" />
    <id>tag:blog.mtjm.eu,2010://2.69</id>

    <published>2010-05-20T14:51:26Z</published>
    <updated>2010-05-20T15:39:57Z</updated>

    <summary>One of the nicest features of LaTeX is having good layout by default. For many English uses the standard document classes are appropriate, for some non-English languages there are good adaptations for local typographic conventions (like mwcls for Polish texts)....</summary>
    <author>
        <name>Michał Masłowski</name>
        
    </author>
    
        <category term="LaTeX" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="pdf" label="PDF" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.mtjm.eu/">
        <![CDATA[<p>One of the nicest features of LaTeX is having a good layout by default. For many English uses the standard document classes are appropriate; for some non-English languages there are good adaptations for local typographic conventions (like <a href="http://tug.ctan.org/cgi-bin/ctanPackageInformation.py?id=mwcls" title="CTAN: mwcls package information"><code>mwcls</code></a> for Polish texts). As necessary for internationalized use, they support many paper sizes. Unfortunately, if a <abbr title="Portable Document Format">PDF</abbr> is generated from such a document, viewers treat it as a Letter or A4 page even if its size is completely different.</p>
<p>The <a href="http://www.tex.ac.uk/cgi-bin/texfaq2html?label=papergeom" title="Getting the right paper geometry from (La)TeX">UK TeX FAQ</a> explains this problem and suggests several solutions&nbsp;– all of which are packages or document classes using their own page layouts. Since the document classes which I use already have appropriate layout, I decided to solve this problem differently.</p>
<p>As stated in the FAQ and the pdfTeX manual (page&nbsp;20, the <code>texdoc pdftex</code> command in TeX Live shows this documentation), the <code>\pdfpagewidth</code> and <code>\pdfpageheight</code> dimensions are used to set the PDF page size. LaTeX stores the same data in <code>\paperwidth</code> and <code>\paperheight</code> (<code>classes.pdf</code> page&nbsp;3 as available in TeX Live). Manuals of XeTeX and LuaTeX show that these TeX implementations also support these commands, so with any of them the following code will set the appropriate page size:</p>
<pre>\pdfpagewidth=\paperwidth
\pdfpageheight=\paperheight
</pre>
<p>This code won’t work with TeX implementations without these commands. When these commands are undefined it will produce some error messages and output some text. The simplest solution is to not use these commands when they are not available, for example with the following code:</p>
<pre>\makeatletter
\@ifundefined{pdfpagewidth}{}{%
  \pdfpagewidth=\paperwidth
  \pdfpageheight=\paperheight
}
\makeatother
</pre>
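<p>As a rough illustration, the guarded code could sit in a preamble like the following sketch. The <code>a5paper</code> option and the body text are assumptions chosen only to make the effect easy to check in a viewer, and the size is set only by an engine that defines <code>\pdfpagewidth</code> (such as pdfTeX):</p>
<pre>\documentclass[a5paper]{article}

\makeatletter
% Copy the paper size chosen by the class into the PDF page size,
% but only if the engine knows \pdfpagewidth.
\@ifundefined{pdfpagewidth}{}{%
  \pdfpagewidth=\paperwidth
  \pdfpageheight=\paperheight
}
\makeatother

\begin{document}
A PDF viewer should report this page as A5.
\end{document}
</pre>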
<p>The <a href="http://www.tex.ac.uk/cgi-bin/texfaq2html?label=ifpdf" title="Am I using PDFTeX?">UK TeX FAQ</a> suggests using the <code>ifpdf</code> package for detecting PDF support, which would be too specific here&nbsp;– support for XeTeX would require using another package. It is also not a problem to use the above code when making a <abbr title="Device independent">DVI</abbr> file: it would not make DVIs with the correct page size, but setting this would depend on the DVI driver, and PDF-producing TeX variants have other useful advantages.</p>]]>
        
    </content>
</entry>

<entry>
    <title>Formatting dates in LaTeX with Babel</title>
    <link rel="alternate" type="text/html" href="http://blog.mtjm.eu/2010/02/formatting-dates-in-latex-with-babel.html" />
    <id>tag:blog.mtjm.eu,2010://2.68</id>

    <published>2010-02-24T11:24:28Z</published>
    <updated>2010-02-24T12:06:10Z</updated>

    <summary>A nice thing about LaTeX is support for automatic date formatting on article’s title pages using the \today macro. It even supports many languages (using Babel or completely language-specific packages). Unfortunately, (as the name suggests) it typesets only the current...</summary>
    <author>
        <name>Michał Masłowski</name>
        
    </author>
    
        <category term="LaTeX" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="localization" label="localization" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.mtjm.eu/">
        <![CDATA[<p>A nice thing about LaTeX is its support for automatic date formatting on articles’ title pages using the <code>\today</code> macro. It even supports many languages (using Babel or completely language-specific packages). Unfortunately (as the name suggests), it typesets only the current date, while all other dates must be written ‘by hand’.</p>
<p>This would not be an appropriate way of writing, e.g., certificates of attendance for a Polish <abbr title="International Baccalaureate">IB</abbr> school, where on each page many dates chosen by the school are written in both British English and Polish. I wanted to write only a single date in a simple format, like <code>2010/02/24</code>.</p>
<p>The first solution which I used was a complicated package parsing this date format into separate macros for year, month and day (it also removed leading zeros), and then typesetting them using separate macros for different languages, with 12&nbsp;macros per language for month names. It had at least the following problems:</p>
<ul>
<li>difficult to understand or maintain macros</li>
<li>many per-language macros</li>
<li>the output was very similar to the <code>\today</code> macro.</li>
</ul>
<p>Therefore I have rewritten this package today, using the <code>\today</code> macro after changing the ‘current’ date.</p>
<p>TeX stores the current date in three parameters: <code>\year</code>, <code>\month</code> and <code>\day</code> (<em>The TeXbook</em>, page&nbsp;273). This trivial macro sets them from a date written as <code>2010/02/24/</code> (note the trailing slash):</p>
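<pre>% Needs \makeatletter (or package/class code) because of the @ in the name.
% Arguments are delimited by slashes: year/month/day/
\def\@setcurrentdate#1/#2/#3/{%
  \year=#1%
  \month=#2%
  \day=#3%
  \relax
}
</pre>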
<p>The interface of this macro is strange, but it looks like a simple way of separating parts of the input date. The date-formatting macro will expand its second argument and add the trailing slash (so the whole day will be used, instead of only its first digit) by <code>\expandafter\@setcurrentdate#2/</code>.</p>
<p>Babel provides at least two macros for changing languages: <code>\foreignlanguage</code> and <code>\selectlanguage</code>. The first one has a more appropriate interface, but does not change the date format (as stated on page&nbsp;6 of the Babel manual), so I used the second one.</p>
<p>The whole macro formatting the date takes two arguments&nbsp;– the language name as used by Babel and the date in slash-separated format. The macro is:</p>
<pre>\newcommand{\Date}[2]{{%
    \selectlanguage{#1}%
    \expandafter\@setcurrentdate#2/%
    \today
  }}
</pre>
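<p>A minimal sketch of how these two macros might be used in a complete document. The choice of the <code>polish</code> and <code>british</code> Babel options and of the T1 font encoding is my assumption for this example, and the macros are placed directly in the preamble here instead of in a package or class:</p>
<pre>\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[polish,british]{babel}

\makeatletter
% Set TeX's date parameters from year/month/day/
\def\@setcurrentdate#1/#2/#3/{%
  \year=#1%
  \month=#2%
  \day=#3%
  \relax
}
% #1: Babel language name, #2: date written as year/month/day
\newcommand{\Date}[2]{{%
    \selectlanguage{#1}%
    \expandafter\@setcurrentdate#2/%
    \today
  }}
\makeatother

\begin{document}
% The same date, formatted by each language's \today.
\Date{british}{2010/02/24}

\Date{polish}{2010/02/24}
\end{document}
</pre>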
<p>(The name is capitalized, since I use it only in some document classes.) I’m not sure if specifying the language for each date is useful (dates are usually in the same language as the surrounding text), but this argument could be trivially removed in packages meant for wider use.</p>
<p>I haven’t seen this way of extending LaTeX macros in other code, but it looks like a useful advantage of mutable parameters for ‘constants’ like the current time.</p>]]>
        
    </content>
</entry>

<entry>
    <title>Using a blog software without server-side scripting</title>
    <link rel="alternate" type="text/html" href="http://blog.mtjm.eu/2009/12/using-a-blog-software-without-server-side-scripting.html" />
    <id>tag:mtjmblog.nfshost.com,2009://2.19</id>

    <published>2009-12-29T18:09:41Z</published>
    <updated>2010-02-13T21:37:55Z</updated>

    <summary>The software previously used on this blog (Zine) keeps the text in a relational database (in this case PostgreSQL) and on each request formats a page using this data and some Python code. Most popular blog software uses exactly the...</summary>
    <author>
        <name>Michał Masłowski</name>
        
    </author>
    
        <category term="WWW" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="blogsoftware" label="blog software" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="serversidescripting" label="server-side scripting" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.mtjm.eu/">
        <![CDATA[<p>The software previously used on this blog (<a href="http://zine.pocoo.org/" title="Zine website">Zine</a>) keeps the text in a relational database (in this case <a href="http://www.postgresql.org/" title="PostgreSQL website">PostgreSQL</a>) and on each request formats a page using this data and some Python code. Most popular blog software uses exactly the same paradigm, although most programs are not written in Python and many use only MySQL for the database.</p>
<p>The problem with this solution is that the same pages are generated many times, and much simpler software could be used for this. On a typical blog, which is viewed much more often than it is updated, identical pages are generated multiple times, using many times more resources than necessary. Therefore hosting of typical blogs could be much simpler and cheaper than is possible with such technology.</p>
<p>Therefore very large sites (or sites using very large software) use separate caching servers, like <a href="http://varnish.projects.linpro.no/" title="Varnish website">Varnish</a>. Such programs get a page from the original server and keep it for some time, serving it much faster for subsequent requests without regenerating the page. This solution still does not work well for sites which differ for every user (so usually the cache is skipped for 'non-anonymous' users), and it is difficult to avoid serving outdated pages from the cache (at least while the cache is used for unchanged pages). (Another problem is that another daemon must be running on the server, allowing friendly 503&nbsp;<abbr title="Hypertext Transfer Protocol">HTTP</abbr> errors when one of the daemons serving the page does not work.)</p>
<p>All problems of such caching servers could be avoided by correctly generating static files for each page when something is changed. If the files are generated on a different machine than the server, then the generator could be written in completely different, maybe better, ways than the software used on the server. The pages would be uploaded to the server, and a simpler HTTP server would send them much faster than with any other solution.</p>
<p>This solution would clearly require generation of all uploaded files by a single user with access to the whole site. So there won't be any multiuser features, there will be no search, and the site will not depend on the current time (this is used for relative, friendly dates in texts like 'written five hours ago'; this could be easily done client-side using JavaScript). Still, these things look uncommon on a typical blog written by a single user.</p>
<p>The problem is that pages of a simple blog usually depend not only on their own content and other posts but also on comments posted by users. A server-side script is necessary to get the comments, but it won't be a problem since this is exactly what such scripts are for. There are two possible things to do with the comments obtained by the script&nbsp;- adding them to the post page or putting them in a private place from which the user would move the comments to be published on the next update. The first solution requires making the page 'less static', but the server-side code would still be much simpler than usual. The second solution is also useful because of the useless spam (ignored by readers, search engines and writers alike) sent as comments by malevolent bots.</p>
<p>Therefore such 'offline' blog software would work well enough for small sites. It would also be able to do things which are too slow to be done by advanced server-side scripts, for example checking if the example source code written in a post about the C&nbsp;programming language can be compiled (and maybe even run to generate the output shown in the post). Such software improvements could make it easier to write higher-quality posts. Also, as a locally used program, it could be more user-friendly than Web-based solutions (it could e.g. allow using standard Unix tools to correct a typo in many posts, or use more helpful editors than those available in a browser).</p>
<p>Maybe it is worth writing such a program (or finding and using an existing one). Certainly, converting an existing blog to use such software would not be trivial (e.g. to avoid duplicate entries in feed readers and to import all useful data like comments), but it might be worth the effort.</p>]]>
        
    </content>
</entry>

<entry>
    <title>Comparing performance of common Unix shells</title>
    <link rel="alternate" type="text/html" href="http://blog.mtjm.eu/2009/11/comparing-performance-of-common-unix-shells.html" />
    <id>tag:mtjmblog.nfshost.com,2009://2.13</id>

    <published>2009-11-15T14:29:00Z</published>
    <updated>2010-09-16T12:53:38Z</updated>

    <summary>Probably most programs written by a user of a GNU/Linux operating system are scripts interpreted by programs inspired by the Bourne shell. Although most of their work is either interactive (so most probably faster than a human can see) or...</summary>
    <author>
        <name>Michał Masłowski</name>
        
    </author>
    
        <category term="programming" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="shellscript" label="shell script" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.mtjm.eu/">
        <![CDATA[<p>Probably most programs written by a user of a <acronym title="GNU’s not Unix">GNU</acronym>/Linux operating system are scripts interpreted by programs inspired by the Bourne shell. Although most of their work is either interactive (so most probably faster than a human can see) or done by efficient C&nbsp;programs, it would be interesting to compare how the choice of shell affects the time needed to run some scripts.</p>
<p>Most GNU/Linux systems use <a href="http://tiswww.case.edu/php/chet/bash/bashtop.html" title="The GNU Bourne-Again SHell">Bash</a> as their only shell. This is different on <acronym title="Berkeley Software Distribution">BSD</acronym> derivatives like <a href="http://www.freebsd.org/">FreeBSD</a>, which uses <a href="http://www.in-ulm.de/%7Emascheck/various/ash/" title="Almquist Shell"><code>ash</code></a> for scripting and <a href="http://www.tcsh.org/Welcome"><code>tcsh</code></a> (a C&nbsp;shell derivative with largely different syntax than other shells) for interactive use.</p>
<p>Most shell scripts do not use specific features of any shell and need just a mostly-POSIX-compatible shell like <code>dash</code> (an <code>ash</code> derivative) or Bash. Therefore they specify <code>/bin/sh</code> as their interpreter, which is always such a shell. In most GNU/Linux distributions Bash is used as <code>/bin/sh</code>, while in BSDs <code>ash</code> is used, and Ubuntu and <a href="http://wiki.debian.org/DashAsBinSh" title="Debian Wiki: dash as /bin/sh">Debian Squeeze</a> use <code>dash</code>. As a result, many scripts using Bash-specific features incorrectly declare <code>/bin/sh</code> as their interpreter and fail on Ubuntu or FreeBSD.</p>
<p>Avoiding the above problem by testing scripts with shells having only the features required by POSIX is not the only reason to use non-Bash shells for scripting. The <code>dash</code> shell is faster than Bash; this is why <a href="http://lists.debian.org/debian-release/2007/07/msg00027.html" title="Proposed release goal: Switch to dash as /bin/sh to speed up the boot">it was proposed for Debian Lenny release</a> to use <code>dash</code> as the default shell for scripts.</p>
<p>To check how the speed of different shells differs, I wrote several trivial scripts which can be interpreted by the popular POSIX-like shells. Two of the scripts calculate factorials using different recursive algorithms (one is the ‘standard’ definition used in mathematical textbooks, the other one is the tail-recursive one used in functional programming textbooks), another one calculates elements of the Fibonacci sequence using the recursive definition, and the fourth one just calls the shell about one hundred times to check how slow its initialization is. I haven’t seen a real shell script doing such things, but the ones which I normally use depend mostly on the performance of other programs or use Bash-specific features. Another script calculates the average time spent by each script and shell combination over ten runs (one additional run of each is done before counting, since this needs loading the shell from the disk) and outputs the result in a format which is simple to parse.</p>
<p>I compared six shells available in these Gentoo GNU/Linux ebuilds: <code>sys-apps/busybox-1.15.2</code>, <code>app-shells/bash-4.0_p35</code>, <code>app-shells/dash-0.5.5.1.2</code>, <code>app-shells/mksh-39</code>, <code>app-shells/pdksh-5.2.14-r4</code>, <code>app-shells/zsh-4.3.10</code>. The average times in seconds calculated by the script on the machine which I’m using are:</p>
<table>
<thead>
<tr><th>Script</th><th><code>bb</code></th><th><code>dash</code></th><th><code>bash</code></th><th><code>zsh</code></th><th><code>mksh</code></th><th><code>pdksh</code></th></tr>
</thead>
<tbody>
<tr><td>tail-recursive factorial</td><td>0.12</td><td>0.083</td><td>0.254</td><td>0.23</td><td>0.122</td><td>0.117</td></tr>
<tr><td>standard factorial</td><td>0.106</td><td>0.084</td><td>0.229</td><td>0.242</td><td>0.12</td><td>0.121</td></tr>
<tr><td>Fibonacci sequence</td><td>1.061</td><td>0.801</td><td>2.177</td><td>2.063</td><td>1.044</td><td>1.301</td></tr>
<tr><td>recursive shell invocation</td><td>0.341</td><td>0.269</td><td>0.515</td><td>1.91</td><td>0.378</td><td>0.349</td></tr>
</tbody>
</table>
<p>In all of the above tests <code>dash</code> is the fastest; BusyBox and the Korn shell variants perform similarly to each other, while either Bash or <code>zsh</code> is the slowest. Bash was roughly two to three times slower than <code>dash</code> in these tests.</p>
<p>Of course, real scripts are something completely different. Probably everyone who wants to write functional programs knows more appropriate languages than POSIX shells. Also, extensions of many shells might make them faster for some scripts using them. The main reason for shell scripting is the ease of writing trivial scripts similar to commands written for daily interactive use. Therefore it is more useful to write a simple script and rewrite it in a better language when needed.</p>
<p>The scripts used for the above calculations are available in <a href="http://hg.mtjm.eu/shell-performance-comparison/">my Mercurial repository</a>. The main script is licensed under the GNU General Public License, version&nbsp;3 or later, while the tested scripts are public domain, since I hope that these are too unoriginal to be copyrightable.</p>]]>
        
    </content>
</entry>

<entry>
    <title>Some limitations of popular free Web log analyzer software</title>
    <link rel="alternate" type="text/html" href="http://blog.mtjm.eu/2009/11/some-limitations-of-popular-free-web-log-analyzer-software.html" />
    <id>tag:mtjmblog.nfshost.com,2009://2.23</id>

    <published>2009-11-01T14:25:00Z</published>
    <updated>2010-02-22T17:41:06Z</updated>

    <summary>It is useful for a blogger to know how their site is used. Understanding which information the users are searching for, which sites linked them to it, relationship between post’s popularity and weekday, might help making more useful content. But...</summary>
    <author>
        <name>Michał Masłowski</name>
        
    </author>
    
        <category term="WWW" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="programming" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="loganalyzers" label="log analyzers" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="spam" label="spam" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.mtjm.eu/">
        <![CDATA[<p>It is useful for a blogger to know how their site is used. Understanding which information the users are searching for, which sites linked them to it, and the relationship between a post’s popularity and the weekday might help in making more useful content. But getting such information should not harm the users, i.e. it should not increase the number of useless scripts which they must download, and it should not waste time which the blogger might use to write useful texts or to communicate with others.</p>
<h3>Sources of data</h3>
<p>Most Web servers store in their <em>access logs</em> some data about each request, like the user’s <acronym title="Internet Protocol">IP</acronym> address or the referring page <acronym title="Uniform Resource Locator">URL</acronym>. There are many formats of such data, but they all share three important things:</p>
<ul>
<li>no additional work is done client-side</li>
<li>only data specified in the <acronym title="Hypertext Transfer Protocol">HTTP</acronym> headers is used</li>
<li>all accesses are logged, including those from robots.</li>
</ul>
<p>The problem with such data is that it does not include some information known only to the user’s browser (called more formally a <em>user agent</em>), like screen resolution, support for JavaScript or some useless plugins. Other information coming from the user is trivially forged: malicious bots happily pretend to be real browsers arriving via links from other pages.</p>
<p>A partial solution to this is to use JavaScript code and zero-sized images without caching to get this information from the client. But this requires more requests per page view (especially from different servers, which makes page loading much slower), and it ignores users who disable JavaScript or use browser extensions blocking such code for privacy or performance reasons.</p>
<p>Although none of this is specific to free software, this situation leads to many problems with programs analyzing Web server logs.</p>
<h3>Some uses of the data</h3>
<p>Here are several possible uses of data stored in the access logs:</p>
<ul>
<li>finding which topics are popular and worth expansion</li>
<li>comparing posts with search keywords leading users to them, maybe they could be more useful for common visitors</li>
<li>blocking access for bots which do not benefit potential users and waste bandwidth</li>
<li>finding other blogs linking to the site, they might have useful information on similar topics</li>
<li>comparing effectiveness of different posting schedules</li>
<li>finding possible problems, like broken incoming links</li>
<li>determining how specific browsers or operating systems are popular among the readers</li>
</ul>
<p>All of these might be used to make the site more useful. The programs should make it easy, but it is not as simple as it seems.</p>
<h3>How spam makes it difficult</h3>
<p>For most uses only data about human visitors is helpful. Data about bot visits is needed only to block unfriendly bots or to correct technical problems.</p>
<p>The problem is that only the useful bots want to be identified as bots. The ones which send spam, copy content to spam sites, collect mail addresses to send spam to, etc., do not want to be known&nbsp;– this would make it trivial to disallow their visits. So they pretend to use popular Web browsers and use many IP addresses without any clear pattern.</p>
<p>Many spam bots can be easily identified by using identifications of very old browsers (some of which could not access the site due to changes in the Web protocols), or by strange usage patterns like visiting only a single page referring from the same page and not getting any styles or images. They also go to URLs used by insecure Web applications and pretend to visit from certain sites in hope of getting a link to these sites (it is called <a href="http://en.wikipedia.org/wiki/Referrer_spam" title="Wikipedia: Referrer spam"><em>referrer spam</em></a>). This spam is useless in most cases, since the referrer URLs are not published on properly written sites excluding ones like password-protected log analyzer reports (with all links marked to be ignored by search engine crawlers). But it still makes the log analyzers less useful.</p>
<h3>Problems of common log analyzers</h3>
<p>One of the most visible things which I observed after visiting the <a href="http://en.wikipedia.org/wiki/List_of_web_analytics_software">Wikipedia list of Web log analyzers</a> is that most of them are very old. Of the ones not using MySQL or PHP, one had its last release in&nbsp;2004, another does not try to ignore visits by bots in the statistics it generates, and using yet another one is the main inspiration for this post.</p>
<p>Clearly, identification of new browsers and operating systems, proper determination of queries from new (or renamed) search engines, and detection of malicious bots require changes in software. So I believe that projects without new releases this year do not detect new things and have problems which make them less interesting to improve.</p>
<p>Another problem is that URLs are usually not unique for a given content, although they should be. This is most common with forum software written in PHP, which uses different URLs for each user. Therefore log analyzers treat each visit from a forum thread as a visit from a different page. This makes lists of referring URLs much less friendly to humans, who are more interested in pages than in their specific URLs.</p>
<p>There are probably no perfect solutions for the spam in statistics, but the programs could vastly decrease its amount by trivial measures.</p>
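<p>One such trivial measure might be to ignore clients which request a single page, give that page as its own referrer and never fetch any styles or images. A minimal Python sketch of this heuristic follows; the format of <code>hits</code>, the list of suffixes and the function name are assumptions, and a real analyzer would need more careful rules.</p>
<div class="syntax"><pre>from collections import defaultdict

# Assumed suffixes of styles and images; a real site may use others.
STATIC_SUFFIXES = (".css", ".png", ".jpg", ".gif", ".ico")


def suspicious_clients(hits):
    """Return the set of clients whose requests look like referrer spam.

    hits is an iterable of (client, url, referrer) tuples.
    """
    by_client = defaultdict(list)
    for client, url, referrer in hits:
        by_client[client].append((url, referrer))
    suspects = set()
    for client, requests in by_client.items():
        pages = {url for url, _ in requests if not url.endswith(STATIC_SUFFIXES)}
        fetched_static = any(url.endswith(STATIC_SUFFIXES) for url, _ in requests)
        self_referred = all(url == referrer for url, referrer in requests)
        # One page, referring to itself, and no styles or images fetched.
        if len(pages) == 1 and not fetched_static and self_referred:
            suspects.add(client)
    return suspects
</pre></div>

<p>Requests from the returned clients could then be excluded from the statistics of referring URLs.</p>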
<h3>Solutions</h3>
<p>There are two methods of solving these problems&nbsp;– correcting an existing program or writing a new one. Most free software log analyzers are written in C, which is better suited to very different programs, or in Perl, which is appropriate for much smaller programs and probably encourages some of their design mistakes, so correcting one of them would be difficult for me. Maybe it would be an interesting learning experience to write another faulty log analyzer?</p>]]>
        
    </content>
</entry>

<entry>
    <title>Six tips for optimization of homework C programs</title>
    <link rel="alternate" type="text/html" href="http://blog.mtjm.eu/2009/10/six-tips-for-optimization-of-homework-c-programs.html" />
    <id>tag:mtjmblog.nfshost.com,2009://2.24</id>

    <published>2009-10-17T10:47:00Z</published>
    <updated>2010-02-13T19:46:08Z</updated>

    <summary>A nice aspect of formal computer science education is that it requires writing useless programs in a low-level programming language like C which pass automated tests with harsh time and memory limits. The following list shows several methods which I...</summary>
    <author>
        <name>Michał Masłowski</name>
        
    </author>
    
        <category term="programming" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="c" label="C" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="optimization" label="optimization" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.mtjm.eu/">
        <![CDATA[<p>A nice aspect of formal computer science education is that it requires writing useless programs in a low-level programming language like C which pass automated tests with harsh time and memory limits. The following list shows several methods which I used to optimize such programs and examines their usefulness for ‘real’ programs.</p>
<h4>1. Don’t optimize uncompleted programs</h4>
<p><a href="http://catb.org/~esr/writings/taoup/html/ch01s06.html#rule_of_optimization" title="The Art of Unix Programming: Rule of Optimization">Eric S. Raymond</a> explains this by the fact that incomplete programs are not completely understood, so a change considered an optimization now may make the program slower once it is completed. In the case of small programs for batch processing of data there is another reason – a slow but easy to understand program can generate output against which the output of optimized versions is compared.</p>
<p>For real programs there is another advantage – a simple prototype might be used to describe the design and allow much faster development than a fast program written in C. Once I wrote a trivial Python script to explain <a href="http://blog.mtjm.eu/2009/9/1/listing-unique-lines-of-an-unsorted-file-or-pipe" title="Listing unique lines of an unsorted file or pipe">a limitation of the standard Unix program <code>uniq</code></a>. It was so simple that within the next few days I decided to implement a <a href="http://blog.mtjm.eu/2009/9/4/implementing-an-unsorted-uniq-called-ununiq" title="Implementing an ‘unsorted uniq’ called ‘ununiq’">complete <code>uniq</code>-like program in C</a>.</p>
<h4>2. Don’t optimize fast enough programs</h4>
<p>In the case of homework the aim is usually to make a program which passes some automated tests. If it is possible to test the program many times, there is no point in optimizing it further once all tests pass within the required time.</p>
<p>In real life programs are used with more than several different input files, so this argument does not apply to non-homework programs. But there <a href="http://catb.org/~esr/writings/taoup/html/ch12s01.html" title="The Art of Unix Programming: Don’t Just Do Something, Stand There!">Moore’s Law</a> makes most optimization pointless.</p>
<h4>3. Measure before any change</h4>
<p>It is very difficult for humans to predict the performance effect of a change in a program. For example, C programs with a loop consisting of billions of pre- or postincrements of a variable saved to another variable produce on my machine assembly code differing only in the order of instructions in the loop. It looks obvious that these programs would run for the same amount of time, but the one with preincrements was about 25% faster on an AMD Phenom processor. In this case the correct solution was to enable compiler optimizations, which removed the whole loop, but in many situations it is better to make small changes and compare their effects on the program’s performance.</p>
<p>In real Unix programs <a href="http://catb.org/~esr/writings/taoup/html/ch12s02.html" title="The Art of Unix Programming: Measure before Optimizing">profilers</a> are used for this, but for programs containing only one function it is appropriate to just measure the time spent running the whole program on a large input.</p>
<p>If a typical input takes one microsecond to process, then it might be necessary to write a script making much larger inputs. Since they would be too large for a human to verify the output, an unoptimized ‘correct’ implementation of the program being improved is useful for generating the expected output for regression testing.</p>
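<p>The idea of checking an optimized version against a slow reference on generated inputs can be sketched in Python. The two functions below are hypothetical stand-ins for the two programs (they list unique lines of unsorted input); for C homework the same comparison would be done on the outputs of two executables.</p>
<div class="syntax"><pre>import random


def reference_unique(lines):
    """Slow but obviously correct: keep the first occurrence of each line."""
    result = []
    for line in lines:
        if line not in result:  # linear search, quadratic overall
            result.append(line)
    return result


def optimized_unique(lines):
    """Faster version under test: remember seen lines in a set."""
    seen = set()
    result = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            result.append(line)
    return result


def random_input(size, seed):
    """Generate a reproducible input too large to check by hand."""
    rng = random.Random(seed)
    return [str(rng.randrange(size // 2 + 1)) for _ in range(size)]


def first_mismatch(sizes=(0, 1, 10, 1000), seeds=range(5)):
    """Return the first (size, seed) where the outputs differ, or None."""
    for size in sizes:
        for seed in seeds:
            data = random_input(size, seed)
            if reference_unique(data) != optimized_unique(data):
                return size, seed
    return None
</pre></div>

<p>Using fixed seeds makes any failing input reproducible, so the mismatch can be examined after each change to the optimized version.</p>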
<h4>4. Don’t use dynamic memory allocation</h4>
<p>As explained by <a href="http://www.joelonsoftware.com/articles/fog0000000319.html" title="Joel on Software: Back to Basics">Joel Spolsky</a>, <code>malloc</code> and similar functions use a slow algorithm to decide which memory is free. Implementations have probably improved since 2001, but allocation is still slow. Fortunately many homework problems specify limits on input size, which may allow using statically allocated arrays for everything.</p>
<p>In real programs <code>malloc</code> is often avoided by ignoring characters after column 2047 of a line or by crashing the program when they occur. An advantage of many <acronym title="GNU’s not Unix">GNU</acronym> programs is that they do not limit input length and dynamically allocate input buffers. Therefore this tip is mostly useless for non-homework problems.</p>
<h4>5. Don’t copy memory</h4>
<p><acronym title="Central Processing Unit">CPU</acronym>s are much faster than main memory: an access which misses the caches can take hundreds of processor cycles. Therefore operations on large amounts of memory become the performance bottleneck of programs without significant <acronym title="Input/Output">I/O</acronym>.</p>
<p>Performance of a program can be improved by using a single buffer for e.g. an input line, instead of copying it many times. In some cases code reading the input might do some calculations on it which will lead to storing much less data in memory. This might be a useful improvement in case of programs performing <em>O</em>(<em>n</em>) algorithms on these data.</p>
<h4>6. Write small and simple programs</h4>
<p>Since memory is slow, modern CPUs store some of it in faster and more expensive caches. Therefore smaller code might be stored completely in cache and avoid slow memory access, as explained by <a href="http://catb.org/~esr/writings/taoup/html/ch12s03.html" title="The Art of Unix Programming: Nonlocality Considered Harmful">Raymond</a>.</p>
<p>This is done mainly by removing optimizations designed for older processors. A nice example of this is the removal of many uses of <a href="http://en.wikipedia.org/wiki/Duff%27s_device" title="Wikipedia: Duff’s device">Duff’s device</a> from the XFree86 X11 server. As stated by <a href="http://lkml.indiana.edu/hypermail/linux/kernel/0008.2/0171.html" title="A Linux Kernel Mailing List post">Theodore Ts’o</a>, ‘by eliminating all instances of Duff's Device from the XFree86 4.0 server, the server shrunk in size by _half_ _a_ _megabyte_ (!!!), and was faster to boot’.</p>]]>
        
    </content>
</entry>

<entry>
    <title>Using Gentoo on a server without C++ compiler</title>
    <link rel="alternate" type="text/html" href="http://blog.mtjm.eu/2009/10/using-gentoo-on-a-server-without-cxx-compiler.html" />
    <id>tag:mtjmblog.nfshost.com,2009://2.12</id>

    <published>2009-10-04T13:48:00Z</published>
    <updated>2010-09-16T12:55:41Z</updated>

    <summary><![CDATA[After reading a post by Diego&nbsp;E. Pettenò about replacing groff with heirloom-doctools leading to having nearly no C++ programs in a Gentoo system, I tried to rebuild all packages on my Gentoo-using server without C++. The main difference is that...]]></summary>
    <author>
        <name>Michał Masłowski</name>
        
    </author>
    
        <category term="Gentoo GNU/Linux" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="c" label="C++" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="ebuilds" label="ebuilds" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.mtjm.eu/">
        <![CDATA[<p>After reading a post by Diego&nbsp;E. Pettenò about <a href="http://blog.flameeyes.eu/2009/09/29/another-c-piece-hits-the-dust">replacing <code>groff</code> with <code>heirloom-doctools</code></a>, which leads to having nearly no C++ programs in a Gentoo system, I tried to rebuild all packages on my Gentoo-using server without C++. The main difference is that my computer compiles all packages itself instead of using another system for this, so I also need all build dependencies of useful packages to compile and work without C++.</p>
<p>I started by adding the <code>nocxx -cxx</code> USE flags to everything except GCC. Rebuilding affected packages showed that <code>app-arch/lzma-utils</code> has programs written in C++ for everything except decompression. So I checked on my unstable Gentoo workstation that its <code>app-arch/xz-utils</code> does not have programs linking to <code>libstdc++</code> (linking to it is nearly equivalent to using C++). Therefore I added <code>app-arch/xz-utils</code> to <code>/etc/portage/package.keywords</code> and installed it. Now files in both the LZMA and XZ formats can be compressed and decompressed without any C++ code.</p>
<p>Since C++ support is assumed to be in every system, the package manager cannot determine which packages use it. Therefore I decided to check which programs link to the C++ standard library. The following shell script lists all files on <code>$PATH</code> which link to libraries containing <code>++</code> in their names:</p>
<div class="syntax"><pre><span class="c">#!/bin/sh</span>

<span class="k">for </span>directory in <span class="sb">`</span><span class="nb">echo</span> <span class="nv">$PATH</span> | sed <span class="s1">'s/:/ /g'</span><span class="sb">`</span>;
<span class="k">do</span>
<span class="k">  for </span>f in <span class="nv">$directory</span>/*
  <span class="k">do</span>
<span class="k">    if </span>file <span class="nv">$f</span> | grep ELF &gt; /dev/null
    <span class="k">then</span>
<span class="k">      </span><span class="nv">list</span><span class="o">=</span><span class="sb">`</span>readelf -dw <span class="nv">$f</span> | grep Shared <span class="se">\</span>
        | sed -r <span class="s1">'s/^.*\[(.*)\].*$/\1/'</span> | fgrep ++<span class="sb">`</span>
      <span class="k">if</span> <span class="o">[</span> <span class="nv">$?</span> <span class="o">=</span> 0 <span class="o">]</span>
      <span class="k">then</span>
<span class="k">        </span><span class="nb">echo</span> <span class="nv">$f</span>
      <span class="k">fi</span>
<span class="k">    fi</span>
<span class="k">  done</span>
<span class="k">done</span>
</pre></div>

<p>Passing the output of the script to <code>xargs qfile -q | sort -u</code> listed packages having these files.</p>
<p>The only necessary packages on my system listed by the above pipe were <code>app-arch/lzma-utils</code> and <code>sys-apps/groff</code>, but they can be easily replaced by <code>app-arch/xz-utils</code> and <code>app-doc/heirloom-doctools</code>.</p>
<p>Then I compiled <code>sys-devel/gcc</code> with the <code>nocxx</code> USE flag and ran <code>emerge -ev --keep-going world</code>.</p>
<p>The next day I saw that the build had failed for 27&nbsp;packages. Some of them, like <code>sys-apps/sed</code>, just had tests failing for completely unrelated reasons. Many other packages, like <code>sys-devel/libtool</code>, have tests using C++. So I recompiled all such packages with <code>FEATURES=-test</code>. Now only 12&nbsp;packages failed.</p>
<p>The rest failed mostly when running the <code>configure</code> script, with messages like this:</p>
<div class="syntax"><pre>checking how to run the C++ preprocessor... /lib/cpp
configure: error: C++ preprocessor "/lib/cpp" fails sanity check
</pre></div>

<p>These were caused by useless checks for a C++ compiler made by older versions of <code>libtool</code>. In the case of <code>dev-libs/popt</code> and <code>sys-apps/shadow</code> updating to the testing version solved this, but for some other packages adding <code>eautoreconf</code> to their ebuilds was necessary.</p>
<p>After these changes the only packages with build errors were <code>dev-libs/apr</code>, <code>net-libs/courier-authlib</code> and <code>net-mail/courier-imap</code>. The first one also fails at the <code>configure</code> check, while the other two use some C++ code which is not installed.</p>
<p>The modified ebuilds were available in my overlay. I haven’t reported any of the problems observed to the Gentoo Bugzilla, since I cannot access it today and the changes would just make compilation slower in all normal uses.</p>]]>
        
    </content>
</entry>

<entry>
    <title>Difficulties of typesetting quote marks in LaTeX</title>
    <link rel="alternate" type="text/html" href="http://blog.mtjm.eu/2009/09/difficulties-of-typesetting-quote-marks-in-latex.html" />
    <id>tag:mtjmblog.nfshost.com,2009://2.14</id>

    <published>2009-09-30T16:08:00Z</published>
    <updated>2010-09-16T12:56:11Z</updated>

    <summary>Probably the most complicated to typeset punctuation marks used in English are the quote marks. Although they should be used for short and simple quotations and other simple fragments of text, they are designed for more arcane uses. This combined...</summary>
    <author>
        <name>Michał Masłowski</name>
        
    </author>
    
        <category term="LaTeX" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="typography" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="logicalmarkuputils" label="logical-markup-utils" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.mtjm.eu/">
        <![CDATA[<p>Quote marks are probably the most complicated punctuation marks used in English to typeset. Although they should be used for short and simple quotations and other simple fragments of text, they are designed for more arcane uses. This, combined with the influence of typewriters, makes typesetting them difficult.</p>
<p>Quote marks are used exactly like parentheses&nbsp;– they delimit a fragment of a sentence. But unlike all other such characters, inner quotation marks are different symbols from the outer ones (unless larger outer delimiters in mathematical formulas count as different symbols; they are the most ‘mainstream’ use of parentheses in parentheses). Another difference is that ‘((’ is easily interpreted in the correct way, while ‘“ needs additional spacing (‘ “).</p>
<p>Another problem is that each language has different quote marks. American English uses double outer quotes and single inner quotes, and British English uses them the other way round. Polish has a low double opening quote and the English double closing quote, with the French quotes as inner ones, although they are rarely used correctly. American English also puts following commas and periods inside the quotes, which obviously leads to <a href="http://catb.org/jargon/html/writing-style.html" title="Jargon File: Hacker Writing Style">problems in programming-related texts</a>.</p>
<p>LaTeX does not solve these problems, but allows direct specification of appropriate symbols. The English quote marks are represented as <code>``</code>, <code>''</code>, <code>`</code> and <code>'</code>, since these are the nearest equivalents on a typical keyboard. The <code>"</code> character is not used and the space between quotes must be specified as <code>\,</code> (it could be specified in the font, but this would require separate sets of fonts for American and British texts, I’m not sure if it could support third level nested quotes in any of these dialects).</p>
<p>It would be interesting to use just the <code>"</code> character and let the software decide which quotes are opening and which are closing. But even without support for nested quotations this would be difficult (if possible) to do correctly in all cases. A naïve algorithm would just begin with an opening quote and then cycle between closing and opening ones. But this won’t correctly interpret quotes in multi-paragraph dialogue, where each paragraph begins with an opening quote (in Polish dashes are used instead of quotes for dialogue and there is no possibility for humans to interpret multi-paragraph dialogue correctly without backtracking). A common mistake in delimiting block quotations with quote marks may result in a paragraph containing only a closing quote mark, so this algorithm cannot be improved by just resetting to an opening quote at each new paragraph.</p>
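<p>The naïve alternating algorithm and its failure can be shown in a few lines of Python. This is only a sketch: the function name is made up and the output uses Unicode quote characters instead of LaTeX input.</p>
<div class="syntax"><pre>def naive_quotes(text):
    """Replace each " by an opening or closing quote, alternating."""
    result = []
    opening = True  # the first quote mark in the text is an opening one
    for character in text:
        if character == '"':
            result.append("\u201c" if opening else "\u201d")
            opening = not opening
        else:
            result.append(character)
    return "".join(result)
</pre></div>

<p>It works for <code>"a" and "b"</code>, but in a two-paragraph speech where both paragraphs begin with a quote mark and only the second one ends with it, the second paragraph gets a closing quote at its beginning and an opening one at its end.</p>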
<p>Emacs uses a different algorithm in the <code>TeX-insert-quote</code> function. It puts opening quotes after whitespace or opening parenthesis. This method could not be implemented in LaTeX, but it can be done in language-specific fonts. But this algorithm fails when quoting spaces or parentheses, like ‘(’, which is commonly done in programming-related texts.</p>
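<p>A rough Python model of this context rule is shown below. It is not the Emacs code: the function name and the exact set of characters in <code>OPENING_CONTEXT</code> are assumptions based only on the description above, and the output uses Unicode quote characters.</p>
<div class="syntax"><pre># Characters after which a quote mark is treated as an opening one
# (assumed from the description: whitespace or an opening parenthesis).
OPENING_CONTEXT = " \t\n("


def contextual_quotes(text):
    """Choose opening or closing quotes from the preceding character."""
    result = []
    previous = " "  # treat the start of the text like whitespace
    for character in text:
        if character == '"':
            result.append("\u201c" if previous in OPENING_CONTEXT else "\u201d")
        else:
            result.append(character)
        previous = character
    return "".join(result)
</pre></div>

<p>The input <code>He said "yes" ("no")</code> is handled correctly, while in <code>the "(" character</code> the second quote mark follows a parenthesis and wrongly becomes an opening quote.</p>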
<p>The only problem which can be easily solved is which quotes to use. I have written a LaTeX package for this, named <code>quoted</code> (available in <a href="http://hg.mtjm.eu/logical-markup-utils/">my Mercurial repository</a>), but it does not support spaces between quote marks of different levels or moving punctuation to the quotation. There are probably many better packages for this, but this will not make a useful document ‘portable’ between e.g. British and American dialects of English, so such packages aren’t very useful.</p>
<p>Since parentheses are similar to quotes, but simpler, maybe a single character in source files could be used for them. In times of typewriters a slash was sometimes used instead of parentheses, since it looks similar. Is it possible to implement a LaTeX macro or virtual font replacing <code>/</code> by a slash or appropriate parenthesis depending on context?</p>]]>
        
    </content>
</entry>

<entry>
    <title>Three HTML elements improving document usability</title>
    <link rel="alternate" type="text/html" href="http://blog.mtjm.eu/2009/09/three-html-elements-improving-document-usability.html" />
    <id>tag:mtjmblog.nfshost.com,2009://2.27</id>

    <published>2009-09-20T14:42:00Z</published>
    <updated>2010-02-13T19:46:08Z</updated>

    <summary>One of the main advantages of the Web is that nearly everyone can use it. The same document may be rendered in very different ways on different devices. This is the reason why HTML, the markup language used for most...</summary>
    <author>
        <name>Michał Masłowski</name>
        
    </author>
    
        <category term="WWW" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="html" label="HTML" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="usability" label="usability" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.mtjm.eu/">
        <![CDATA[<p>One of the main advantages of the Web is that nearly everyone can use it. The same document may be rendered in very different ways on different devices. This is the reason why <acronym title="Hypertext Markup Language">HTML</acronym>, the markup language used for most text on <acronym title="World Wide Web">WWW</acronym>, specifies semantics of documents instead of their appearance. Therefore many tags and attributes in HTML are not visually significant, but they can make it easier to get useful information from the text. This post lists three common things which can be improved with such elements.</p>
<dl>
<dt>acronyms</dt>
<dd><p>Many texts use large numbers of acronyms and abbreviations, but their meaning is not always remembered by the readers. Many acronyms also have more than one meaning, e.g. technical texts about <acronym title="Advanced Micro Devices">AMD</acronym> <acronym title="Graphics Processing Unit">GPU</acronym> support on <acronym title="GNU’s Not Unix">GNU</acronym>/Linux use the DRM acronym in two meanings – one <a href="http://en.wikipedia.org/wiki/Direct_Rendering_Manager" title="Wikipedia: Direct Rendering Manager">very useful</a> and one <a href="http://en.wikipedia.org/wiki/Digital_rights_management" title="Wikipedia: Digital rights [sic] management">very harmful</a>.</p>
<p>The solution is to specify the meaning using the <code>title</code> attribute of the <code>acronym</code> element. In the GPU example it would be:</p>
<div class="syntax"><pre><span class="nt">&lt;acronym</span> <span class="na">title=</span><span class="s">&quot;Direct Rendering Manager&quot;</span><span class="nt">&gt;</span>DRM<span class="nt">&lt;/acronym&gt;</span>
allows more optimal use of modern hardware.
</pre></div>

<p>I use this element for the first use of each acronym in all posts on my blog. The <a href="http://www.w3.org/TR/html401/struct/text.html#h-9.2.1" title="9.2.1 Phrase elements: EM, STRONG, DFN, CODE, SAMP, KBD, VAR, CITE, ABBR, and ACRONYM">HTML 4.01 specification</a> also describes the <code>abbr</code> element used for abbreviations. I’m not sure which one of them should be used in which situation.</p>
</dd>
<dt>link titles</dt>
<dd><p>It is nice to know where a hyperlink leads. Therefore it should be appropriately described, by text and additional information provided using the <code>title</code> attribute. Some sites have readable <acronym title="Uniform Resource Identifier">URI</acronym>s, but they should not be the only information allowing a user to decide if the page linked to is useful. Using the same example as before, a link with a title may be written in HTML as</p>
<div class="syntax"><pre><span class="nt">&lt;a</span> <span class="na">href=</span><span class="s">&quot;http://en.wikipedia.org/wiki/Direct_Rendering_Manager&quot;</span>
   <span class="na">title=</span><span class="s">&quot;Wikipedia: Direct Rendering Manager&quot;</span><span class="nt">&gt;</span>very useful<span class="nt">&lt;/a&gt;</span>
</pre></div>

<p>Many guidelines for using link titles are specified in <a href="http://www.useit.com/alertbox/980111.html" title="Using Link Titles to Help Users Predict Where They Are Going">Alertbox</a> by Jakob Nielsen. The simplest rules to follow when using link titles are to not duplicate nearby information in them and to provide the name of the resource linked to (very useful when the context does not specify this and the URI is numeric).</p></dd>
<dt>definition lists</dt>
<dd><p>In this list it is probably useful to quickly scan the names of its elements and read the more useful ones. Definition lists are formatted for such use. In HTML they are specified using three elements – <code>dl</code> contains the whole list, the elements of which are <code>dt</code> containing the defined term and <code>dd</code> containing the definition (each term may have many definitions and each definition may have many terms).</p>
<p>Unlike the previous elements, definition lists are equally useful in print. This might make them popular and easy to use correctly, although commonly itemized lists with term and definition separated by a dash are used instead. In my opinion the lack of support for definition lists in a popular word processing package contributes to this (fortunately LaTeX and wikis have equally good support for this as for the other types of lists).</p>
<p>They clearly should be used for definitions, but the <a href="http://www.w3.org/TR/html401/struct/lists.html#h-10.3" title="Definition lists: the DL, DT, and DD elements">HTML specification</a> suggests using them for dialogue, although there are <a href="http://24ways.org/2007/my-other-christmas-present-is-a-definition-list" title="24 ways: My Other Christmas Present Is a Definition List">arguments against it</a>.</p></dd>
</dl>
<p>These elements also have a disadvantage – sites without them are probably even less usable for people who correctly specify acronyms, link titles and definition lists.</p>]]>
        
    </content>
</entry>

<entry>
    <title>Making dashes from hyphens in LaTeX</title>
    <link rel="alternate" type="text/html" href="http://blog.mtjm.eu/2009/09/making-dashes-from-hyphens-in-latex.html" />
    <id>tag:mtjmblog.nfshost.com,2009://2.15</id>

    <published>2009-09-14T17:51:00Z</published>
    <updated>2010-09-16T12:59:58Z</updated>

    <summary>Probably many users of LaTeX (including me) learned that dashes and hyphens look differently from texts about LaTeX. Many people, supported by keyboards limited to ASCII with some national and unused characters, write only hyphens, with various spacing around them,...</summary>
    <author>
        <name>Michał Masłowski</name>
        
    </author>
    
        <category term="LaTeX" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="logicalmarkuputils" label="logical-markup-utils" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.mtjm.eu/">
        <![CDATA[<p>Probably many users of LaTeX (including me) learned from texts about LaTeX that dashes and hyphens look different. Many people, supported by keyboards limited to <acronym title="American Standard Code for Information Interchange">ASCII</acronym> with some national and unused characters, write only hyphens, with various spacing around them, instead of dashes. Could a LaTeX user just include such text in their document and have correctly distinguished hyphens and dashes in the output? This post describes an attempt in this direction.</p>
<p>LaTeX already uses the ASCII hyphen character for hyphens, minuses and dashes. If it is used in math mode, then it is a minus. Otherwise, <code>-</code> becomes a hyphen, <code>--</code> an endash and <code>---</code> becomes an emdash. The difference between endashes and emdashes lies only in their appearance; different languages require different ones with different spacing. This is the reason why I wrote the <code>onedash</code> package providing a single command, <code>\dash</code>, for typesetting the correct dash in the language and style of the document.</p>
<p>My new package, <code>hyphdash</code>, makes a dash or hyphen from a single hyphen with correct spacing. Both it, <code>onedash</code> and <code>quoted</code> (an equivalent of <code>onedash</code> for quotes) are available in <a href="http://hg.mtjm.eu/logical-markup-utils/">my Mercurial repository</a>. They are licensed under the GNU General Public License, version&nbsp;3 or later.</p>
<p>Hyphens have two uses&nbsp;– they appear in compound words and in words divided across lines. Fortunately, the second use is done automatically by LaTeX and does not affect writing the package. Compound words do not have any spaces before the hyphen, but in lists like ‘mono- and polycrystals’ they may be followed by a space.</p>
<p>Dashes are sometimes surrounded by equal spaces&nbsp;– like in the British style used on this blog&nbsp;– or set without spaces—like in the American style used in this sentence—or surrounded by unequal spaces. Usually the left space is unbreakable. This package assumes that dashes are surrounded by any normal spaces, i.e. input characters interpreted as spaces by TeX. (Unbreakable spaces appear more arcane than dashes or even inner quote marks, so they are unsupported in input by this package, but the output will have them.) The TeXnical reason for this will be stated later.</p>
<p>My package does nearly all of its work in a macro to which the hyphen, made an active character, is defined (see <a href="http://hg.mtjm.eu/logical-markup-utils/file/tip/README">the packages’ README file</a> for information about using this package). This expands to <code>\relax</code> followed by a normal hyphen if math mode is used (the <code>\relax</code> is probably useful in tables). The same result is obtained in horizontal mode if the current font has a nonpositive space stretch parameter, which probably occurs only for typewriter fonts.</p>
<p>In vertical mode a special dash is used, useful for representing dialogue in Polish texts (English uses quote marks for this, making it easier to determine where a multiparagraph speech ends, and making inner quotes common). Probably no word begins with a hyphen, so this is used. Unix and GNU programs have commandline options beginning with a hyphen, but they are typeset in typewriter type (so there is a special case for it).</p>
<p>The complex part lies in the horizontal mode. Hyphens do not have leading spaces, so they are made if <code>\lastskip</code> does not contain a positive value. So the common incorrect form of dash, <code>alpha- beta</code>, will be kept as a hyphen. In other cases a dash is made. This ignores the possibility of having numeric ranges with endashes, like ‘69–105’. Instead, <code>69-105</code> will use a hyphen and <code>69 - 105</code> will use a much more incorrect (but more likely to appear in the input?) dash. Detecting digits before a hyphen is impossible without making all characters active (this could work only for verbatim typesetting of files, but this does not need dashes), and detecting them after the hyphen won’t distinguish such cases as ‘69–105’ and ‘2-chloro-3-methylpentane’.</p>
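<p>The decision made in horizontal mode can be modelled in Python. This is only a sketch of the rule described above, not the TeX code of the package: the function name is made up, it always outputs an endash, and it ignores math mode, typewriter fonts, the dialogue dash of vertical mode and the spacing added around dashes.</p>
<div class="syntax"><pre>EN_DASH = "\u2013"


def classify_hyphens(text):
    """Turn a hyphen into a dash only when a space precedes it."""
    result = []
    previous = ""  # nothing precedes the first character
    for character in text:
        if character == "-" and previous.isspace():
            # Like a positive \lastskip: there was a space before the hyphen.
            result.append(EN_DASH)
        else:
            result.append(character)
        previous = character
    return "".join(result)
</pre></div>

<p>In this model <code>alpha- beta</code>, <code>69-105</code> and <code>2-chloro-3-methylpentane</code> keep their hyphens, while <code>69 - 105</code> gets a dash.</p>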
<p>There is also another problem&nbsp;– dashes represented by multiple hyphens. The first one will be made a dash, but the following ones (since there is no preceding space) will become hyphens. The standard ligatures for dashes will not be used. Maybe the macro could detect following hyphens, but ‘simple’ solutions like <code>\@ifnextchar</code> used for optional parameters will not work, since the hyphen is a complicated macro instead of a character. Changing catcodes (e.g. redefining the hyphen to a character) will not work with texts changing catcodes. The first hyphen could set a conditional to ignore following ones, but it would be difficult to change it to not ignore hyphens in the next dash. The rest is probably more difficult than these solutions. Therefore only a single hyphen may be used with this package as a dash. The macros <code>\textendash</code> and <code>\textemdash</code> may be used instead of multiple hyphens to make a dash character.</p>
<p>Another problem is hyphens used as minuses for numbers interpreted by TeX. For example, <code>\hspace{-1em}</code> will produce strange results and two error messages. In my opinion the only solutions to this are to not use a hyphen in arguments of such commands (e.g. by using the <code>\hyphen</code> macro or by putting all such things into the preamble), or to redefine each primitive TeX command to change the macro making dashes and hyphens (probably it is impossible to detect where such changes should be made). It is obvious why the first solution is used in the package.</p>
<p>This package also uses an example document as an automatic test to find regressions in future versions (I’ve written <a href="http://blog.mtjm.eu/2009/6/15/testing-a-latex-package-for-logical-quote-formatting" title="Testing a LaTeX package for logical quote formatting">previously</a> about this in other packages for dashes and quotes). From the example I learned about most of the limitations of this package described here.</p>]]>
        
    </content>
</entry>

<entry>
    <title>Finding duplicate files using GNU uniq</title>
    <link rel="alternate" type="text/html" href="http://blog.mtjm.eu/2009/09/finding-duplicate-files-using-gnu-uniq.html" />
    <id>tag:mtjmblog.nfshost.com,2009://2.41</id>

    <published>2009-09-09T16:22:00Z</published>
    <updated>2010-02-13T19:46:09Z</updated>

    <summary>The standard POSIX program uniq can list repeated lines from its input. Its implementation in GNU Coreutils supports listing all occurrences of such lines. How could this be used to list files with the same content located in a given...</summary>
    <author>
        <name>Michał Masłowski</name>
        
    </author>
    
        <category term="programming" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="system administration" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="shellscript" label="shell script" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="uniq" label="uniq" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.mtjm.eu/">
        <![CDATA[<p>The standard <acronym title="Portable Operating System Interfaces">POSIX</acronym> program <code>uniq</code> can list repeated lines from its input. Its implementation in GNU Coreutils supports listing all occurrences of such lines. How could this be used to list files with the same content located in a given directory?</p>
<p>The solution is to give <code>uniq</code>, with appropriate options, a file in which each line represents the content of one of the files to be compared, so that equal lines mean equal contents. Since <code>uniq</code> compares only adjacent lines, this file has to be sorted.</p>
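<p>The adjacency requirement can be seen on a three-line toy input. This is only a minimal sketch; the helper name <code>repeated_lines</code> is made up for the illustration.</p>
<div class="syntax"><pre># Hypothetical helper: print each line occurring more than once on standard input.
repeated_lines() {
    sort | uniq -d
}

# Without sorting, the two "a" lines are not adjacent, so uniq reports nothing.
printf 'a\nb\na\n' | uniq -d

# After sorting they are adjacent and "a" is printed once.
printf 'a\nb\na\n' | repeated_lines
</pre></div>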
<p>Therefore a one-to-one function from files to lines should be used. Cryptographic hash functions are treated as such, although they aren’t – they accept any finite byte sequence as input and output a byte sequence of constant size, so different inputs with the same output must exist. Many hashes are now considered too insecure for important systems (e.g. because it is easy to obtain reverse mappings or different inputs with the same output), but SHA-512 is currently used in such systems. GNU Coreutils has a program called <code>sha512sum</code> which computes this hash for files given on the command line, so it can be easily used for this task.</p>
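<p>Since equal hashes are only treated as proof of equal content, a cautious script could confirm a reported pair byte by byte. The following is a sketch using the standard <code>cmp</code> utility; the function name <code>same_content</code> and the temporary files are assumptions made for the example.</p>
<div class="syntax"><pre># Hypothetical helper: succeed only if both files consist of identical bytes.
same_content() {
    cmp -s "$1" "$2"
}

# Three small temporary files to try it on.
tmp=$(mktemp -d)
printf 'same\n' > "$tmp/a"
printf 'same\n' > "$tmp/b"
printf 'other\n' > "$tmp/c"

if same_content "$tmp/a" "$tmp/b"; then echo "a and b are identical"; fi
if ! same_content "$tmp/a" "$tmp/c"; then echo "a and c differ"; fi

rm -r "$tmp"
</pre></div>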
<p>Each output line of this program consists of the 512-bit hash written in hexadecimal (i.e. 128 hexadecimal digits) and the file name, separated by two characters (normally two spaces). Clearly, it would be useful to know the names of repeated files, not only their SHA-512 hashes, so the whole output of <code>sha512sum</code> will be passed to <code>uniq</code>.</p>
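<p>The length of the hash field can be checked by hashing a short string read from standard input. This is a sketch assuming GNU <code>sha512sum</code>; the helper name <code>hash_digits</code> is hypothetical.</p>
<div class="syntax"><pre># Hypothetical helper: hash standard input and count the characters of the hash.
# cut keeps the first space-separated field, tr drops the trailing newline.
hash_digits() {
    sha512sum | cut -d ' ' -f 1 | tr -d '\n' | wc -c
}

printf 'example\n' | hash_digits    # prints 128
</pre></div>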
<p>The list of files for <code>sha512sum</code> can be generated by <code>find $dir -type f</code> where <code>$dir</code> is any directory. This command outputs the name of each file in this directory or its subdirectories. The test <code>-type f</code> restricts the output to regular files, since they are the only ones with content.</p>
<p>The whole pipe listing names of duplicate files is:</p>
<div class="syntax"><pre>find <span class="nv">$dir</span> -type f -print0 | xargs -0 sha512sum <span class="se">\</span>
    | sort | uniq -w 128 -d --all-repeated<span class="o">=</span>separate <span class="se">\</span>
    | sed <span class="s1">&#39;s/^[0-9a-f]\+ \+//&#39;</span>
</pre></div>

<p>The command <code>xargs</code> passes its input as arguments to <code>sha512sum</code>. Since file names may contain spaces, the options <code>-print0</code> of <code>find</code> and <code>-0</code> of <code>xargs</code> make them separate the file names with zero bytes, so that a file name with spaces is not treated as the names of two files.</p>
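<p>The effect of the zero-byte separators can be observed on a temporary directory containing one file with a space in its name. This is a sketch; the helper <code>count_args</code> is made up for it, and it assumes that the path returned by <code>mktemp -d</code> contains no spaces itself.</p>
<div class="syntax"><pre># Hypothetical helper: count how many separate arguments xargs passes on.
# Any options given to it (such as -0) are forwarded to xargs.
count_args() {
    xargs "$@" printf '%s\n' | wc -l
}

dir=$(mktemp -d)
printf 'x\n' > "$dir/two words.txt"

find "$dir" -type f | count_args              # 2: the name was split at the space
find "$dir" -type f -print0 | count_args -0   # 1: the name stayed whole

rm -r "$dir"
</pre></div>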
<p>The arguments of <code>uniq</code> do the things described before – <code>-w 128</code> limits the comparison to the first 128 characters of each line, i.e. to the hash, <code>-d</code> omits unique lines from the output, and <code>--all-repeated=separate</code> outputs all repeated lines, with groups separated by blank lines (of these three options only <code>-d</code> is required by POSIX and supported by the <code>uniq</code> implementations used in BSDs). The final <code>sed</code> expression removes the hash, which is probably not useful, from the output.</p>
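<p>The GNU options can be tried on a few fake lines in which four-character strings stand in for the 128-digit hashes. This is a sketch requiring GNU <code>uniq</code>; the function name <code>duplicate_groups</code> is hypothetical.</p>
<div class="syntax"><pre># Hypothetical helper: compare only the first $1 characters of each (sorted)
# line and print every member of each group of repeated lines.
duplicate_groups() {
    uniq -w "$1" --all-repeated=separate
}

printf '%s\n' 'aaaa  one' 'aaaa  two' 'bbbb  three' 'cccc  four' 'cccc  five' \
    | duplicate_groups 4
# Output:
#   aaaa  one
#   aaaa  two
#   (blank line)
#   cccc  four
#   cccc  five
</pre></div>
<p>The line with <code>bbbb</code> is dropped because no other line shares its first four characters.</p>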
<p>Here <code>sort</code> is used only because of the way in which <code>uniq</code> works. Having different files sorted according to their SHA-512 sums doesn’t look useful. It might be useful to have the files in each duplicate group sorted alphabetically, but this could probably be done faster, since multiple sorts of small sequences are faster than a single sort of their concatenation (here the output will also probably be much smaller than the input – on my system running the above pipe on <code>/usr/share/man</code> gave only about one third of the number of lines output by <code>find /usr/share/man -type f</code>, including the extra blank lines). Programs like my <a href="http://blog.mtjm.eu/2009/9/4/implementing-an-unsorted-uniq-called-ununiq" title="Implementing an ‘unsorted uniq’ called ‘ununiq’"><code>ununiq</code></a> find all repeated lines of an unsorted input, but <code>ununiq</code> does not support the options necessary for this task.</p>]]>
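<![CDATA[<p>Grouping without any sorting could be sketched with an associative array in <code>awk</code>. This is not <code>ununiq</code> and not the pipe above, only an illustration; the function name <code>dupgroups</code> is made up, the order of the printed groups is unspecified, and the hash is assumed to start in the first column of each line.</p>
<div class="syntax"><pre># Hypothetical helper: read "hash  name" lines and print the names of files
# sharing a hash, one group per paragraph.
dupgroups() {
    awk '
        {
            hash = $1
            # The name starts after the hash and its two separator characters.
            name = substr($0, length(hash) + 3)
            count[hash]++
            names[hash] = names[hash] name "\n"
        }
        END {
            for (hash in count)
                if (count[hash] > 1)
                    printf "%s\n", names[hash]
        }
    '
}

# Toy input with two-character stand-ins for the hashes; prints x and z.
printf '%s\n' 'aa  x' 'bb  y' 'aa  z' | dupgroups
</pre></div>
<p>In the pipe above it would take the place of everything after <code>sha512sum</code>, i.e. of <code>sort</code>, <code>uniq</code> and <code>sed</code>.</p>]]>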
        
    </content>
</entry>

<entry>
    <title>Porting ununiq to OpenSolaris</title>
    <link rel="alternate" type="text/html" href="http://blog.mtjm.eu/2009/09/porting-ununiq-to-opensolaris.html" />
    <id>tag:mtjmblog.nfshost.com,2009://2.40</id>

    <published>2009-09-08T16:25:00Z</published>
    <updated>2010-02-13T19:46:09Z</updated>

    <summary>Last week I began writing ununiq – a program listing unique lines of its input, but unlike uniq using a hash table and treating also nonadjacent lines as duplicates. Today I decided to check if it will compile on...</summary>
    <author>
        <name>Michał Masłowski</name>
        
    </author>
    
        <category term="programming" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="autotools" label="Autotools" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="c" label="C" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="gnulib" label="Gnulib" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="opensolaris" label="OpenSolaris" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="uniq" label="uniq" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.mtjm.eu/">
        <![CDATA[<p>Last week I began writing <a href="http://blog.mtjm.eu/2009/9/4/implementing-an-unsorted-uniq-called-ununiq" title="Implementing an ‘unsorted uniq’ called ‘ununiq’"><code>ununiq</code></a> – a program listing unique lines of its input, but unlike <code>uniq</code> using a hash table and treating also nonadjacent lines as duplicates. Today I decided to check if it will compile on an operating system not using the GNU C library – OpenSolaris.</p>
<p>Using virtualization software like KVM or <a href="http://virt-manager.org/">Virtual Machine Manager</a> it is easy to access several free operating systems at once, so (after some networking-related voodoo) I installed OpenSolaris 2009.06 on a virtual machine without any problem. After installation I found that I could not log in to the system with an empty password (I do not see any security advantage of a password on such a virtual machine), so I installed it again with some trivial passwords. I called this new system <em>Laurelin</em>.</p>
<p>Then on Laurelin I installed the newest available versions of GNU Autoconf and Automake, i.e. several releases older than those on Gentoo. GLib was also trivial to install. Then I tried to use <code>autoreconf</code>, but it didn’t work, since, unlike Gentoo, OpenSolaris does not link <code>aclocal-1.10</code>, <code>automake-1.10</code>, etc. to files without version numbers. Most probably there are better solutions, but for this single use it was easier to just call these programs directly.</p>
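<p>One possible alternative to calling each program by hand: <code>autoreconf</code> honors environment variables such as <code>ACLOCAL</code> and <code>AUTOMAKE</code> that name the tools it runs. The fragment below is only a sketch which was not tried on OpenSolaris; it assumes that the versioned programs are in <code>PATH</code> under the names mentioned above.</p>
<div class="syntax"><pre># Tell autoreconf which versioned tools to run (assumed to be found in PATH).
ACLOCAL=aclocal-1.10 AUTOMAKE=automake-1.10 autoreconf --install
</pre></div>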
<p>This showed the first problem – other systems do not have the newest Autotools – so packages should support older versions or provide files generated by newer versions, like <code>configure</code>, <code>Makefile.in</code> and many others. In this case it was easier and probably better to remove unnecessary Automake options introduced in 1.11 and allow Autoconf 2.61.</p>
<p>Then running <code>./configure</code> showed another problem – the shell found syntax errors in unexpanded M4 macros. This was caused by the missing <code>pkg-config</code>, since my package did not include its macros and it wasn’t required by GLib in the OpenSolaris package system. The solution was to install <code>pkg-config</code>. Such problems make source-based operating system distributions more appropriate for programming, since on them such packages are already needed to install GLib.</p>
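<p>Unexpanded macros can be looked for before running the generated script. The following is a heuristic sketch only: the function name <code>unexpanded_macros</code> is made up, and <code>PKG_CHECK_MODULES</code> is used as an assumed example of a <code>pkg-config</code> macro left in the output.</p>
<div class="syntax"><pre># Hypothetical helper: list lines of a generated script which still look like
# M4 macro calls, e.g. PKG_CHECK_MODULES(...). Assignments are not matched.
unexpanded_macros() {
    grep -n -E '^[[:space:]]*(AC|AM|PKG)_[A-Z0-9_]+\(' "$1"
}

# A tiny fake script with one leftover macro and one ordinary assignment.
script=$(mktemp)
printf '%s\n' '#! /bin/sh' 'PKG_CHECK_MODULES(GLIB, glib-2.0)' 'AM_DEFAULT_VERBOSITY=0' > "$script"
unexpanded_macros "$script"    # prints 2:PKG_CHECK_MODULES(GLIB, glib-2.0)
rm "$script"
</pre></div>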
<p>The linker found the next problem, the lack of the <code>getline()</code> function. My <code>configure</code> script did not check for it, since it could not have done anything with the result. Although this function is included in the newest <acronym title="Portable Operating System Interfaces">POSIX</acronym>, the GNU C library treats it as a GNU extension and some other C libraries, like the one in OpenSolaris, do not support it. The solution was to check how bad the portable alternatives are and choose one of them.</p>
<p>Therefore I added to the package an implementation of <code>getline()</code> from <a href="http://www.gnu.org/software/gnulib/" title="The GNU Portability Library">Gnulib</a>. Then the package compiled successfully on both OpenSolaris and Gentoo.</p>
<p>A test suite checked if this package works correctly with a small subset of its possible options and only one input file. Porting software to other systems should begin with writing a more complete test suite, but this still was enough to improve some parts of the program.</p>]]>
        
    </content>
</entry>

<entry>
    <title>Converting LaTeX to HTML</title>
    <link rel="alternate" type="text/html" href="http://blog.mtjm.eu/2009/09/converting-latex-to-html.html" />
    <id>tag:mtjmblog.nfshost.com,2009://2.26</id>

    <published>2009-09-07T16:12:00Z</published>
    <updated>2010-02-13T19:46:08Z</updated>

    <summary>Documents typeset with LaTeX are usually shared electronically in PDF files or printed on dead trees. Since on the Web it is clearly better to use HTML instead of PDF, it may be useful to make these documents available in...</summary>
    <author>
        <name>Michał Masłowski</name>
        
    </author>
    
        <category term="LaTeX" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="html" label="HTML" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="tex4ht" label="TeX4HT" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.mtjm.eu/">
        <![CDATA[<p>Documents typeset with LaTeX are usually shared electronically in <acronym title="Portable Document Format">PDF</acronym> files or printed on dead trees. Since on the Web it is clearly better to <a href="http://www.useit.com/alertbox/20010610.html" title="Avoid PDF for On-Screen Reading">use <acronym title="Hypertext Markup Language">HTML</acronym> instead of PDF</a>, it may be useful to make these documents available in HTML. This post lists the problems associated with such conversion and programs trying to solve these problems.</p>
<h4>Limitations</h4>
<p>The <a href="http://www.tex.ac.uk/cgi-bin/texfaq2html?label=LaTeX2HTML" title="Conversion from (La)TeX to HTML">UK TeX FAQ</a> lists three main problems:</p>
<ul>
<li>exact page formatting cannot be represented in HTML</li>
<li>mathematics can be represented only as bitmaps, tables with symbol font, or in MathML; none of these is supported by every browser and, except for MathML, they are slow due to the amount of data transferred</li>
<li>not all converters support custom macros</li>
</ul>
<p>The first problem makes some TeX documents impossible to represent in HTML in a useful way. For example, the following document’s text depends on its layout:</p>
<div class="syntax"><pre><span class="k">\documentclass</span><span class="nb">{</span>article<span class="nb">}</span>
<span class="k">\usepackage</span><span class="na">[a4paper]</span><span class="nb">{</span>geometry<span class="nb">}</span>
<span class="k">\usepackage</span><span class="nb">{</span>lipsum<span class="nb">}</span>

<span class="k">\begin</span><span class="nb">{</span>document<span class="nb">}</span>
<span class="k">\lipsum</span><span class="na">[1]</span>

<span class="k">\edef\nlines</span><span class="nb">{</span><span class="k">\the\prevgraf</span><span class="nb">}</span>

The previous paragraph has <span class="k">\nlines\ </span>lines.
<span class="k">\end</span><span class="nb">{</span>document<span class="nb">}</span>
</pre></div>

<p>The text changes when the paper size is changed to e.g. A5. Clearly, any fixed text in HTML would either differ from this or be false on at least some systems. Therefore the following programs will be tested just on a typical document, looking slightly like a mathematical book.</p>
<h4>Comparing the programs</h4>
<p>Of those listed at the <a href="http://www.tex.ac.uk/cgi-bin/texfaq2html?label=LaTeX2HTML" title="Conversion from (La)TeX to HTML">UK TeX FAQ</a> four are available in Gentoo GNU/Linux (so probably other operating systems’ package managers also allow them to be installed easily). Since two of them use tables for mathematical expressions, I haven’t used them. The two others are LaTeX2HTML and TeX4HT.</p>
<p>As the UK TeX FAQ states, LaTeX2HTML is a Perl script which does not use TeX to process the files to convert. Therefore it does not support some packages and completely ignores the macros of my <a href="http://blog.mtjm.eu/2009/7/5/typesetting-acronyms-in-latex" title="Typesetting acronyms in LaTeX"><code>acronyms</code></a> package. I could try to read the LaTeX2HTML manual to check if it supports loading LaTeX packages, or if I should write a special version of my package for it. But I thought that it would be simpler to just try TeX4HT.</p>
<p>Although LaTeX2HTML ignores unknown commands, it passes unknown environments to LaTeX. Therefore all theorems in my document are rendered to bitmaps and references to them do not work correctly. Their layout is also strange. It would be possible to change LaTeX2HTML to support such environments, but it would be simpler to try other alternatives first.</p>
<p>TeX4HT differs from most similar programs by using TeX to interpret LaTeX code correctly. Therefore it understood my acronyms macros correctly and it would be possible to write a package which would typeset them differently for HTML. Theorems and the ‘ł’ in my name were typeset correctly (the input file used UTF-8 encoding, rarely supported by programs interpreting LaTeX input).</p>
<p>Mathematical formulas still were typeset (into bitmaps) incorrectly, with missing symbols and some parts converted to HTML (so they were not aligned on the math axis).</p>
<p>Therefore, if I try to convert a LaTeX document to HTML, I will probably use TeX4HT and try to determine why the formulas in my sample document are not converted correctly.</p>]]>
        
    </content>
</entry>

<entry>
    <title>Implementing an ‘unsorted uniq’ called ‘ununiq’</title>
    <link rel="alternate" type="text/html" href="http://blog.mtjm.eu/2009/09/implementing-an-unsorted-uniq-called-ununiq.html" />
    <id>tag:mtjmblog.nfshost.com,2009://2.11</id>

    <published>2009-09-04T19:03:00Z</published>
    <updated>2010-09-16T12:59:09Z</updated>

    <summary><![CDATA[This week I described one specific feature of the standard Unix program uniq&nbsp;– it counts only adjacent identical lines as duplicates, although it would be technically interesting to implement such a program without this restriction while not sorting the input....]]></summary>
    <author>
        <name>Michał Masłowski</name>
        
    </author>
    
        <category term="programming" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="software" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="system administration" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="c" label="C" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="glib" label="GLib" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="uniq" label="uniq" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.mtjm.eu/">
        <![CDATA[<p>This week I described one specific feature of the standard Unix program <code>uniq</code>&nbsp;– it counts only adjacent identical lines as duplicates, although it would be <a href="http://blog.mtjm.eu/2009/9/1/listing-unique-lines-of-an-unsorted-file-or-pipe" title="Listing unique lines of an unsorted file or pipe">technically interesting</a> to implement such a program without this restriction while not sorting the input.  Then I wrote three implementations of such a program <a href="http://blog.mtjm.eu/2009/9/2/comparing-implementations-of-unsorted-uniq-in-c-with-glib-c++-with-qt-and-python" title="Comparing implementations of ‘unsorted uniq’ in C with GLib, C++ with Qt and Python">and compared their performance</a> on unpublished private data based on my blog’s access log.</p>
<p>Today I decided to make one of these three programs a near replacement for <code>sort | uniq</code> for situations where a fast online algorithm which does not change the order of input lines is better than possibly smaller memory use. For this I implemented support for some of <code>uniq</code>’s command line options and I plan to write a man page.</p>
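<p>The basic behaviour can be imitated in one line of <code>awk</code>. This is only a stand-in for comparison, not the C implementation; the function name <code>first_occurrences</code> is hypothetical.</p>
<div class="syntax"><pre># Hypothetical stand-in: print each line the first time it is seen, keeping
# the input order. Every distinct line is kept in memory, which is the
# trade-off against sorting.
first_occurrences() {
    awk '!seen[$0]++'
}

printf 'b\na\nb\nc\na\n' | first_occurrences
# prints b, a and c, one per line
</pre></div>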
<p>I named the program <code>ununiq</code>, since this name is easy to pronounce and is the opposite of <code>uniq</code>. The program’s source is available in <a href="http://hg.mtjm.eu/ununiq/">my Mercurial repository</a>. Like most of my programs it is licensed under the GNU General Public License, version 3&nbsp;or later.</p>
<h4>Choosing programming language</h4>
<p>The choice between three implementations of the basic algorithm&nbsp;– one in C with GLib, one in C++ with Qt and one in Python&nbsp;– was obvious for me. I wanted it to be:</p>
<ul>
<li>faster than <code>sort | uniq</code> (all three programs satisfied this on the test data)</li>
<li>easy to write and maintain</li>
<li>available on all of my computers without installing too many dependencies (so without Qt, my server doesn’t need it).</li>
</ul>
<p>Performance of my program in C with minimal C++ and Qt was worse than the one in C with GLib, so instead of using more C++ I decided to use the program written in C with GLib.</p>
<h4>Command line options of <code>uniq</code></h4>
<p>Before this project I used an option of <code>uniq</code> only once, when a complicated pipe counting the number of secondary schools in each province from a <acronym title="Comma Separated Values">CSV</acronym> file contained <code>sort | uniq -c</code>. Now I know how to use other options of this program.</p>
<p>Now my program supports input and output with named files, and also skipping several leading characters of each line (the <code>-s</code> option). I also implemented the <code>-d</code> option which outputs only repeated lines, but it writes them on their second occurrence. The program would be more complicated if this option preserved the original order, so it currently ignores this order.</p>
<p>The options are accepted both as short options and as GNU long ones. Their support was very easy to implement using the <a href="http://library.gnome.org/devel/glib/stable/glib-Commandline-option-parser.html">GLib command line option parser</a>. Long options have the same names as in GNU Coreutils.</p>
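<p>Both behaviours can be imitated with short <code>awk</code> programs. These are hypothetical sketches with made-up function names, not the code of <code>ununiq</code>; the character-skipping option is called <code>-s</code> in <code>uniq</code>.</p>
<div class="syntax"><pre># Hypothetical sketch of skipping: compare lines after their first $1 characters.
skip_chars_uniq() {
    awk -v n="$1" '!seen[substr($0, n + 1)]++'
}

# Hypothetical sketch of "-d": print a repeated line when its second occurrence is read.
repeated_on_second() {
    awk '++count[$0] == 2'
}

printf '1 x\n2 y\n3 x\n' | skip_chars_uniq 2     # prints "1 x" and "2 y"
printf 'a\nb\nb\na\nb\n' | repeated_on_second    # prints "b", then "a"
</pre></div>
<p>In the second example <code>b</code> is printed before <code>a</code>, although <code>a</code> comes first in the input: the order of first occurrences is not preserved.</p>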
<h4>Future</h4>
<p>Several options which are not implemented yet will require reading the whole input before starting output, e.g. <code>-c</code> which counts how many times a line appears in the input. This should still be faster than <code>sort | uniq</code>. A similar observation can be made when the output of <code>ununiq</code> is sorted: on the <code>perltoc</code> man page <code>ununiq | sort</code> is faster by half than <code>sort | uniq</code>, although both produce the same output.</p>
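<p>Why counting needs the whole input can be seen in a sketch in <code>awk</code>: nothing can be printed before the last line is read, since any line may still raise a count. The function name <code>count_lines</code> is hypothetical and the output is not padded like that of <code>uniq -c</code>.</p>
<div class="syntax"><pre># Hypothetical sketch of "-c" without sorting: count every line, then print
# the counts in the order of first appearance once the input has ended.
count_lines() {
    awk '
        !($0 in count) { order[++n] = $0 }
        { count[$0]++ }
        END { for (i = 1; n >= i; i++) print count[order[i]], order[i] }
    '
}

printf 'b\na\nb\n' | count_lines    # prints "2 b", then "1 a"
</pre></div>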
<p>The program also needs more documentation and testing. It would be nice to have automatic testing of all options, which would also be very helpful for porting to other operating systems.</p>]]>
        
    </content>
</entry>

</feed>

