July 2009 Archives

Common mistakes of authentication on the Web

Today many people use many online services. Each service wants to identify its users, so it needs to check whether a human is using it, and which human it is. But these checks aren’t always done correctly.

Many real-world security systems are designed to be seen by the humans who pay for them. This criterion favours solutions that are difficult for humans over solutions that are difficult for bots, since a human notices the former more easily. A nice example of this is a CAPTCHA. It is clearly a problem for humans: I usually need three tries to correctly read the text from a CAPTCHA.

For bots, CAPTCHAs are not always difficult. Some are designed to be difficult for humans to read, since this is easily mistaken for ‘secure’, while remaining easily readable by bots. The reCAPTCHA sites list several examples of such snake oil CAPTCHAs; I have seen one of them on a site of one of the best-known technical universities in Poland. It wasted human time, but not very much – sometimes it showed the same text as before. Clearly, this wasn’t useful.

A CAPTCHA could be necessary on that site, since it generated easy-to-guess passwords and usernames. Everywhere else I use long passphrases or the output of head -c6 /dev/random | base64, which produces clearly better passwords than the five lowercase letters generated by the technical university. Of course, even a password of five lowercase letters is more secure than the same password sent in an unencrypted email. It is best when users may choose any username and password, as many other universities allow.
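The difference between the two kinds of password can be estimated in bits. The following is only a sketch: it uses Python's secrets module (which reads the operating system's random source rather than /dev/random directly), and the function names are made up for this illustration.

```python
import base64
import math
import secrets


def random_password(num_bytes: int = 6) -> str:
    """Return num_bytes of OS randomness as base64 text.

    Roughly what `head -c6 /dev/random | base64` prints (assumption:
    the OS random source used by `secrets` is an acceptable stand-in).
    """
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


def entropy_bits(alphabet_size: int, length: int) -> float:
    """Entropy of a uniformly random string over the given alphabet."""
    return length * math.log2(alphabet_size)


# Five lowercase letters, as generated by the university's site.
lowercase_bits = entropy_bits(26, 5)

# Six random bytes carry 8 bits each, however they are encoded.
random_bits = 8 * 6
```

Six random bytes give 48 bits, while five lowercase letters give fewer than 24, so the random password is more than twice as strong when measured in bits.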

When the user has a password, they may forget it (or forget where they wrote it down). There are several solutions. Some services let users answer questions which they wrote previously. These questions may be trivial to answer, so I use separate outputs of head -c12 /dev/random | base64 as the question and the answer (a question of 16 random characters answered by another 16 random characters). Other services send emails with a URL that allows changing the password. This is not completely secure, since email is insecure, but it is improbable that someone else will read the email before the URL is used by the correct user.
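The window in which an intercepted reset URL is useful can be kept small by making the token random, short-lived and usable only once. This is a minimal in-memory sketch of that idea, not a description of how any particular service works; the class and method names are hypothetical.

```python
import hashlib
import secrets
import time


class ResetTokens:
    """Hypothetical in-memory store of single-use password reset tokens."""

    def __init__(self, lifetime_seconds: float = 3600):
        self.lifetime = lifetime_seconds
        self._pending = {}  # sha256(token) -> (username, expiry time)

    @staticmethod
    def _digest(token: str) -> str:
        # Only a hash is stored, so a leaked store does not reveal tokens.
        return hashlib.sha256(token.encode("utf-8")).hexdigest()

    def issue(self, username: str, now: float = None) -> str:
        """Create a token to embed in the URL sent by email."""
        now = time.time() if now is None else now
        token = secrets.token_urlsafe(32)
        self._pending[self._digest(token)] = (username, now + self.lifetime)
        return token

    def redeem(self, token: str, now: float = None):
        """Return the username once, or None if unknown, used or expired."""
        now = time.time() if now is None else now
        entry = self._pending.pop(self._digest(token), None)
        if entry is None:
            return None
        username, expiry = entry
        return username if now <= expiry else None
```

A second attempt to use the same URL fails, so the correct user at least notices when someone else got there first.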

Jakob’s Law of the Internet User Experience, stating that ‘users spend most of their time on other websites’, leads to a clear conclusion in this case. The popular ‘solutions’ will remain popular, since people know them. But avoiding the mistakes described in this post should not be a problem for usability – a better CAPTCHA or none is easier to use than a bad one, and people usually enter passwords and use emails to reset passwords (although these emails are probably not read, since usually they work as expected). It is nice that an organization valuing security or usability may improve both with a single decision.

The simplest solution to most computer problems

Today I watched South Park: Bigger, Longer & Uncut. I used a different computer with a DVD player supporting movies from that region, unlike the one in my laptop. The video and music played correctly, but it was nearly impossible to hear any speech. Then, after observing the same problem with another movie, I unplugged the headphones and plugged them in again. Then it worked.

This method may be used to solve many problems with computer hardware or software. There is even a koan about power-cycling a Lisp machine – the designer of the machine said that this method won’t work, but he fixed the machine just by turning it off and on.

Several times the wireless network card of my laptop stopped working until a complete cold boot. Suspend to disk didn’t solve the problem, but turning the machine off, waiting several minutes and turning it on again worked. (I didn’t report this problem to the maintainers of the driver since I don’t know how to reproduce this bug.)

For most network problems on my other computer it is enough to unplug the Ethernet cable and plug it in again. At least twice this didn’t help, but a reboot fixed the problem.

Usually people whom I know do not ask for help when a problem occurs, but reboot the faulty computer and resume their work without observing the problem again. Of course, this wouldn’t help in critical systems, but this is still a reason to design reliable hardware and software instead of these faulty boxes with short uptime.

The advantages of printed over electronic documents

Using printed documents clearly has important drawbacks. They are produced from murdered trees (so more papers about global warming are printed), are difficult to search (maybe except several books with useful indices) and occupy physical space. Also, the printed medium encourages writing useless texts. However, they still cannot be replaced by electronic documents.

PDF and similar file formats represent pages exactly as printed. But they represent a whole document as something other than a sequence of pages. Operations like merging several documents into one, or splitting one into several with some of the pages each, are trivial with printed documents, but commonly used software does not support them (except for printing some pages).

Documents are not the only structure causing problems unknown in printed media – pages also lead to difficulties. For books, small pages with text are put on larger pages to be printed, bound and cut. Therefore, a document has both logical and physical pages, which are different in large documents (reading a two-column article on a screen where only part of the page’s height is visible looks similar to this problem). Also, at least some software for merging logical pages into a physical one tries to render documents in device-dependent ways – making the document unsuitable for viewing on screen, printing, or both.

Another problem with physical pages made of many logical pages is that the user may prefer other combinations. For example, a document with two A4 pages on one physical page is not suitable for users with printers supporting only A4 (if the document contains text, it will be much smaller than expected). Of course, even with a single page there are problems with page scaling. Americans use Letter paper while Europeans use A4. Software assumes a mix of these formats, which will scale pages and add useless whitespace (or crop out some text). A common, but not too harmful, sign of such problems is uneven margins (A4 and Letter have different widths).
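The scaling and the leftover whitespace can be worked out from the paper sizes (A4 is 210 × 297 mm, Letter is 215.9 × 279.4 mm). The sketch below assumes the simplest policy, shrinking or enlarging the page uniformly until it fits the sheet; real printing software may behave differently, and the function name is made up for this example.

```python
# Paper sizes in millimetres: (width, height).
A4 = (210.0, 297.0)
LETTER = (215.9, 279.4)
HALF_A4 = (148.5, 210.0)  # one half of a landscape A4 sheet (2-up layout)


def fit(page, sheet):
    """Scale `page` uniformly to fit on `sheet`.

    Returns (scale, unused_width, unused_height) in the sheet's units.
    """
    scale = min(sheet[0] / page[0], sheet[1] / page[1])
    unused_width = sheet[0] - page[0] * scale
    unused_height = sheet[1] - page[1] * scale
    return scale, unused_width, unused_height


a4_on_letter = fit(A4, LETTER)    # limited by height, extra side margins
letter_on_a4 = fit(LETTER, A4)    # limited by width, extra top/bottom space
two_up = fit(A4, HALF_A4)         # two A4 pages on one A4 sheet
```

An A4 page printed on Letter shrinks to about 94% and leaves roughly 18 mm of extra horizontal space; in the two-pages-per-sheet case the text shrinks to about 71%, which is why it is much smaller than expected.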

The problems with page formats may be solved in two ways – providing the document printed on appropriate paper, or providing PDFs in both formats. Clearly, the second way requires formatting independent of page format, impossible with WYSIWYG software.

There are cases of documents printed, corrected and photocopied for publication (or printed again with new information on the same page). This work could be completely automated with PDFs edited in a programmable way, e.g. using pdfTeX. But this would require changes in the habits of users, an effort which could be better spent on avoiding printable documents, since hypertext is better and does not have these problems.

When should software be localized?

Localization (commonly referred to as ‘l10n’) is the availability of software messages in more than one language. It would be simpler if everyone used one language, but that problem is beyond the scope of this blog, so I limit this post to problems with localization, not its cause.

The most obvious problem is a lack of localization. Most software is written in English, but for many people other languages are more appropriate. But should everything be translated? In my opinion it depends on the software:

templates for text
things like text inserted to documents by LaTeX packages or publicly visible messages of web applications clearly must be in the same language as text (it may be different in multilingual documents)
software with a GUI
using these in non-native languages may be slower for some people, for others it may make the software more difficult (or impossible) to use; hence a translation should be provided (except for Emacs which currently does not support translated messages, the programs with GUIs which I usually use have at least partial Polish translations enabled)
command-line interfaces
now these are probably used only by people who read English documentation or participate in English fora where showing non-English software messages may lead to getting fewer answers, so a translation is less necessary; also a good command-line interface does not use too much text
programming languages
clearly writing a program in languages other than English makes it much less useful, hence it is explicitly stated in coding standards for e.g. the GNU Project (I write all my programs in English except some unpublished LaTeX code for documents written in Polish)

The above opinions assume that the localization is perfect. This is obviously false, but free software using Gettext or Qt does not have the largest problems with localization (e.g. I haven’t used a non-free program with correct support for the three plural forms in languages like Polish).
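The three plural forms follow a rule that depends on the last digits of the number. The sketch below mirrors the plural expression commonly used in the headers of Polish Gettext catalogues; the function names and the example noun (‘plik’, a file) are chosen only for illustration.

```python
def polish_plural_index(n: int) -> int:
    """Return which of the three Polish plural forms applies to n.

    0: singular (1), 1: 'few' (2-4, 22-24, ... but not 12-14),
    2: 'many' (everything else, including 0).
    """
    if n == 1:
        return 0
    if 2 <= n % 10 <= 4 and not 10 <= n % 100 < 20:
        return 1
    return 2


# Forms of the noun 'plik' (file), indexed by the function above.
FORMS = ("plik", "pliki", "plików")


def count_files(n: int) -> str:
    """Format a count with the matching noun form."""
    return f"{n} {FORMS[polish_plural_index(n)]}"
```

A program which only distinguishes ‘one’ from ‘many’ will print wrong text for numbers such as 2, 22 or 103, which is the problem mentioned above.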

My previous post described the elements of a horizontal list in TeX. This one will describe the elements which are broken into pages and some improvements which should now be more feasible than in the 1980s, when TeX was implemented.

Chapter 15 of The TeXbook by Donald E. Knuth explains the page breaking algorithms of TeX and how they may be used to produce beautiful books. The first paragraph (page 109) states that page breaking is much more difficult than line breaking, since ‘pages often have much less flexibility than lines do’. Unlike line breaking, which uses the total-fit algorithm enabling optimal breaks of whole paragraphs, page breaking uses a first-fit algorithm, so only the current page is ‘seen’ by TeX when it selects appropriate breaks. As Knuth explains (page 110), this design difference is based on the unavailability of enough high-speed memory to store several pages. This was certainly true in the 1980s, but now many complete books fit in the modern equivalents of the high-speed memory of elder days.
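The difference between the two strategies can be shown with a toy model. This is not TeX's algorithm: items have fixed sizes without stretch or shrink, there are no penalties, every item is assumed to fit in an empty line or page, and the cost of a break sequence is simply the sum of squared unused space (a stand-in for badness). All names are invented for this sketch.

```python
def first_fit(sizes, capacity):
    """Greedy breaking: fill the current line/page, then start a new one.

    Returns the list of break positions (indices where a chunk ends).
    """
    breaks, used = [], 0
    for i, size in enumerate(sizes):
        if used > 0 and used + size > capacity:
            breaks.append(i)
            used = 0
        used += size
    breaks.append(len(sizes))
    return breaks


def total_fit(sizes, capacity):
    """Dynamic programming over all feasible breaks, minimizing total cost."""
    n = len(sizes)
    best = [0] + [float("inf")] * n  # best[j]: minimal cost of sizes[:j]
    prev = [0] * (n + 1)             # prev[j]: start of the last chunk
    for j in range(1, n + 1):
        used = 0
        for i in range(j - 1, -1, -1):
            used += sizes[i]
            if used > capacity:
                break
            cost = best[i] + (capacity - used) ** 2
            if cost < best[j]:
                best[j], prev[j] = cost, i
    breaks, j = [], n
    while j > 0:
        breaks.append(j)
        j = prev[j]
    return breaks[::-1]


def total_cost(sizes, capacity, breaks):
    """Sum of squared unused space over all chunks."""
    cost, start = 0, 0
    for end in breaks:
        cost += (capacity - sum(sizes[start:end])) ** 2
        start = end
    return cost


sizes, capacity = [3, 3, 2, 5], 6
greedy = first_fit(sizes, capacity)   # fills the first chunk completely
optimal = total_fit(sizes, capacity)  # leaves the first chunk half empty
```

In this example the greedy method produces a perfect first chunk followed by a very poor second one, while the total-fit method accepts a worse first chunk to get a lower cost overall. It needs the whole list in memory, which is the limitation Knuth describes.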

Both vertical and horizontal lists contain boxes, glue, kerns and penalties. I’ve described them previously; there are no interesting differences here except for the direction of typesetting. Whatsits and marks were explained in that post, since they are passed from horizontal lists to vertical ones.

There are two types of material occurring only in one of these two modes. Discretionary breaks exist only in horizontal mode; in vertical mode, output routines do special tricks instead. Insertions are used to put some material in special places on pages (most commonly footnotes, floating tables and figures). Discretionary breaks in vertical lists would probably simplify some things requiring complicated output routines, for example typesetting indices with the entry text repeated on pages beginning with subentries (a solution using marks is explained in The TeXbook, pages 261–263).

The output routine is one of the new ideas in TeX. It allows nearly arbitrary modifications of the page produced from the vertical list, to a box which is shipped out to an output file. Output routines allow things like multicolumn typesetting, special headers and footers, footnotes and correctly floating figures.

An output routine is very useful in vertical mode, so would something similar be useful in horizontal mode? Lines are just boxes of a certain width and shift (chosen by e.g. \parshape), with special glue on both sides (to allow e.g. ragged-right typesetting) and content determined by the total-fit algorithm (pdfTeX also adds margin kerning). It would be interesting to have an arbitrary TeX token list produce such boxes. It would probably make things like line counting or repeated opening quote marks simpler. It could also determine how nice the line is and possibly change it according to the number of previous lines. Is there a nice TeX solution for typesetting the first line of a paragraph in small caps? According to a TUG interview with Werner Lemberg it is simple in Troff. The ‘line routine’ would make it simple in a TeX-like system.

The line routine would determine the badness of a line; clearly, ragged-right text has different optimal breaks than justified text (compare the broken LaTeX ragged text commands with normal justified text; use the ragged2e package instead). In vertical mode the badness of a page break is determined before calling the output routine, but the routine may decide to change the break. Wouldn’t this be simpler with an output routine called for each feasible break to determine the badness of this break?

There are two possible solutions to the problems of the current page breaking algorithm. One is total-fit page breaking, which would also make a typesetting system simpler (the same total-fit algorithm could be used for both lines and pages). The other one is better cooperation between line breaking and page breaking (proposed at least once for the NTS, the project which led to e-TeX). Maybe if badness was calculated for a chapter as a whole, things like adjusting \looseness by hand to prevent bad page breaks would be automated in a way not possible with TeX?

Horizontal lists in TeX

One of the most important typesetting ideas on which TeX is based is the box/glue/penalty model. It is used both to break paragraphs into lines, and to break lines into pages. Since these processes are similar, lines and pages have similar representations. The aim of this post is to describe how material of a paragraph is represented.

The TeXbook by Donald E. Knuth lists the elements of a horizontal list (the material which is broken into lines and put in horizontal boxes) in Chapter 14, page 94:

  • boxes
  • discretionary breaks
  • whatsits
  • vertical material
  • glue
  • kerns
  • penalties
  • math-on and math-off

Boxes do not need much explanation: they are the visible elements of texts, usually glyphs, rules or their combinations (e.g. a table is usually a box made from simpler boxes). Glue and kerns make whitespace between them. Discretionary breaks allow breaking lines in more complicated ways than just removing whitespace. Penalties control how bad the breaks are. These elements have a clear use for the line breaking algorithm. They are the only elements of a horizontal list that I’ve directly met in LaTeX.

Math-on and math-off are the additional whitespace made by \mathsurround. They differ from kerns by not allowing breaking on glue or kerns inside math formulas. So in a new typesetting system they probably could be replaced by a kern and infinite penalties at appropriate places inside the formula.

Glue and kerns look similar (on paper they are the same, white areas between glyphs), but they have two main differences – glue is stretchable and separates words (for automatic hyphenation), while kerns do not change their size and make words unhyphenable. There are two types of kerns – explicit ones, which are put directly by the \kern primitive, and implicit ones, which are inserted automatically and do not affect hyphenation.

In all of the above differences between glue and kerns, explicit kerns look similar to empty boxes. But there are two important differences – boxes also have vertical dimensions (useful to make proper vertical spacing in tables) and they are not discardable, so a box cannot be removed at a line break while a kern is removed there (imagine a justified paragraph with empty boxes at the beginnings or endings of lines: it would be ragged). This is a nice example of how different the elements of a horizontal list are – every one of them is useful, and none of them can be completely replaced by any other.
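The distinction between discardable and non-discardable items can be sketched as a small data model. This is a simplification of what TeX does, with made-up class and function names: it only models the rule that glue, kerns and penalties directly following a break are dropped, while boxes (even empty ones) stay.

```python
from dataclasses import dataclass


@dataclass
class Box:
    width: float
    height: float = 0.0
    depth: float = 0.0


@dataclass
class Glue:
    natural: float
    stretch: float = 0.0
    shrink: float = 0.0


@dataclass
class Kern:
    width: float


@dataclass
class Penalty:
    value: int


# Item types which vanish when they directly follow a break.
DISCARDABLE = (Glue, Kern, Penalty)


def start_of_next_line(items):
    """Drop discardable items up to the first non-discardable one."""
    for index, item in enumerate(items):
        if not isinstance(item, DISCARDABLE):
            return items[index:]
    return []


after_break = [Glue(3.3, 1.6, 1.1), Kern(2.0), Box(0.0), Box(5.0)]
line_start = start_of_next_line(after_break)
```

Here the glue and the kern disappear, but the empty box remains at the start of the line, which is why an empty box cannot stand in for a kern.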

Vertical mode material is put in a horizontal list to be placed between lines produced from the list. This may be used e.g. to put a page break after the current line when it is not known where the line ends. It is also used for marks, which are token lists put in the page; the output routine (more on this later) will access some of them. Similarly, whatsits are used when a page is produced, but after the output routine. They are used to write page numbers to files (necessary to make an index), to make right-to-left text in e-TeX, and to give DVI drivers special commands, e.g. to change the colour of text or to make a hyperlink.

Typesetting acronyms in LaTeX

This post describes some problems related to using acronyms in typeset text and some solutions to them in the form of LaTeX packages. It does not explain acronyms related to typesetting.

Problems

I’ve noticed three main problems of acronyms:
their meanings are difficult to remember
This is easily solved by appropriate text explaining the meaning, also by margin notes with the meaning or lists of acronyms, etc. For me it’s not a problem since just the acronym can be the meaning; in my opinion e.g. ‘DVI’ and ‘horse’ have the same rights as words.
frequent use of capital letters makes text less readable
This is easily solved by using acronyms less frequently or by using slightly smaller type for them.
different ideas are represented by the same acronym
Usually the context determines the meaning, for example ‘LSD’ as a substance and as a digit are rarely used in a single work (unless as an example of an acronym, but here the meaning is not significant). This may be a problem when the acronym is associated with certain emotions, for example one DRM is an evil limitation of freedom, while another DRM is a part of software enabling efficient use of GPUs in free operating systems.

Solutions

These problems can be solved in the following ways:

  • rewriting text to use fewer acronyms – the best way, although beyond the scope of this text
  • using a smaller font for acronyms
  • putting the definition of the acronym on first use, e.g. in a margin note or a tooltip (as usually in this blog)
  • including a list of all acronyms used in the work with their definitions

LaTeX packages available on CTAN

I’ve found four related packages on CTAN:

acromake
This package supports defining commands for acronyms. Each will result in the full name and the acronym on first use. On the second use a reference to the definition will be made and next uses will put only the acronym.
acronym
This package also provides commands allowing precise selection of where full or short names should be used. A list of acronyms is also made. The manual of this package explains how acronyms may use a smaller font.
glossaries
This package allows preparation of many glossaries and can be used for lists of acronyms.
glosstex
Another package for typesetting acronyms. It differs by using a program written in C.

My package

Since there are many packages for typesetting acronyms, and I don’t use most of their features, I wrote a new package for this. It is called acronyms and is available from my Mercurial repository. It differs by having evolved from very simple macros for typesetting just the acronym with appropriate spacing and smaller type, adding macros for specific acronyms and support for lists of acronyms. Then I wrote general macros for acronym definitions and added more incomplete features. Now it has optional support for acronym lists using the glossaries package, indexing chosen acronyms, and making margin notes with definitions on first use of each acronym.

Raster graphics file optimization

Several weeks ago I wrote about raster graphics file formats used on the Web. Today I read an article by Susie Sahim about optimizing such files.

The basic things to do are:

  • crop out blank margins – the video by Sahim shows how to do it ‘by hand’ in a proprietary image editor; the GIMP has special functions to do it automatically
  • use an appropriate file format

Usually only three file formats are available for graphics on the Web – GIF, PNG and JPEG. The article states that JPEG should be used for photos and colourful artwork; it does not explain that the lossy compression of this format degrades image quality, and that this degradation is least visible in these types of artwork.

The article recommends using GIF or PNG for images with a small number of different colours. Maybe the software used in the video does not have this problem, but most programs which I use make large PNG files. This can be solved by processing them with OptiPNG. OptiPNG is slower than the programs producing poorly optimized PNGs, but more time will be saved when transferring the file. This program also selects the optimal colour space and converts similar file formats to PNG.

I don’t use GIFs, but in some cases they are smaller than PNGs produced from them by OptiPNG. The only cases in which I’ve observed this are icons; maybe it is better to put many icons in one file and select them with the CSS property background-position?
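The idea of many icons in one file could look as follows. This is only a sketch under several assumptions: all icons have the same size and are placed side by side in one row, and the file name icons.png and the class names are hypothetical. The script only computes the offsets and prints CSS rules; it does not create the image.

```python
def sprite_rules(names, icon_width, icon_height, sheet="icons.png"):
    """Return one CSS rule per icon in a single-row sprite sheet.

    The icon at position i starts i * icon_width pixels from the left,
    so the background is shifted left by that amount to show it.
    """
    rules = []
    for index, name in enumerate(names):
        x = -index * icon_width
        rules.append(
            f".icon-{name} {{ background: url({sheet}) no-repeat; "
            f"background-position: {x}px 0; "
            f"width: {icon_width}px; height: {icon_height}px; }}"
        )
    return rules


rules = sprite_rules(["home", "search", "feed"], 16, 16)
for rule in rules:
    print(rule)
```

With one file the per-file overhead of the format is paid once, which may matter more than the choice between GIF and PNG for very small images.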
