Recently in software Category

Implementing an ‘unsorted uniq’ called ‘ununiq’


This week I described one specific feature of the standard Unix program uniq – it treats only adjacent identical lines as duplicates – and noted that it would be technically interesting to implement such a program without this restriction and without sorting the input. I then wrote three implementations of such a program and compared their performance on unpublished private data based on my blog’s access log.

Today I decided to turn one of these three programs into a near replacement for sort | uniq, for situations where a fast online algorithm that keeps the order of input lines matters more than possibly smaller memory use. For this I implemented support for some of uniq’s command line options, and I plan to write a man page.
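
The basic idea can be sketched in a few lines of Python. This is only an illustration of the technique (remember every line already seen in a hash set and print a line the first time it appears), not the source of ununiq itself; the function name is made up for this example.

```python
def ununiq(lines):
    """Yield each distinct line once, in order of first appearance.

    Works online: a line is emitted as soon as it is read, and the input
    is never sorted. The price is that every distinct line stays in memory.
    """
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


sample = ["b\n", "a\n", "b\n", "c\n", "a\n"]
print("".join(ununiq(sample)), end="")
```

Passing sys.stdin instead of the sample list would turn this into a filter usable in a pipe.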

I named the program ununiq, since this name is easy to pronounce and opposite of uniq. The program’s source is available in my Mercurial repository. Like most of my programs it is licensed under the GNU General Public License, version 3 or later.

Choosing programming language

The choice between three implementations of the basic algorithm – one in C with GLib, one in C++ with Qt and one in Python – was obvious for me. I wanted it to be:

  • faster than sort | uniq (all three programs satisfied this on the test data)
  • easy to write and maintain
  • available on all of my computers without installing too many dependencies (so without Qt, which my server doesn’t need).

Performance of my program in C with minimal C++ and Qt was worse than the one in C with GLib, so instead of using more C++ I decided to use the program written in C with GLib.

Command line options of uniq

Before this project I used an option of uniq only once, when a complicated pipe counting the number of secondary schools in each province from a CSV file contained sort | uniq -c. Now I know how to use other options of this program.

Now my program supports input and output with named files, and also skipping several leading characters of each line (the -s option). I also implemented the -d option, which outputs only repeated lines, but it writes each of them at its second occurrence. Printing them in the order of their first occurrence would make the program more complicated, so it currently ignores that order.
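
A rough Python sketch of how these two options can interact, under the assumption that the comparison key is the line without its first skip_chars characters and that the line printed is the second occurrence itself. The function and parameter names are hypothetical and do not come from the ununiq source.

```python
def repeated_lines(lines, skip_chars=0):
    """Yield a line when its key is seen for the second time.

    The key is the line with its first skip_chars characters removed.
    Output follows the order of second occurrences, not of first ones.
    """
    counts = {}
    for line in lines:
        key = line[skip_chars:]
        counts[key] = counts.get(key, 0) + 1
        if counts[key] == 2:
            yield line


# "a" appears first, but "b" is repeated first, so "b" is printed first.
print(list(repeated_lines(["a", "b", "b", "a", "a"])))
# Ignoring a two-character prefix makes "1 x" and "3 x" compare equal.
print(list(repeated_lines(["1 x", "2 y", "3 x"], skip_chars=2)))
```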

The options are accepted both as short options and as GNU long ones. Supporting them was very easy with the GLib command line option parser. Long options have the same names as in GNU Coreutils.

Future

Several options that are not yet implemented will require reading the whole input before starting output, e.g. -c, which counts how many times each line appears in the input. It should still be faster than sort | uniq. A similar observation can be made when sorting the output of ununiq: on the perltoc man page, ununiq | sort is faster by half than sort | uniq, although both produce the same output.

The program also needs more documentation and testing. It would be nice to have automated tests of all options, which would also be very helpful for porting to other operating systems.
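
Why counting cannot be done online is easy to see in a sketch: the count for a line is known only after the last input line has been read. The following Python function is one possible way to do it, not the planned implementation; it relies on dictionaries keeping insertion order (Python 3.7 or later).

```python
def count_lines(lines):
    """Return (line, count) pairs in order of first appearance.

    Nothing can be printed before the input ends, because any later
    line may still increase a count.
    """
    counts = {}
    for line in lines:
        counts[line] = counts.get(line, 0) + 1
    return list(counts.items())


for line, count in count_lines(["b", "a", "b", "c", "b"]):
    print(f"{count:7d} {line}")
```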

Installing ReactOS on KVM virtual machine on Gentoo


It is useful to have access to more than one operating system when writing a program designed to be used with a different one. This is one of the situations where platform virtualization is helpful. This post describes how I used KVM on a Gentoo GNU/Linux host to run ReactOS in a virtual machine.

In Gentoo’s Portage tree the KVM userspace part is available as app-emulation/kvm. By default it uses its own modules, but this can be disabled by emerging it with the USE flags havekernel -modules when the kernel has the appropriate options enabled.

The ebuild states that the user running KVM must be in the kvm group, so I added my user to this group using the recommended gpasswd -a <USER> kvm. To avoid logging out, I used su - <USER> in a terminal emulator, but this did not solve the problem. Then I read that this package also adds udev rules changing the group of /dev/kvm to kvm. Since I hadn’t restarted udev, I ran chgrp kvm /dev/kvm as root to set this until the next reboot. Then KVM worked correctly.

KVM looks simple enough to use without reading the whole manual. The command kvm-img create c.img 4G made a disk image of 4 GiB in the file named c.img, which I could then use by just specifying this file as a disk of the virtual machine. I used kvm -hda c.img -cdrom ReactOS.iso to install the system. It worked correctly and, except for nicer keybindings, was similar to installing Microsoft Windows.

After starting the newly installed system I noticed a trivial problem: kvm by default sets the clock to UTC, while ReactOS (like Windows) supports only local time. After checking the man page, I added the -localtime option to this command.

Since network access would be useful, I tried to configure it. Fortunately, KVM has not only a man page but also several HOWTOs on its website, one of which shows how to configure networking. I used user networking with the additional options -net nic -net user, since it is the simplest to configure. But ReactOS didn’t have a driver for the default virtual NIC. When I had a similar problem with Windows XP on real hardware, I changed the NIC to a different one, then another (from the newest to the oldest), and finally to one that came with a CD containing drivers. The ReactOS website has a list of supported NICs, which shows that most of them require downloading non-free drivers. But PCnet has included drivers and is supported by KVM, so I changed -net nic to -net nic,model=pcnet and it worked correctly.
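
Putting the pieces together, the final command line can be assembled as in the Python sketch below. The flags are exactly the ones mentioned in this post; the helper function and its parameters are invented for illustration, and newer KVM/QEMU releases may spell some of these options differently.

```python
import shlex


def kvm_command(disk_image, cdrom=None, localtime=True, nic_model="pcnet"):
    """Build the argument list for running the virtual machine.

    Hypothetical helper: only the kvm flags themselves come from the post.
    """
    argv = ["kvm", "-hda", disk_image]
    if cdrom is not None:
        argv += ["-cdrom", cdrom]          # boot medium used for installation
    if localtime:
        argv.append("-localtime")          # ReactOS expects local time
    argv += ["-net", f"nic,model={nic_model}", "-net", "user"]
    return argv


# Command for the installed system; pass cdrom="ReactOS.iso" when installing.
print(shlex.join(kvm_command("c.img")))
```

The list returned by kvm_command could be handed to subprocess.run to start the machine.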

Documentation of KVM shows that much more can be done in this case. It is nice to use working software with man pages encouraging learning about it.

Common internationalization problems


Some time ago I wrote about localization of software. This post describes some problems in using a program in a language other than American English, apart from the two trivial ones – not having a single language used by everyone, and programs without any localization. It is based on my experience in using free software localized to Polish, but it should apply to some other inflected European languages. Some ‘localization’ mistakes can be easily observed even in English.

In these situations translations are often incorrect:

sentence/title construction
‘Remove icon’ is clearly correct, and in English ‘Remove Icon’ might also be accepted. But in Polish ‘Usuń Ikona’ is incorrect. There are two problems here: lack of inflection and incorrect capitalization. In this case the problem is caused by combining the normal name of the object with a general removal text. It would be solved by each object having a separate ‘Remove X’ text, e.g. ‘Remove icon’ translated into ‘Usuń ikonę’ (although it won’t stop translators from using incorrect capitalization in their texts). The GNU Coding Standards show a different example of this.
using a single text for counted objects
‘N comments’ is a good example of this. Even in English I have found programs using the form ‘1 comments’ or ‘N comment(s)’. In Polish it is more difficult, with three forms to choose from, as stated by the GNU Coding Standards. Fortunately, for positive numbers the problem is completely solved by e.g. GNU Gettext, although having a different form for zero objects would be better still (e.g. ‘no comments’).
ignoring the grammatical gender
This may occur in the construction of text about such objects as icons or floppy disks, but it is commonly found on the Web in texts about users. In English ‘he’ or ‘she’ are rarely used in messages about the user, but in many Indo-European languages nearly everything depends on gender. Fortunately, some software, like MediaWiki, is beginning to support specifying the grammatical gender of its user. (It is interesting that many roguelikes require the user to specify their gender, although they support only English.)
non-ASCII punctuation
Again, this problem can be easily shown in English. A common web browser separates its name from the page title by a hyphen where a dash should be used. Our language also has different apostrophes and quotation marks than the typewriters of our ancestors. For Polish it is more difficult, since even in print inner quotation marks are usually put in incorrect order.
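
The counted-objects case can be illustrated with a short Python sketch. The selection rule below is the plural formula commonly used for Polish in Gettext catalogs; the function names and the separate zero form (‘brak komentarzy’, i.e. ‘no comments’) are additions made for this example.

```python
# Polish forms of "comment" for: 1, then 2-4 (but not 12-14), then all other numbers.
FORMS = ("komentarz", "komentarze", "komentarzy")


def polish_plural_index(n):
    """Select the form index the way the usual Gettext rule for Polish does."""
    if n == 1:
        return 0
    if 2 <= n % 10 <= 4 and (n % 100 < 10 or n % 100 >= 20):
        return 1
    return 2


def comments_label(n):
    """Return a label for n comments, with a dedicated text for zero."""
    if n == 0:
        return "brak komentarzy"
    return f"{n} {FORMS[polish_plural_index(n)]}"


for n in (0, 1, 2, 5, 12, 22):
    print(comments_label(n))
```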

There is one simple solution – write a program which uses completely correct English and let translators correct it until it is correct in other languages.

Comparing publishing with TeX in 1980 and in 2009


Several days ago I decided to read some old articles about TeX and check if they are still useful. This post continues this by comparing facts described in another article from the first TUGboat issue with my experiences with modern TeX software.

Ellen Swanson described the general way in which books and journals are prepared, from the ideas of their authors to binding, and how TeX makes it cheaper. The article states that typesetting one page took about 1.5 hours before TeX; in the same TUGboat issue Richard Palais wrote that it takes TeX several seconds (now it is clearly faster). When the author typesets the paper using TeX, much greater savings can be made.

These arguments still look valid, but several improvements have been made to the way TeX is used, and new non-TeX systems have been designed. Today savings with TeX should be even greater, since:

  • TeX82 is better than TeX78
  • new macro packages (e.g. LaTeX or ConTeXt) make writing papers easier and support more logical formatting, so journal-specific styles can be applied in a general way
  • MakeIndex and BibTeX (or their replacements) completely automate the non-creative part of making indexes and bibliographies, and journals may make their own styles for these without any changes to the source of the paper
  • instead of on a ‘magnetic tape’ the paper source can be sent to the publisher over the Internet
  • diff(1) and patch(1) or a version control system can be used to transfer changes made during e.g. editing
  • there is much more material for learning TeX (or LaTeX/ConTeXt) than in 1980
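
Because TeX sources are plain text, the diff-based exchange of edits mentioned in the list is easy to demonstrate. The sketch below uses Python’s difflib module in place of diff(1); the file names and the sample sentences are made up.

```python
import difflib

# Two versions of the same (made-up) paper source: the author's and the editor's.
author = [
    "\\section{Results}\n",
    "The method are fast.\n",
    "\\end{document}\n",
]
editor = [
    "\\section{Results}\n",
    "The method is fast.\n",
    "\\end{document}\n",
]

# A unified diff contains only the changed line plus some context, so it is
# small enough to review by eye and to send instead of the whole paper.
patch = list(
    difflib.unified_diff(author, editor, fromfile="paper.tex.orig", tofile="paper.tex")
)
print("".join(patch), end="")
```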

Now the biggest competition to this method of publishing comes from WYSIWYG word processing programs. In my opinion the following disadvantages make their use more costly than the above method:

  • many of them are non-free and require expensive licenses
  • they are WYSIWYG and cannot be automated to the same extent as TeX, requiring more work by humans
  • their data formats make it more difficult to compare different versions of a paper
  • they either do not support logical markup or their users rarely know how to do it, making journal-specific styling more difficult
  • their output depends on the computer used, so both the author and the publisher will end up optimizing line or page breaks
  • producing high quality output may be more difficult or impossible

So journals typeset with TeX should be cheaper, allowing better funding for really useful work like editing or high quality printing. It is best when all papers are typeset with TeX by the authors, but it is also beneficial if only some are and the rest are written with word processors and typeset in TeX by the publisher.

The advantages of printed over electronic documents


Using printed documents clearly has important drawbacks. They are produced from murdered trees (so more papers about global warming are printed), are difficult to search (maybe except several books with useful indices) and occupy physical space. Also, the printed medium encourages writing useless texts. However, they still cannot be replaced by electronic documents.

PDF and similar file formats represent pages exactly as printed. But they represent whole documents differently than as a sequence of pages. Operations like merging several documents into one, or splitting one into several that each contain some of the pages, are trivial with printed documents, but commonly used software does not support them (except for printing some pages).

Documents are not the only structure causing problems unknown in printed media – pages also lead to difficulties. For books, small pages of text are placed on larger sheets to be printed, bound and cut. Therefore, a document has both logical and physical pages, which are different in large documents (reading a two column article on a screen where only a part of the page’s height is visible looks similar to this problem). Also, at least some software for merging logical pages into a physical one tries to render documents in device-dependent ways – making the document unsuitable for viewing on screen, printing, or both.

Another problem with physical pages made of many logical pages is that the user may prefer other combinations. For example, a document with two A4 pages on one physical page is not suitable for users with printers supporting only A4 (if the document contains text, it will be much smaller than expected). Of course, even with a single page there are problems with page scaling. Americans use Letter paper while Europeans use A4. Software mixes up these formats, which leads to scaled pages with useless whitespace (or cropped text). A common, but not too harmful, sign of such problems is uneven margins (A4 and Letter have different widths).
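
The size of these effects is easy to compute. The following Python sketch uses the standard dimensions of A4 (210 × 297 mm) and Letter (215.9 × 279.4 mm) and assumes the simplest ‘shrink to fit’ behaviour, which real printing software may or may not follow; the function name is made up.

```python
# Paper sizes in millimetres: (width, height).
A4 = (210.0, 297.0)
LETTER = (215.9, 279.4)


def fit(page, paper):
    """Scale a page uniformly so it fits on the paper.

    Returns (scale, unused_width, unused_height) in the paper's units.
    """
    scale = min(paper[0] / page[0], paper[1] / page[1])
    return scale, paper[0] - page[0] * scale, paper[1] - page[1] * scale


# An A4 page on Letter paper: limited by height, whitespace left and right.
print(fit(A4, LETTER))
# A Letter page on A4 paper: limited by width, whitespace above and below.
print(fit(LETTER, A4))
# One A4 page in half of a landscape A4 sheet (two pages per sheet).
print(fit(A4, (297.0 / 2, 210.0)))
```

In the last case the scale is about 0.71, which is why text set for a full A4 page becomes much smaller than expected.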

The problems with page formats may be solved in two ways – providing the document printed on appropriate paper, or providing PDFs in both formats. Clearly, the second way requires formatting independent of page format, impossible with WYSIWYG software.

There are cases of documents being printed, corrected and photocopied for publication (or printed again with new information on the same page). This work could be completely automated with PDFs edited in a programmable way, e.g. using pdfTeX. But this would require changes in the habits of users, an effort which would be better spent on avoiding printable documents altogether, since hypertext is better and does not have these problems.