Common internationalization problems

Some time ago I wrote about localization of software. This post describes some problems with using a program in a language other than American English, apart from the two trivial ones: not having a single language used by everyone, and programs with no localization at all. It is based on my experience of using free software localized to Polish, but it should apply to some other inflected European languages. Some ‘localization’ mistakes can easily be observed even in English.

In these situations translations are often incorrect:

sentence/title construction
‘Remove icon’ is clearly correct, and in English ‘Remove Icon’ would probably also be accepted. But in Polish ‘Usuń Ikona’ is incorrect. There are two problems here: lack of inflection and incorrect capitalization. In this case the problem is caused by combining the plain (nominative) name of the object with a generic removal text. It would be solved by giving each object a separate ‘Remove X’ text, e.g. ‘Remove icon’ translated as ‘Usuń ikonę’ (although this will not stop translators from using incorrect capitalization in their texts). The GNU Coding Standards show a different example of this.
using a single text for counted objects
‘N comments’ is a good example of this. Even in English I have found programs using the forms ‘1 comments’ or ‘N comment(s)’. Polish is more difficult, with its three plural forms, as stated by the GNU Coding Standards. Fortunately, for positive numbers the problem is completely solved by e.g. GNU Gettext, although having a separate form for zero objects would be better still (e.g. ‘no comments’).
ignoring the grammatical gender
This may occur when constructing text about objects such as icons or floppy disks, but it is most commonly found on the Web in texts about users. In English, ‘he’ or ‘she’ is rarely used in messages about the user, but in many Indo-European languages nearly everything depends on gender. Fortunately, some software, like MediaWiki, is starting to support specifying the grammatical gender of its user. (It is interesting that many roguelikes require users to specify their gender, although they support only English.)
non-ASCII punctuation
Again, this problem can easily be shown in English. A common web browser separates its name from the page title with a hyphen where a dash should be used. Our language also has different apostrophes and quotation marks from those on the typewriters of our ancestors. For Polish it is more difficult, since even in print inner quotation marks are usually put in the wrong order.
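
The counted-objects problem above can be illustrated with a short rule for choosing among the three Polish forms. This is a minimal Python sketch, not Gettext itself: the function names `polish_plural_index` and `format_comment_count` are hypothetical, the rule is the one commonly given for Polish in Gettext catalogue headers, and the special case for zero is an assumption that follows the ‘no comments’ suggestion.

```python
def polish_plural_index(n: int) -> int:
    """Return 0, 1 or 2: which of the three Polish forms fits the count n."""
    if n == 1:
        return 0  # singular: "1 komentarz"
    # Counts ending in 2, 3 or 4 take the second form, except 12, 13 and 14.
    if 2 <= n % 10 <= 4 and not (12 <= n % 100 <= 14):
        return 1  # e.g. "2 komentarze", "22 komentarze"
    return 2      # e.g. "5 komentarzy", "12 komentarzy"


def format_comment_count(n: int) -> str:
    """Format a comment count in Polish (hypothetical helper)."""
    if n == 0:
        # Assumption: a dedicated text for zero objects ("no comments").
        return "brak komentarzy"
    forms = ("komentarz", "komentarze", "komentarzy")
    return f"{n} {forms[polish_plural_index(n)]}"


if __name__ == "__main__":
    for count in (0, 1, 2, 5, 12, 22):
        print(format_comment_count(count))
```

A program that keeps only one text, such as ‘N comments’, has no place to store the second and third forms, which is why the choice has to be made per language rather than in the program's own logic.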

There is one simple solution: write a program which uses completely correct English, and let translators correct it until it is correct in other languages too.

When should software be localized?

Localization (commonly abbreviated as ‘l10n’) is the availability of software messages in more than one language. It would be simpler if one language were used by everyone, but that problem is beyond the scope of this blog, so I limit this post to problems with localization, not its cause.

The most obvious problem is the lack of localization. Most software is written in English, but for many people other languages are more appropriate. But should everything be translated? In my opinion it depends on the software:

templates for text
things like text inserted into documents by LaTeX packages or publicly visible messages of web applications clearly must be in the same language as the surrounding text (which may vary within multilingual documents)
software with a GUI
using these in a non-native language may be slower for some people, and for others it may make the software more difficult (or impossible) to use; hence a translation should be provided (except for Emacs, which currently does not support translated messages, the programs with GUIs which I usually use have at least partial Polish translations enabled)
command-line interfaces
nowadays these are probably used only by people who read English documentation or participate in English fora, where showing non-English software messages may lead to getting fewer answers, so a translation is less necessary; also, a good command-line interface does not use too much text
programming languages
clearly, writing a program in a language other than English makes it much less useful, hence this is stated explicitly in coding standards such as those of the GNU Project (I write all my programs in English, except some unpublished LaTeX code for documents written in Polish)

The above opinions assume that the localization is perfect. This is obviously false, but free software using Gettext or Qt does not have the largest problems with localization (e.g. I have not used a non-free program with correct support for the three plural forms in languages like Polish).

Will we read essays written by computers?

After using the ‘random’ comic link several times on XKCD, I found one about the Turing test. When I was an IB DP student, some people thought that some of my essays were written by computer programs (I have heard similar opinions about nearly every text which I have translated from English to Polish). So if an essay written by a human may not pass the Turing test, may a text written by a computer be considered useful for us?

This is obviously true for most texts, if all textual program output is considered text. So a stricter definition of text is needed to make this question useful. A standard essay for an English writing exam might be appropriate, since such essays must meet several useful criteria, like having interestingly complicated grammar and a discombobulating message with clearly visible personal involvement.

It is clearly difficult to describe an essay as an algorithm. Although describing ideas clearly is one of the largest problems in essay writing, a program converting a trivial description of reasoning into an essay would still be useful. Essays involve many examples which should not be the same in every student’s work, so a large database of facts could be used to add examples for some theses.

So, given a message, the essay would be written with many encyclopedic examples and with grammatical structure as complicated as the authors of the program foresaw. From the point of view of grammar, it is nearly impossible to map an English sentence to an abstract representation of thought, but the reverse process, which the program would use, would be simple. A problem would occur when the generated sentence has other meanings unknown to the computer, but this is also a problem for human students.

It would be interesting to see how a program would represent all the facts which could be used in an essay. Humans use large collections of useful facts written in English or an equivalent language (formally, languages are not isomorphic, due to the Sapir-Whorf hypothesis, but all popular languages have the same drawback for this use). Therefore, to write it is necessary to read, which is too difficult for computers.

Maybe with a formal notation for facts useful in essays and a formal description of an essay, a computer would be able to write a highly marked essay. But I do not believe that it would be simpler for a human to write such a program and its data than to write a good essay. (I hope that a computer will quote a part of this blog entry in an essay and explain the opposite point of view.)

Problems with searching

Since a single word usually has many meanings, unknown to WWW search engines, it is difficult to find information about software with very common names.

The most common English word, ‘the’, is also the name of the THE multiprogramming system. A better-known example is TeX, whose name is written in a way nearly impossible to typeset with a non-TeX-based system, since the name TEX was already in use at the time.

This is a much larger problem for search engines which match parts of words, e.g. treating ‘tex’ as a part of ‘text’. Case-insensitive matching of whole words (with inflected forms) is also popular, but it makes finding information about LaTeX or latex (but not both) more difficult.
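
The difference between these matching strategies can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names are hypothetical, inflected forms are ignored, and real search engines work on indexes rather than scanning text with regular expressions.

```python
import re


def substring_match(query: str, text: str) -> bool:
    """Case-insensitive match of the query anywhere, even inside a longer word."""
    return query.lower() in text.lower()


def whole_word_match(query: str, text: str, case_sensitive: bool = False) -> bool:
    """Match the query only as a whole word, optionally respecting case."""
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = rf"\b{re.escape(query)}\b"
    return re.search(pattern, text, flags) is not None


if __name__ == "__main__":
    # Substring matching finds 'tex' inside 'text'.
    print(substring_match("tex", "a plain text file"))
    # Whole-word matching does not.
    print(whole_word_match("tex", "a plain text file"))
    # Case-insensitive whole-word matching cannot tell LaTeX from latex.
    print(whole_word_match("LaTeX", "a box of latex gloves"))
    # Only a case-sensitive match separates the two.
    print(whole_word_match("LaTeX", "a box of latex gloves", case_sensitive=True))
```

Case-sensitive matching separates LaTeX from latex in this sketch, but it would also miss pages that write the name as ‘Latex’, so it trades one kind of error for another.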

This may be partially solved by algorithms like PageRank. A Google search for ‘TeX’ returns only pages about TeX on the first page; for ‘LaTeX’ most are about LaTeX. But this helps only with one particular meaning. For example, it helps to find information about Python, but not about a python.

There is a partial solution: search for ‘python snakes’ instead of ‘pythons’ (although ‘pythons’ does not refer to the Python programming language, it refers to Monty Python). This clearly does not work in all situations, and sometimes it is difficult to determine appropriate keywords.

I found a similar problem when searching for useful texts about From the Teeth of Angels, a novel by Jonathan Carroll. Since Carroll is so popular in Poland, I mostly found links to shops selling the book. When I want to buy a book, I use the search engines of particular shops, where finding texts about the books is not a problem.

These problems show that using one search engine for everything causes most of the difficulties. Specialized search engines avoid the problem. But is there any other solution?

We represent concepts by words. So other representations of concepts should be more useful than words. For example, the symbol ‘i’ has different uses in complex arithmetic, procedural programming and Latin (the last two meanings are used nearly everywhere else). We could use a different language with precise descriptions of the meaning of each term used. When a review is marked as a review, we know that it is not a shop offer.
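
As a rough illustration of this idea, the following Python sketch searches by concept and kind of document instead of by words. Everything in it is hypothetical: the `Document` class, the concept identifiers, the kind labels and the sample entries are invented for the example and do not correspond to any existing system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    title: str
    concept: str  # hypothetical unambiguous identifier of what the page is about
    kind: str     # hypothetical label such as "review" or "shop offer"


# Invented sample data, used only to show the filtering.
DOCUMENTS = [
    Document("From the Teeth of Angels: a review",
             "novel:from-the-teeth-of-angels", "review"),
    Document("Buy From the Teeth of Angels",
             "novel:from-the-teeth-of-angels", "shop offer"),
    Document("Keeping a python as a pet", "animal:python", "guide"),
    Document("Learning Python", "language:python", "guide"),
]


def search(documents, concept, kind):
    """Return the documents about exactly this concept and of this kind."""
    return [d for d in documents if d.concept == concept and d.kind == kind]


if __name__ == "__main__":
    for doc in search(DOCUMENTS, "novel:from-the-teeth-of-angels", "review"):
        print(doc.title)
```

The filtering itself is trivial; the difficult part is that somebody has to assign the concept and kind to every document.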

Even if it is possible to describe terms useful in real life in an unambiguous way, it would be additional work. Therefore it will not be done in at least some cases, leaving the problem smaller than, or the same as, it is now.