Recently in WWW Category

Using blog software without server-side scripting

The software previously used on this blog (Zine) keeps the text in a relational database (in this case PostgreSQL) and on each request formats a page using this data and some Python code. Most popular blog software uses exactly the same paradigm, although most of it is not written in Python and much of it uses only MySQL for the database.

The problem with this solution is that the same pages are generated many times, although much simpler software could do the job. A typical blog is updated much less often than it is viewed, so exactly identical pages are generated multiple times, using several times more resources than necessary. Hosting typical blogs could therefore be much simpler and cheaper than this technology allows.

For this reason very large sites (or sites using very large software) use separate caching servers, like Varnish. Such a program gets a page from the original server and keeps it for some time, serving it to later requests much faster, without regenerating it. This solution still does not work well for sites that differ for every user (so the cache is usually skipped for 'non-anonymous' users), and it is difficult to avoid serving outdated pages from the cache (at least while the cache is used for unchanged pages). (Another problem is that another daemon must be running on the server, allowing friendly 503 HTTP errors when one of the daemons serving the page does not work.)

All problems of such caching servers could be avoided by correctly generating static files for each page whenever something changes. If the files are generated on a different machine than the server, the generator could be written in completely different, maybe better, ways than software that must run on the server. The pages would then be uploaded to the server, and a simple HTTP server would send them much faster than with any other solution.
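
The idea of regenerating static files only when something changes can be sketched in a few lines of Python. This is only an illustration under assumptions: the POSTS dictionary, the render and build functions and the page layout are all hypothetical, and a real generator would read posts from files and use proper templates.

```python
from html import escape
from pathlib import Path

# Hypothetical in-memory posts: slug -> (title, body).
POSTS = {
    "static-blog": (
        "Using blog software without server-side scripting",
        "All pages are plain files.",
    ),
}


def render(title: str, body: str) -> str:
    """Return a complete HTML page for one post."""
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{escape(title)}</title></head>\n"
        f"<body><h1>{escape(title)}</h1><p>{escape(body)}</p></body></html>\n"
    )


def build(posts: dict, out_dir: Path) -> list:
    """Write one HTML file per post and return the paths that changed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    changed = []
    for slug, (title, body) in sorted(posts.items()):
        page = render(title, body).encode("utf-8")
        target = out_dir / f"{slug}.html"
        # Leave up-to-date files alone, so only changed pages need uploading.
        if target.exists() and target.read_bytes() == page:
            continue
        target.write_bytes(page)
        changed.append(target)
    return changed
```

Calling build(POSTS, Path("site")) twice in a row should write the page the first time and report no changed files the second time; only the returned paths would have to be uploaded.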

This solution clearly requires all uploaded files to be generated by a single user with access to the whole site. So there won't be any multiuser features, there will be no search, and the pages will not depend on the current time (which is used for relative, friendly dates in texts like 'written five hours ago'; this could easily be done client-side using JavaScript). Still, these features seem uncommon on a typical blog written by a single user.

The problem is that the pages of a simple blog usually depend not only on their own content and on other posts but also on comments posted by users. A server-side script is necessary to receive the comments, but that is not a problem, since this is exactly what such scripts are for. There are two possible things to do with the comments obtained by the script: adding them to the post page, or putting them in a private place from which the author would move them to be published on the next update. The first solution requires making the page 'less static', but the server-side code would still be much simpler than usual. The second solution is also useful because of the useless spam (ignored by readers, search engines and writers alike) sent as comments by malevolent bots.

Therefore such 'offline' blog software would work well enough for small sites. It would also be able to do things which are too slow to be done by advanced server-side scripts, for example checking whether the example source code written in a post about the C programming language compiles (and maybe even runs and generates the output shown in the post). Such improvements could make writing higher-quality posts easier. Also, as a locally used program, it could be more user-friendly than Web-based solutions (it could e.g. allow using standard Unix tools to correct a typo in many posts, or more helpful editors than those available in a browser).
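
Checking whether a C example from a post compiles could look roughly like this. It is a sketch under assumptions: check_c_snippet is a hypothetical helper, the compiler is assumed to be called cc, and it is assumed to accept the -fsyntax-only option, as GCC and Clang do.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Optional


def check_c_snippet(source: str, compiler: str = "cc") -> Optional[bool]:
    """Return True if the snippet compiles, False if it does not,
    and None if the compiler is not installed."""
    if shutil.which(compiler) is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "snippet.c"
        src.write_text(source, encoding="utf-8")
        # -fsyntax-only (GCC, Clang) checks the code without writing output files.
        result = subprocess.run(
            [compiler, "-fsyntax-only", str(src)],
            capture_output=True,
            text=True,
        )
        return result.returncode == 0
```

A generator could call this for every C listing in a post and refuse to publish when it returns False. Running the compiled program and comparing its output with the text of the post would need more care, since it means executing code.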

Maybe it is worth writing such a program (or finding and using an existing one). Converting an existing blog to use such software would certainly not be trivial (e.g. duplicate entries in feed readers must be avoided and all useful data like comments must be imported), but the benefits might outweigh the effort.

Some limitations of popular free Web log analyzer software

It is useful for a blogger to know how their site is used. Understanding which information the users are searching for, which sites linked them to it, or how a post’s popularity relates to the weekday might help in making more useful content. But getting such information should not harm the users, i.e. it should neither increase the number of useless scripts which they must download nor waste time which the blogger might use to write useful texts or to communicate with others.

Sources of data

Most Web servers store in their access logs some data about each request, like the user’s IP address or the referring page URL. There are many formats of such data, but they all share three important things:

  • no additional work is done client-side
  • only data specified in the HTTP headers is used
  • all accesses are logged, including those from robots.
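
As an illustration of what such a log entry contains, here is a minimal parser for the "combined" format used by Apache and nginx. It is a sketch, not a complete implementation: the names COMBINED and parse_line are my own, and the regular expression does not handle escaped quote characters inside the request or the user agent.

```python
import re
from typing import Optional

# Combined log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line: str) -> Optional[dict]:
    """Parse one access log line; return None if it does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A size of "-" means that no response body was sent.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry
```

Everything in the resulting dictionary except the host address and the time comes from the request itself, which is why it is so easy to forge.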

The problem with such data is that it lacks some information known only to the user’s browser (more formally called a user agent), like screen resolution, support for JavaScript or some useless plugins. Other information coming from the user is trivially forged: malicious bots happily pretend to be real browsers arriving via links from other pages.

A partial solution to this is to use JavaScript code and zero-sized images without caching to get this information from the client. But this requires more requests per page view (especially to different servers, which makes page loading much slower), and it ignores users who disable JavaScript or use browser extensions blocking such code for privacy or performance reasons.

Although none of this is specific to free software, this situation leads to many problems with programs analyzing Web server logs.

Some uses of the data

There are several possible uses of the data stored in the access logs:

  • finding which topics are popular and worth expansion
  • comparing posts with the search keywords leading users to them; maybe the posts could be made more useful for common visitors
  • blocking access for bots which do not benefit potential users and waste bandwidth
  • finding other blogs linking to the site; they might have useful information on similar topics
  • comparing effectiveness of different posting schedules
  • finding possible problems, like broken incoming links
  • determining how popular specific browsers or operating systems are among the readers

All of these might be used to make the site more useful. Log analyzers should make this easy, but it is not as simple as it seems.
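
One of the uses listed above, comparing popularity with the weekday, needs very little code once the timestamps have been extracted. The sketch below assumes timestamps in the form written by Apache and nginx; hits_by_weekday is a hypothetical helper name, and %b relies on the default C locale for English month names.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def hits_by_weekday(timestamps) -> Counter:
    """Count requests per weekday from access log time fields."""
    counts = Counter()
    for stamp in timestamps:
        # Example time field: 10/Oct/2010:13:55:36 +0200
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        counts[WEEKDAYS[when.weekday()]] += 1
    return counts
```

The counts are only meaningful after bot visits have been filtered out, which is the difficult part.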

How spam makes it difficult

For most uses only data about human visitors is helpful. Data about bot visits is needed only to block unfriendly bots or to correct technical problems.

The problem is that only the useful bots want to be identified as bots. The ones which send spam, copy content to spam sites, get mail addresses to send spam, spam etc, do not want to be known – this would make it trivial to disallow their visits. So they pretend to use popular Web browsers and use many IP addresses without any clear pattern.

Many spam bots can be easily identified because they use identifications of very old browsers (some of which could not even access the site due to changes in the Web protocols), or by strange usage patterns like visiting only a single page, always with the same referring page, and not getting any styles or images. They also request URLs used by insecure Web applications and pretend to come from certain sites in the hope of getting a link to these sites (this is called referrer spam). This spam is useless in most cases, since properly written sites do not publish the referrer URLs, except in places like password-protected log analyzer reports (with all links marked to be ignored by search engine crawlers). But it still makes the log analyzers less useful.
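
One of these usage patterns, fetching pages but never any styles or images, can be detected with a simple heuristic. This is a sketch under assumptions: each entry is a dictionary with hypothetical "host" and "path" keys, and the list of asset suffixes is only an example. Text browsers, feed readers and visitors with cached assets would also be flagged, so the result is a list of candidates, not a verdict.

```python
from collections import defaultdict

# Example suffixes of files which real browsers normally request too.
ASSET_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".ico")


def suspicious_hosts(entries) -> list:
    """Return hosts that fetched pages but never a style, script or image."""
    pages = defaultdict(int)
    assets = defaultdict(int)
    for entry in entries:
        # Ignore the query string when looking at the file suffix.
        path = entry["path"].split("?", 1)[0].lower()
        if path.endswith(ASSET_SUFFIXES):
            assets[entry["host"]] += 1
        else:
            pages[entry["host"]] += 1
    return sorted(host for host in pages if assets[host] == 0)
```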

Problems of common log analyzers

One of the most visible things which I observed after visiting the Wikipedia list of Web log analyzers is that most of them are very old. Of the ones not using MySQL or PHP, one had its last release in 2004, another does not try to exclude bot visits from the generated statistics, and using yet another one is the main inspiration for this post.

Clearly, identifying new browsers and operating systems, properly recognizing queries from new (or renamed) search engines, and detecting malicious bots all require changes in the software. So I believe that projects without new releases this year do not detect new things and have problems that make them less interesting to improve.

Another problem is that URLs are usually not unique for a given content, although they should be. This is most common with forum software written in PHP, which uses different URLs for each user. Therefore log analyzers treat each visit from a forum thread as a visit from a different page. This makes lists of referring URLs much less friendly to humans, who are more interested in pages than in their specific URLs.
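
A log analyzer could merge such referrers by removing per-user parameters before counting. The sketch below uses only the standard urllib.parse module; the set of parameter names in SESSION_PARAMS is an assumption made for illustration, since real forum software differs in how it marks sessions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical names of per-user query parameters (compared in lower case).
SESSION_PARAMS = {"sid", "s", "phpsessid"}


def normalize_referrer(url: str) -> str:
    """Drop session parameters and the fragment from a referring URL."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(query), "")
    )
```

With this, two visits from the same forum thread with different sid values would be counted as one referring page.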

There are probably no perfect solutions for the spam in statistics, but the programs could vastly decrease its amount by trivial measures.

Solutions

There are two methods of solving these problems: correcting an existing program or writing a new one. Most free software log analyzers are written either in C, which is better suited to very different programs, or in Perl, which is appropriate for much smaller programs and probably encourages some of their design mistakes, so correcting them would be difficult for me. Maybe it would be an interesting learning experience to write another faulty log analyzer?

Three HTML elements improving document usability

One of the main advantages of the Web is that nearly everyone can use it. The same document may be rendered in very different ways on different devices. This is the reason why HTML, the markup language used for most text on the WWW, specifies the semantics of documents instead of their appearance. Therefore many tags and attributes in HTML are not visually significant, but they can make it easier to get useful information from the text. This post lists three common things which can be improved with such elements.

acronyms

Many texts use large numbers of acronyms and abbreviations, but their meaning is not always remembered by the readers. Many acronyms also have more than one meaning, e.g. technical texts about AMD GPU support on GNU/Linux use the DRM acronym in two meanings – one very useful and one very harmful.

The solution is to specify the meaning using the title attribute of the acronym element; in the GPU example it would be:

<acronym title="Direct Rendering Manager">DRM</acronym>
allows more optimal use of modern hardware.

I use this element for the first use of each acronym in all posts on my blog. The HTML 4.01 specification also describes the abbr element, used for abbreviations. I’m not sure which of them should be used in which situation.

link titles

It is nice to know where a hyperlink leads. Therefore it should be appropriately described, by its text and by additional information provided using the title attribute. Some sites have readable URIs, but they should not be the only information allowing a user to decide if the page linked to is useful. Using the same example as before, a link with a title may be written in HTML as

<a href="http://en.wikipedia.org/wiki/Direct_Rendering_Manager"
   title="Wikipedia: Direct Rendering Manager">very useful</a>

Many guidelines for using link titles are specified in Alertbox by Jakob Nielsen. The simplest rules to follow when using link titles are to not duplicate nearby information in them and to provide the name of the resource linked to (very useful when the context does not specify this and the URI is numeric).

definition lists

In this list it is probably useful to quickly scan the names of its elements and read the more useful ones. Definition lists are formatted for such use. In HTML they are specified using three elements – dl contains the whole list, the elements of which are dt containing the defined term and dd containing the definition (each term may have many definitions and each definition may have many terms).

Unlike the previous elements, definition lists are equally useful in print. This might make them popular and easy to use correctly, although commonly itemized lists with term and definition separated by a dash are used instead. In my opinion the lack of support for definition lists in a popular word processing package contributes to this (fortunately LaTeX and wikis have equally good support for this as for the other types of lists).

They clearly should be used for definitions, but the HTML specification suggests using them for dialogue, although there are arguments against it.

These elements have also a disadvantage – sites without them are probably even less usable for people who correctly specify acronyms, link titles and definition lists.

Common internationalization problems

Some time ago I wrote about localization of software. This post describes some problems of using a program in a language other than American English, apart from the two trivial ones: not having a single language used by everyone, and programs without localization. It is based on my experience in using free software localized to Polish, but it should apply to some other European inflected languages. Some ‘localization’ mistakes can be easily observed even in English.

In these situations translations are often incorrect:

sentence/title construction
‘Remove icon’ is clearly correct, maybe in English ‘Remove Icon’ would be also accepted. But in Polish ‘Usuń Ikona’ is incorrect. There are two problems here: lack of inflection and incorrect capitalization. In this case the problem is caused by using the normal name of the object with a general removal text. It would be solved by each object having a separate ‘Remove X’ text, e.g. ‘Remove icon’ translated into ‘Usuń ikonę’ (although it won’t make translators avoid using incorrect capitalization in their texts). The GNU Coding Standards show a different example of this.
using a single text for counted objects
‘N comments’ is a good example of this. Even in English I have found programs using the form ‘1 comments’ or ‘N comment(s)’. In Polish it is more difficult with three plurals, as stated by the GNU Coding Standards. Fortunately, for positive numbers the problem is completely solved by e.g. GNU Gettext, although having a different form for zero objects would be still better (e.g. ‘no comments’).
ignoring the grammatical gender
This may occur in construction of text about such objects as icons or floppy disks, but it is commonly found on the Web in texts about users. In English ‘he’ or ‘she’ are rarely used in messages about the user, but in many Indo-European languages nearly everything depends on gender. Fortunately, some software begins to support specifying grammatical gender of its user, like MediaWiki. (It is interesting that many roguelikes require the user to specify their gender, although they support only English.)
non-ASCII punctuation
Again, this problem can be easily shown in English. A common web browser separates its name from the page title by a hyphen while a dash should be used. Our language has also different apostrophes and quote marks than typewriters of our ancestors. For Polish it is more difficult, since even in print inner quote marks are usually put in incorrect order.
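
The counted objects case above can be illustrated with the plural rule which gettext catalogs use for Polish. This is a plain Python sketch, not a use of the gettext API: polish_plural_index and comment_count are hypothetical helpers, and the zero case is handled separately, as suggested above.

```python
# Polish forms of "comment": for 1, for 2-4 (except 12-14), and for the rest.
FORMS = ("komentarz", "komentarze", "komentarzy")


def polish_plural_index(n: int) -> int:
    """Index of the plural form for n, following the Polish gettext rule:
    n==1 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2
    """
    if n == 1:
        return 0
    if n % 10 >= 2 and n % 10 <= 4 and (n % 100 < 10 or n % 100 >= 20):
        return 1
    return 2


def comment_count(n: int) -> str:
    """Return a Polish text for n comments, with a separate form for zero."""
    if n == 0:
        return "brak komentarzy"  # "no comments"
    return f"{n} {FORMS[polish_plural_index(n)]}"
```

In a real program the rule belongs in the Plural-Forms header of the translation catalog and the lookup is done by ngettext, so the program itself contains no language-specific logic.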

There is one simple solution – write a program which uses completely correct English and let translators correct it until it is correct in other languages.

Common mistakes of authentication on the Web

Today many people use many online services. Each service wants to identify the user, so it needs to check whether a human is using it, and which human it is. But these checks aren’t always correct.

Many real world security systems are designed to be seen by the humans who pay for them. Clearly, this criterion prefers solutions difficult for humans over solutions difficult for bots, since the former are easier for a human to notice. A nice example of this is a CAPTCHA. It is clearly a problem for humans: I usually need three tries to correctly read the text from a CAPTCHA.

For bots CAPTCHAs are not always difficult. Some are designed to be difficult for humans to read, since this is easily considered ‘secure’, yet they are easily readable by bots. The reCAPTCHA sites list several examples of such snake oil CAPTCHAs; I have seen one of them on the site of one of the most well-known technical universities in Poland. It wasted human time, but not very much, since sometimes the text was the same as the previous time. Clearly, this wasn’t useful.

A CAPTCHA could be necessary on that site, since it generated easy-to-guess passwords and usernames. On everything else I use long passphrases or the output of head -c6 /dev/random | base64, which produces clearly better passwords than the five lowercase letters generated by the technical university. Of course, even a password of five lowercase letters is more secure than a password of five lowercase letters sent in an unencrypted email. It is best when the user may choose any username and password, as many other universities allow.
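
The difference between these passwords can be estimated in a few lines. The sketch below is a Python counterpart of the shell command, using the standard secrets module instead of reading /dev/random; random_password and entropy_bits are hypothetical helper names, and the entropy formula assumes that every symbol is chosen uniformly at random.

```python
import base64
import math
import secrets


def random_password(num_bytes: int = 6) -> str:
    """Encode num_bytes random bytes as Base64, like head -cN /dev/random | base64."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


def entropy_bits(alphabet_size: int, length: int) -> float:
    """Entropy in bits of a uniformly random string of the given length."""
    return length * math.log2(alphabet_size)
```

Six random bytes give an 8-character password with 48 bits of entropy, while five lowercase letters give entropy_bits(26, 5), about 23.5 bits, so the generated password is harder to guess by a factor of roughly 2 to the power of 24.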

When the user has a password, they may forget it (or forget where they wrote it down). Then there are several solutions. Some services allow them to answer questions which they have written previously. These questions may be trivial to answer, so I use separate outputs of head -c12 /dev/random | base64 as the question and the answer (a question of 16 random characters answered by another 16 random characters). Other services send emails with a URL allowing the password to be changed. This is not completely secure, since email is insecure, but it may be improbable that someone else will read this email before the URL is used by the correct user.

Jakob’s Law of the Internet User Experience, stating that ‘users spend most of their time on other websites’, leads to a clear conclusion in this case. The popular ‘solutions’ will remain popular, since people know them. But avoiding the mistakes described in this post should not be a problem for usability – a better CAPTCHA or none is easier to use than a bad one, and people usually enter passwords and use emails to reset passwords (although these emails are probably not read, since usually they work as expected). It is nice that an organization valuing security or usability may by one decision improve both security and usability.