Some limitations of popular free Web log analyzer software

It is useful for a blogger to know how their site is used. Understanding which information users search for, which sites link to it, and how a post’s popularity relates to the day of the week might help in making more useful content. But getting such information should not harm the users, i.e. it should not increase the number of useless scripts which they must download, and it should not waste time which the blogger could spend writing useful texts or communicating with others.

Sources of data

Most Web servers store some data about each request in their access logs, such as the user’s IP address or the URL of the referring page. There are many formats for such data, but they all share three important properties:

  • no additional work is done client-side
  • only data specified in the HTTP headers is used
  • all accesses are logged, including those from robots.
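
As a minimal sketch of what such a log entry contains, the following Python code parses one line, assuming the server writes the widely used Combined Log Format. Other formats need a different pattern. The names COMBINED_RE, parse_line and SAMPLE are made up for this illustration, and the pattern does not handle escaped quotes inside fields.

```python
import re
from typing import Optional

# Assumption: Combined Log Format, i.e.
# host ident user [time] "request" status size "referrer" "user agent"
COMBINED_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line: str) -> Optional[dict]:
    """Return the fields of one log line, or None if it does not match."""
    match = COMBINED_RE.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # A well-formed request looks like "GET /path HTTP/1.1".
    parts = entry["request"].split()
    entry["path"] = parts[1] if len(parts) == 3 else ""
    entry["status"] = int(entry["status"])
    return entry


# Hypothetical log line, using documentation-only addresses and hosts.
SAMPLE = ('203.0.113.5 - - [10/Oct/2011:13:55:36 +0200] '
          '"GET /posts/1 HTTP/1.1" 200 2326 '
          '"http://example.org/links" "Mozilla/5.0 (X11; Linux x86_64)"')

if __name__ == "__main__":
    print(parse_line(SAMPLE))
```

Everything in the resulting dictionary was sent by the client or observed by the server. Nothing was computed on the client side.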

The problem with such data is that it lacks some information known only to the user’s browser (more formally called a user agent), like screen resolution or support for JavaScript or some useless plugins. Other information coming from the user is trivially forged: malicious bots happily pretend to be real browsers arriving via links from other pages.

A partial solution is to use JavaScript code and zero-sized, uncached images to get this information from the client. But this requires more requests per page view (especially to different servers, which makes page loading much slower), and it ignores users who disable JavaScript or use browser extensions that block such code for privacy or performance reasons.

Although nothing here is specific to free software, this situation leads to many problems with programs that analyze Web server logs.

Some uses of the data

There are several possible uses of the data stored in access logs:

  • finding which topics are popular and worth expansion
  • comparing posts with the search keywords that lead users to them; maybe the posts could be made more useful for common visitors
  • blocking access for bots which do not benefit potential users and waste bandwidth
  • finding other blogs linking to the site; they might have useful information on similar topics
  • comparing effectiveness of different posting schedules
  • finding possible problems, like broken incoming links
  • determining how popular specific browsers or operating systems are among the readers
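
To illustrate one of these uses, here is a small Python sketch that counts requests per day of the week, which is a first step towards comparing posting schedules. It assumes timestamps in the form used by the common Apache-style log formats (for example 10/Oct/2011:13:55:36 +0200). The function name views_by_weekday is made up for this example.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def views_by_weekday(timestamps):
    """Count log timestamps per weekday, in each request's own time zone."""
    counts = Counter()
    for stamp in timestamps:
        # Assumed timestamp layout: day/month/year:hour:minute:second offset
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        counts[WEEKDAYS[when.weekday()]] += 1
    return counts


if __name__ == "__main__":
    stamps = [
        "10/Oct/2011:13:55:36 +0200",
        "10/Oct/2011:14:00:00 +0200",
        "11/Oct/2011:09:00:00 +0200",
    ]
    for day, count in views_by_weekday(stamps).items():
        print(day, count)
```

Such counts are meaningful only after requests from bots have been removed, which is the difficult part.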

All of these might be used to make the site more useful. The programs should make this easy, but it is not as simple as it seems.

How spam makes it difficult

For most uses, only data about human visitors is helpful. Data about bot visits is needed only to block unfriendly bots or to correct technical problems.

The problem is that only the useful bots want to be identified as bots. The ones which send spam, copy content to spam sites, harvest mail addresses to send spam, and so on, do not want to be known, since this would make it trivial to disallow their visits. So they pretend to be popular Web browsers and use many IP addresses without any clear pattern.

Many spam bots can be easily identified because they send identifications of very old browsers (some of which could no longer access the site due to changes in the Web protocols), or by strange usage patterns such as visiting only a single page, always with the same referring page, and not fetching any styles or images. They also request URLs used by insecure Web applications and pretend to arrive from certain sites in the hope of getting a link to those sites (this is called referrer spam). This spam is useless in most cases, since properly written sites do not publish referrer URLs, except on pages like password-protected log analyzer reports (with all links marked to be ignored by search engine crawlers). But it still makes the log analyzers less useful.
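
The signs described above can be sketched as simple checks in Python. This is only an illustration under several assumptions: each request is a dictionary with the keys host, path, referrer and agent, requests are grouped by IP address, and the patterns in OLD_AGENT_RE, PROBE_PATHS and ASSET_SUFFIXES are arbitrary examples, not a tested blocklist.

```python
import re
from collections import defaultdict

# Illustrative assumptions only; a real list would need regular maintenance.
OLD_AGENT_RE = re.compile(r"MSIE [1-5]\.")
PROBE_PATHS = ("/phpmyadmin", "/wp-login.php")
ASSET_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".ico")


def suspicious_hosts(entries):
    """Map each suspicious host to the list of reasons it was flagged."""
    by_host = defaultdict(list)
    for entry in entries:
        by_host[entry["host"]].append(entry)

    flagged = {}
    for host, requests in by_host.items():
        # Ignore query strings and letter case when classifying paths.
        paths = [r["path"].split("?", 1)[0].lower() for r in requests]
        reasons = []
        if any(OLD_AGENT_RE.search(r["agent"]) for r in requests):
            reasons.append("very old browser identification")
        if any(p.startswith(PROBE_PATHS) for p in paths):
            reasons.append("request for an insecure application URL")
        fetched_assets = any(p.endswith(ASSET_SUFFIXES) for p in paths)
        has_referrer = requests[0]["referrer"] not in ("", "-")
        if len(requests) == 1 and not fetched_assets and has_referrer:
            reasons.append("single page with referrer, no styles or images")
        if reasons:
            flagged[host] = reasons
    return flagged
```

The last check also catches humans who use text browsers or have the styles cached, so such rules are better suited to marking visits as doubtful than to blocking them outright.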

Problems of common log analyzers

One of the most visible things I observed after visiting the Wikipedia list of Web log analyzers is that most of them are very old. Of the ones not using MySQL or PHP, one had its last release in 2004, another does not try to exclude bot visits from the statistics it generates, and using a third one was the main inspiration for this post.

Clearly, identifying new browsers and operating systems, properly determining queries from new (or renamed) search engines, and detecting malicious bots all require changes in the software. So I believe that projects without new releases this year do not detect new things, and that they have problems which make them less interesting to improve.

Another problem is that URLs are usually not unique for given content, although they should be. This is most common with forum software written in PHP, which uses different URLs for each user. Therefore log analyzers treat each visit from a forum thread as a visit from a different page. This makes lists of referring URLs much less friendly to humans, who are more interested in pages than in their specific URLs.
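
One possible measure, sketched below in Python, is to normalize referring URLs before counting them. It assumes that the per-user part of the URL is a session identifier passed as a query parameter. The parameter names in SESSION_PARAMS are guesses for illustration, and real forum software may use others.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed names of per-user session parameters; adjust for real software.
SESSION_PARAMS = {"sid", "phpsessid", "s"}


def canonical_referrer(url):
    """Drop session parameters and fragments so equal pages compare equal."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    query.sort()  # parameter order should not create distinct URLs
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(query), "")
    )


if __name__ == "__main__":
    print(canonical_referrer(
        "http://forum.example.org/viewtopic.php?t=42&sid=abc123"))
```

With this, two visits from the same forum thread by different users are counted as coming from one page.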

There are probably no perfect solutions for spam in statistics, but the programs could vastly decrease its amount with trivial measures.

Solutions

There are two ways of solving these problems: correcting an existing program or writing a new one. Most free software log analyzers are written either in C, which is better suited to very different programs, or in Perl, which is appropriate for much smaller programs and probably encourages some of their design mistakes, so correcting an existing one would be difficult for me. Maybe it would be an interesting learning experience to write another faulty log analyzer?

2 Comments

Thank you for the great article. You have made several very good points. Log analyzers are probably not maintained because most of them do basically the same things, and they offer little value compared to popular free tag-based analytics such as Google Analytics. However, our product Web Log Storming is different from the others and we still develop it (though it's not free).

In essence, it allows you to interactively change filters, drill-down into the individual visitor's details, create custom reports, etc. We aim to offer a product flexible enough to eliminate the need to develop your own system.

I suggest you download a free 30-day trial and see how it fits your needs. If you wish, I'd like to discuss ideas for further improvements with you, so feel free to contact me with any comments you might have.

Your site has very good arguments against JavaScript-based analyzers and presents very nice uses of the statistics. The only reasons why I would not use such a solution are that it is non-free, requires downloading logs to the client's computer, and does not support the operating system which I use.

My second argument is a consequence of a design which allows much better usability than Web applications offer, and the third one looks like just a result of spending time on features useful for the majority of clients.

Your software is probably flexible enough for most common uses, but for any sufficiently complicated problem there is no optimal solution. I believe that analyzing Web logs is such a problem, and that free software, which anyone might adapt for uses unforeseen by the original developer, might be better. Personally, I just wanted to learn more about such things, and free software allows learning more about how it works.
