May 2009 Archives

Why I didn’t like Gmail and how spam works

Update: in December 2009 I decided to use Gmail instead of having all these problems with running my own mail server. The explanation of these problems (which does not cover the server's power use) might still be useful.

Today Google again decided to block my IP address from sending mail to Gmail users.

In the early days, Internet users trusted each other. This is well known, especially from texts about ITS (e.g. the OS and JEDGAR entries in the Jargon File). SMTP, the protocol on which Internet mail is based, was developed in those times.

This protocol made spam (named after the Monty Python sketch) very easy: nearly all the cost of email fell on the receiver, who had to store it or relay it to other hosts.

The harm done by spam is obvious, so some receivers blocked the hosts that sent it. Spammers then used open relays, SMTP servers that forward email from anyone to anyone. Therefore open relays were blocked too.

Now spammers use botnets: large numbers of hosts running insecure software. This makes fighting the spammers directly impossible. The same botnets also have other uses, some of them very political.

There is no general way of finding out who controls a botnet. It would also be too inconvenient to block millions of users.

Since most users do not have their own SMTP servers and their ISPs prefer them to relay mail through the ISP's SMTP servers, it is possible to block mail sent directly from spam-sending hosts. So-called ‘real time blackhole lists’ (RBLs) are used for that.
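To make the mechanism concrete: lists of this kind are commonly queried over DNS. The receiver reverses the octets of the sender's IPv4 address, appends the name of the list's zone, and treats any answer as "listed". The sketch below shows that convention only. The zone `dnsbl.example.org` and both function names are hypothetical, and it is not meant to describe what Gmail or any particular list actually does.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name used to look up an IPv4 address in a DNS-based list."""
    address = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(address).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the list answers the query with an address record."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False  # no record, or the lookup failed: treated as not listed
    return True


# 192.0.2.99 is a documentation address; the zone name is a placeholder.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
# prints 99.2.0.192.dnsbl.example.org
```

Only `dnsbl_query_name` is exercised here, since `is_listed` needs network access and a real zone.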

There are three main problems with RBLs: the organizations controlling them are not entirely neutral, it is difficult to get removed from an RBL, and the same IP addresses are used for both spam and ham. The reasons why RBLs should not be used are nicely explained by Samuel Hart.

Some RBLs trust only users of very expensive connections. Many do not allow any mail from dynamic IP addresses, and in many places there are no cheap static IP services. Supporting only those who are willing to pay more is fundamentally wrong (except for paid services).

Since I have my own SMTP server (privacy and the need to learn by doing are my main reasons for this) and I use an affordable Internet connection, an RBL decided to block my current IP address. Therefore, technically, I cannot send mail to Gmail users.

There are good methods of fighting spam: use a secure, free operating system, recommend it to your friends, and oppose RBLs. Bayesian spam filtering, done only by the recipient, may help if someone sends spam to you.
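As a rough illustration of what recipient-side Bayesian filtering means, here is a minimal naive Bayes sketch with add-one smoothing. The class name, the whitespace tokenizer and the four training messages are all made up for this example. Real filters use far more training mail and better tokenization.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Toy spam filter: word counts per class, add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def spam_probability(self, text):
        # Assumes at least one message of each class has been trained.
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Start from the class prior, then add one log-likelihood per word.
            score = math.log(self.message_counts[label] / total_messages)
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
            log_scores[label] = score
        # Turn the two log scores into P(spam | text).
        return 1 / (1 + math.exp(log_scores["ham"] - log_scores["spam"]))


mail_filter = NaiveBayesFilter()
mail_filter.train("cheap pills buy now", "spam")
mail_filter.train("buy cheap watches now", "spam")
mail_filter.train("meeting notes for monday", "ham")
mail_filter.train("lunch on monday", "ham")

print(mail_filter.spam_probability("buy cheap pills"))  # above 0.5
print(mail_filter.spam_probability("monday meeting"))   # below 0.5
```

The point of doing this on the recipient's side is that the filter learns from the mail one person actually receives, instead of judging the sender's IP address.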

Problems with searching

Since a single word usually has many meanings, which Web search engines cannot tell apart, it is difficult to find information about software with very common names.

The most common English word, ‘the’, is also the name of the THE multiprogramming system. A better-known example is TeX, whose name is written in a way that is nearly impossible to typeset with a non-TeX-based system, because the name TEX was already in use at the time.

This is a much larger problem for search engines that match substrings of words, e.g. treating ‘tex’ as a part of ‘text’. Case-insensitive matching of whole words (including inflected forms) is also popular, but it makes it harder to find information about LaTeX or latex (but not both).
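The difference between these matching strategies is easy to show with regular expressions. This is only a sketch: the four "documents" and the function names are invented, and inflected forms are not handled.

```python
import re

documents = ["plain text editor", "LaTeX manual", "latex gloves", "TeX by Knuth"]


def substring_match(query, docs):
    """Case-insensitive substring search: 'tex' also hits 'text' and 'latex'."""
    return [d for d in docs if query.lower() in d.lower()]


def whole_word_match(query, docs, case_sensitive=False):
    """Match the query only as a whole word, optionally respecting case."""
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = re.compile(rf"\b{re.escape(query)}\b", flags)
    return [d for d in docs if pattern.search(d)]


print(substring_match("tex", documents))
# ['plain text editor', 'LaTeX manual', 'latex gloves', 'TeX by Knuth']
print(whole_word_match("tex", documents))
# ['TeX by Knuth']
print(whole_word_match("latex", documents))
# ['LaTeX manual', 'latex gloves']
print(whole_word_match("LaTeX", documents, case_sensitive=True))
# ['LaTeX manual']
```

Only the case-sensitive variant separates the typesetting system from the material, and it would in turn miss pages that write the name in lower case.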

This may be partially solved by algorithms like PageRank. A Google search for ‘TeX’ returns only pages about TeX on the first page, and for ‘LaTeX’ most results are about LaTeX. But this helps only for one particular meaning, the most popular one. For example, it helps in finding information about Python, but not about a python.

There is a partial solution: search for ‘python snakes’ instead of ‘pythons’ (although ‘pythons’ does not refer to the Python programming language, it does refer to Monty Python). This clearly does not work in all situations, and sometimes it is difficult to determine appropriate keywords.

I ran into a similar problem when searching for useful texts about From the Teeth of Angels, a novel by Jonathan Carroll. Since Carroll is so popular in Poland, I found mostly links to shops selling the book. When I want to buy a book, I use search engines specific to particular shops, where finding texts about the books is not a problem.

These problems show that using one search engine for everything causes most of the difficulties. Specialized search engines avoid the problem. But is there any other solution?

We represent concepts by words, so other representations of concepts should be more useful than words. For example, the symbol ‘i’ has different uses in complex arithmetic, in procedural programming and in Latin (the last two meanings are used nearly everywhere else). We could use a different language with detailed descriptions of the meaning of each term used. When a review is marked as a review, we know that it is not a shop offer.

Even if it is possible to describe terms useful in real life in an unambiguous way, doing so would be additional work. Therefore it will not be done in at least some cases, so the problem would at best become smaller and in some cases stay the same.

After installing Debian GNU/Linux

I mainly use three computers: one stable Gentoo server, one ~amd64 Gentoo workstation, and one laptop. Since I rarely use the laptop at home, I decided to install a binary distribution on it.

I do not like to reinstall operating systems frequently. I also like having very new software, installing new versions within a week of the upstream release. The distribution should also be user-friendly, i.e. help users control it instead of controlling them through unneeded packages and hidden configuration.

The first criterion limits my choice to very few GNU/Linux distributions (I did not try any other, better, kernel due to bad experiences with hardware support in the past). After reading an entry on Caleb Cushing’s blog I decided to try Debian. I already knew about some of its important advantages: very stable releases (not appropriate for my laptop), a large number of available packages (as on Gentoo or FreeBSD) and support for some forks encouraging free software (cdrkit, Iceweasel).

I thought that there were Debian Sid installation discs. Since I could not find them, I used the testing discs. During installation the only problem was the lack of wireless network support (after some updates it worked with the ath5k driver). That was the only problem until I noticed that it had installed KDE 3.

Unlike Gentoo, with its several slots per package name, Debian gives each library a different package name for each ABI. For non-library packages the same name is used. Confusingly, kwin was renamed to something more descriptive, but the package manager did nothing about this. The library situation led to three versions of GCC being installed, but this is not a problem since my laptop does not have to compile them.

After removing many unneeded packages (mostly from KDE 3) and installing the ones I needed, I noticed that TeXLive is still available only in version 2007. This is caused by the different packaging of version 2008. For Gentoo this was not a problem; they just keep it without updates. Maybe this will improve before TeXLive 2011.

In Gentoo the rc system is simple and easy to configure. In Debian removing hundreds of unnecessary packages did the job, but I still do not understand why so many runlevels are used there. I also do not know how to stop NetworkManager from putting a useless domain name in /etc/resolv.conf. In Gentoo I just edited some configuration files instead of using NetworkManager.

Still, I believe that these problems may take less time than updating a source-based distribution.
