explaining Bogofilter simply

Richard Kimber rkimber at ntlworld.com
Sun Jan 25 18:06:32 CET 2004


Yesterday's copy of The Times carried a long article about spam in which
the author implied that there are no very effective spam filters. In
particular, he says:
"E-mail filters that analyse natural language, known as Bayesian filters,
may take these garbled sentences [he quotes one, see below] to indicate
personal correspondence."

I am thinking of sending The Times a letter in response to this, and I
thought I would run a possible version past the list to ensure I've not
got things wrong and to see if anyone has any additional suggestions.  One
has to bear in mind that the readership is non-technical and that too long
or too complex a letter would not be published.


- Richard.

==============================

In his article on spam (The Times, 24th January), David Rowan seems unduly
pessimistic about detecting such email, and is wrong in his evaluation of
the Bayesian approach.

The message he quotes ("Highland alberich rampart discovery barnet
clothesman walpole boot brainwash ...") would only be classified as real
mail by a decently trained Bayesian system if one's correspondence
normally contained this combination of words.

A Bayesian system, such as Bogofilter, would not analyse this as a
sentence, garbled or otherwise, but rather it would tokenise the message
and compute the probability that it is spam using a previously accumulated
list of tokens for which it knows the probability that each token would
occur in spam and non-spam messages. The system is normally given initial
training in the recognition of acceptable and unacceptable email on the
basis of an email archive, and in the few cases where it subsequently
makes a mistake, it can be told to relearn the message. Accuracy of
classification thereby gradually improves.

I receive roughly 700 spam messages per week. At my system's current level
of training, Bogofilter wrongly passes about 4% of these as genuine mail
(roughly 3 or 4 per day), and I have not had a false positive (a genuine
message wrongly classified as spam) for over eight weeks.

This system, coupled with automatic deletion on the mail server of
messages containing obvious viruses, seems to me to be pretty effective
spam filtering.

===============================
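
As an aside for the list rather than for the letter, the tokenise-and-score
process described in the draft can be sketched in a few lines of Python.
This is a minimal sketch under stated assumptions, not Bogofilter's code:
the class name TokenFilter, its methods, the crude tokeniser and the
training messages are all hypothetical, and the scoring is a plain
naive-Bayes combination with add-one smoothing, which is a simplification
and not necessarily the calculation Bogofilter performs.

```python
import math
import re
from collections import Counter


def tokenise(text):
    """Split a message into lowercase word tokens (a deliberately crude tokeniser)."""
    return re.findall(r"[a-z0-9']+", text.lower())


class TokenFilter:
    """Hypothetical token-based filter; illustrative only, not Bogofilter's API."""

    def __init__(self):
        # Per-class count of messages in which each token has appeared.
        self.counts = {"spam": Counter(), "ham": Counter()}
        # Number of messages learned per class.
        self.messages = {"spam": 0, "ham": 0}

    def learn(self, text, label):
        """Register a message as 'spam' or 'ham' (also used to correct mistakes)."""
        self.counts[label].update(set(tokenise(text)))
        self.messages[label] += 1

    def token_prob(self, token, label):
        """Smoothed probability that a message of this class contains the token."""
        return (self.counts[label][token] + 1) / (self.messages[label] + 2)

    def spam_probability(self, text):
        """Combine per-token evidence into one probability that the message is spam."""
        # Start from the smoothed prior odds of spam versus ham.
        log_odds = math.log((self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        for token in set(tokenise(text)):
            log_odds += math.log(
                self.token_prob(token, "spam") / self.token_prob(token, "ham")
            )
        # Numerically stable logistic function.
        if log_odds >= 0:
            return 1.0 / (1.0 + math.exp(-log_odds))
        e = math.exp(log_odds)
        return e / (1.0 + e)


if __name__ == "__main__":
    f = TokenFilter()
    # Made-up training archive, purely for illustration.
    f.learn("cheap pills buy now", "spam")
    f.learn("buy cheap watches now", "spam")
    f.learn("meeting agenda for monday", "ham")
    f.learn("lunch on monday with the team", "ham")
    print(f.spam_probability("buy cheap pills"))
    print(f.spam_probability("monday meeting"))
    print(f.spam_probability("highland alberich rampart discovery"))
```

In this sketch, words the filter has never seen contribute no evidence in
either direction, so a string of random dictionary words is not pushed
towards "personal correspondence"; only tokens that actually occur in one's
genuine mail would do that. Correcting a mistake amounts to calling learn()
on the message with the right label.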

-- 
Richard Kimber
http://www.psr.keele.ac.uk/



