explaining Bogofilter simply

Sun Jan 25 18:28:27 CET 2004

On Sun, 25 Jan 2004 17:06:32 +0000
Richard Kimber wrote:

> Yesterday's copy of The Times had a large article about spam in which
> the author implied there were no very effective spam filters, and in
> particular he says:
> "E-mail filters that analyse natural language, known as Bayesian
> filters, may take these garbled sentences [he quotes one, see below]
> to indicate personal correspondence."
> 
> I am thinking of sending The Times a letter in response to this, and I
> thought I would run a possible version past the list to ensure I've
> not got things wrong and to see if anyone has any additional
> suggestions.  One has to bear in mind that the readership is
> non-technical and that too long or too complex a letter would not be
> published.
> 
> 
> - Richard.

Hi Richard,

Great idea!

Given that you're on the east side of the pond, I assume you're
referring to the London Times, not the New York Times. Right?

My two big suggestions are:

   Break the long sentences into shorter ones.

   Where possible, use positive constructions rather than negative ones.

> ==============================
> 
> In his article on spam (The Times, 24th January), David Rowan seems
> unduly pessimistic about detecting such email, and is wrong in his
> evaluation of the Bayesian approach.
> 
> The message he quotes ("Highland alberich rampart discovery barnet
> clothesman walpole boot brainwash ...") would only be classified as
> real mail by a decently trained Bayesian system if one's
> correspondence normally contained this combination of words.
> 
> A Bayesian system, such as Bogofilter, would not analyse this as a
> sentence, garbled or otherwise, but rather it would tokenise the
> message and compute the probability that it is spam using a previously
> accumulated list of tokens for which it knows the probability that
> each token would occur in spam and non-spam messages. The system is
> normally given initial training in the recognition of acceptable and
> unacceptable email on the basis of an email archive, and in the few
> cases where it subsequently makes a mistake, it can be told to relearn
> the message. Accuracy of classification thereby gradually improves.

A Bayesian system, such as Bogofilter, would not analyse this as a
sentence, garbled or otherwise.  It breaks the message into tokens and
looks up how often each token has appeared in spam and in non-spam
messages.  From those figures it computes the probability that the
message is spam.  The system is normally given initial training using
previously received spam and non-spam.  Occasionally the filter will
make a mistake.  When this happens, it can be told to relearn the
message.  Accuracy of classification thereby improves over time.
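
For anyone on the list who wants the idea in concrete form, here is a
rough Python sketch of tokenise, train, score and relearn.  It is only
an illustration: Bogofilter's own calculation differs in detail, and
the ToyFilter class, its method names, the smoothing constants and the
four training messages are all invented for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a message into lowercase word tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


class ToyFilter:
    """Hypothetical toy filter; NOT Bogofilter's real algorithm."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Count each token once per message, under the given label.
        self.counts[label].update(set(tokenize(text)))
        self.messages[label] += 1

    def token_spam_prob(self, token):
        # Smoothed share of spam / non-spam messages containing the token.
        # The +1 / +2 smoothing is an assumption to avoid zero probabilities.
        p_spam = (self.counts["spam"][token] + 1) / (self.messages["spam"] + 2)
        p_ham = (self.counts["ham"][token] + 1) / (self.messages["ham"] + 2)
        return p_spam / (p_spam + p_ham)

    def spam_score(self, text):
        # Combine the per-token probabilities by summing their log-odds.
        log_odds = 0.0
        for token in set(tokenize(text)):
            p = self.token_spam_prob(token)
            log_odds += math.log(p) - math.log(1 - p)
        return 1 / (1 + math.exp(-log_odds))


f = ToyFilter()
f.train("cheap pills buy now", "spam")
f.train("buy cheap watches now", "spam")
f.train("meeting agenda for the committee", "ham")
f.train("draft agenda attached for review", "ham")

spam_like = f.spam_score("buy cheap pills")
ham_like = f.spam_score("committee meeting agenda")

# The word salad quoted in the article: every token is new to the filter.
garbled = ("Highland alberich rampart discovery barnet "
           "clothesman walpole boot brainwash")
unseen = f.spam_score(garbled)

# "Relearn" the misjudged message as spam, then score it again.
f.train(garbled, "spam")
relearned = f.spam_score(garbled)

print(spam_like, ham_like, unseen, relearned)
```

In this toy, words the filter has never seen count as evidence for
neither side, so the garbled message is not pushed towards "personal
correspondence".  Once it has been relearned as spam, the same words
count against any later message that contains them.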

> I receive roughly 700 spam messages per week. At my system's current
> level of training, Bogofilter will mis-classify about 4% of these
> (roughly 3 or 4 per day) and I have not had a false positive (a
> genuine message wrongly classified as spam) for over eight weeks.

At my system's current level of training, Bogofilter catches about 96%
of these, missing only 3 or 4 per day.  It has not classified a genuine
message as spam for over eight weeks.
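
As a quick check that the figures in the letter agree with each other,
here is the arithmetic as a few lines of Python.  The variable names
are mine; the numbers are the ones Richard gives (roughly 700 spam per
week, about 4% missed).

```python
spam_per_week = 700   # from the letter: "roughly 700 spam messages per week"
miss_percent = 4      # from the letter: "mis-classify about 4% of these"

missed_per_week = spam_per_week * miss_percent / 100
missed_per_day = missed_per_week / 7
catch_percent = 100 - miss_percent

print(missed_per_week, missed_per_day, catch_percent)
```

That works out to about 28 missed per week, or 4 per day, which matches
"3 or 4 per day" and a catch rate of about 96%.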

> 
> This system, coupled with automatic deletion on the mail server of
> messages containing obvious viruses, seems to me to be pretty
> effective spam filtering.