New script to train bogofilter

Peter Bishop pgb at adelard.com
Thu Jul 3 12:54:13 CEST 2003


On 30 Jun 2003 at 17:25, Boris 'pi' Piwinger wrote:

> Hi!
> 
> I wrote a perl script which trains bogofilter on error. It
> produces very small databases. We'll have to see how good
> that works. Any comments are warmly welcome.
> 
It will be interesting to see how well this works. 

In principle it might mean that the same spam is submitted several times.
Maybe the script could "quarantine" spams that have already been
used for training.

In practice I don't think this will be a big problem, as the train on error
approach should preferentially select spams with unknown tokens
rather than existing ones.
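
To make the "quarantine" idea concrete, here is a minimal sketch of a
train-on-error loop that never submits the same message twice. It is not
Boris's perl script and it does not call bogofilter: ToyFilter is a
hypothetical stand-in that just compares token counts, and the names
train_on_error, quarantined and max_passes are my own assumptions.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical stand-in for a real filter: a crude token-count scorer."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def is_spam(self, tokens):
        # Call a message spam when its tokens were seen more often in spam.
        spam_hits = sum(self.spam[t] for t in tokens)
        ham_hits = sum(self.ham[t] for t in tokens)
        return spam_hits > ham_hits

    def train(self, tokens, spam):
        (self.spam if spam else self.ham).update(tokens)


def train_on_error(filt, messages, max_passes=10):
    """Train only on misclassified messages, each at most once.

    messages is a list of (msg_id, tokens, is_spam) tuples.
    Returns the set of message ids that were used for training.
    """
    quarantined = set()
    for _ in range(max_passes):
        errors = 0
        for msg_id, tokens, label in messages:
            if msg_id in quarantined:
                continue  # already used for training, do not resubmit
            if filt.is_spam(tokens) != label:
                filt.train(tokens, label)
                quarantined.add(msg_id)
                errors += 1
        if errors == 0:
            break  # a full pass with no new errors
    return quarantined
```

One consequence of this choice is that a message which is still
misclassified after being trained once is left alone, so the quarantine
trades a little accuracy for a smaller database.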

I think this approach should be evaluated because I would expect a "train
on error" database to be more effective than a database trained
with a similar number of "typical" messages.

This is because the "train on error" process preferentially selects spams
that differ from the existing spam corpus, so helping to get an even spread
of spam tokens over the whole "universe" of spam tokens. By contrast a
"typical" set of spams tends to have a lot of similar spams with tokens that
cluster in one particular bit of the spam universe (e.g. a Viagra cluster).
So the growth of coverage of the spam universe is slower, and false 
negatives will occur when a different type of spam arrives that hits an 
unmapped part of the spam universe. 
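
The coverage argument can be illustrated with a toy comparison. This is
only a sketch under stated assumptions: "contains a token not seen so far"
is used as a rough proxy for "would be misclassified", and the function
names and the five-message corpus in the usage note are made up for
illustration, not measured data.

```python
def token_coverage(selected, corpus):
    """Fraction of the corpus's distinct tokens present in the selected messages."""
    universe = set().union(*corpus)
    seen = set().union(*selected) if selected else set()
    return len(seen & universe) / len(universe)


def pick_typical(corpus, n):
    """Take the first n messages as they arrive (a 'typical' sample)."""
    return corpus[:n]


def pick_novel(corpus, n):
    """Take up to n messages, skipping any whose tokens are all already seen.

    Assumption: novelty of tokens stands in for a classification error.
    """
    seen, picked = set(), []
    for tokens in corpus:
        if len(picked) == n:
            break
        if not set(tokens) <= seen:
            picked.append(tokens)
            seen |= set(tokens)
    return picked
```

With a made-up corpus of three Viagra spams followed by one mortgage spam
and one lottery spam, picking three messages the "typical" way stays inside
the Viagra cluster, while the novelty-driven pick skips the duplicate and
reaches the mortgage spam, so it covers more of the token universe for the
same number of training messages.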

-- 
Peter Bishop 
pgb at adelard.com
pgb at csr.city.ac.uk





