upgrading from 0.9.1.2

Peter Bishop pgb at adelard.com
Wed Aug 20 11:39:27 CEST 2003


On 19 Aug 2003 at 18:35, David Relson wrote:

> The minimum necessary is to install the new package.  After doing that, it
> would be a good idea to run bogoupgrade - to merge your spamlist.db and
> goodlist.db into a new wordlist.db.  That's all that is necessary.
> 

I upgraded from 0.9.1 to 0.13.6 and found the performance *decreased* with 
the default settings.

This is because bogofilter is now case sensitive (i.e. Free, free and FREE
are treated as different tokens).

As your 0.9 database only contains lower case tokens, the 0.14 bogofilter
is more likely to classify spams with a lot of upper case characters as ham
- so you get more false negatives.
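
To illustrate the effect, here is a minimal Python sketch. It is not
bogofilter code: the word list, the counts and the lookup() helper are all
made up for the example. It only shows why a lower-case-only database has
nothing to say about upper-case tokens when lookups are case sensitive.

```python
# Hypothetical word list as an old database might hold it: lower case only.
# The counts are invented for illustration.
wordlist = {
    "free": {"spam": 40, "ham": 2},
    "meeting": {"spam": 1, "ham": 30},
}

def lookup(token, case_sensitive):
    """Return the stored counts for a token, or None if it is unknown."""
    key = token if case_sensitive else token.lower()
    return wordlist.get(key)

variants = ("FREE", "Free", "free")

# Case-sensitive lookup: the upper-case variants are unknown tokens,
# so they contribute no spam evidence at all.
misses = [t for t in variants if lookup(t, True) is None]

# Case-insensitive lookup: every variant maps to the stored entry.
hits = [t for t in variants if lookup(t, False) is not None]

print("unknown when case sensitive:", misses)
print("known when case insensitive:", hits)
```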

For a couple of months I tried to retrain the database by adding new
case-sensitive hams and spams to the database, but performance was still
worse. So I switched back to the default 0.9 mode of operation by selecting
the "case insensitive" switch

-Pi

for both checking emails and for registering emails in the database.

Now the performance is better than 0.9.

To get the most out of 0.14, you should rebuild the database using
the original hams and spams, but this was not an option for me as I do
not keep copies of the spams or a record of which hams I used.

However, I think you now have a new option in 0.14 called
"token degeneration" controlled by the following switches:

-PD - disable degeneration (default)
-Pd - enable degeneration
-Pf - enable first match (default)
-PF - enable best indicator

-Pd means that if you don't find the token "FREE" you search for a token 
with the same letters (like "Free" or "free")
-Pf searches for all options
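
The fallback described for -Pd can be pictured roughly as follows. This is
only a sketch of the idea, not bogofilter's actual implementation: the
function names, the sample word list and the order in which the case
variants are tried are all assumptions of mine.

```python
def case_variants(token):
    """Yield the token itself, then case variants to try if it is missing.

    The order (exact, Capitalised, lower case) is an assumption made for
    this sketch only.
    """
    seen = set()
    for form in (token, token.capitalize(), token.lower()):
        if form not in seen:
            seen.add(form)
            yield form

def lookup_with_degeneration(wordlist, token):
    """Return (matched_form, value) for the first variant that is found."""
    for form in case_variants(token):
        if form in wordlist:
            return form, wordlist[form]
    return None, None

# Hypothetical lower-case-only database with an invented count.
old_db = {"free": 42}

# "FREE" is not stored, so the lookup degenerates to "free".
matched, value = lookup_with_degeneration(old_db, "FREE")
print(matched, value)
```

An exact match always wins in this sketch, so a database that already holds
"FREE" is never overridden by the lower-case entry.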

So to summarise you can either:

1) Rebuild the database from scratch and use the 0.14 defaults.

2) Use the existing lower case database with -Pi for checking and
registering.

3) Use the existing database with "-Pd -Pf" for checking, but allow
case-sensitive tokens to be registered in the database.

Option 3) behaves like option 2) initially but should eventually behave
like option 1) once there are a large number of case-sensitive tokens
in the database (but I have not tried this, as I still use 0.13.6).
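
The expected drift of option 3) can be sketched like this. Again it is an
illustration under my own assumptions, not bogofilter code: register_spam()
and check() are hypothetical helpers and the counts are invented.

```python
from collections import Counter

# Hypothetical existing database: lower-case tokens only.
spam_counts = Counter({"free": 40})

def register_spam(tokens):
    """Register tokens exactly as they appear (case sensitive)."""
    spam_counts.update(tokens)

def check(token):
    """Try the exact token first, then fall back to its lower-case form."""
    if token in spam_counts:
        return token, spam_counts[token]
    folded = token.lower()
    if folded in spam_counts:
        return folded, spam_counts[folded]
    return None, 0

# Initially the old lower-case entry answers for "FREE" (like option 2).
before = check("FREE")

# New mail is registered with its original case.
register_spam(["FREE", "FREE", "offer"])

# Now the case-sensitive entry answers directly (moving towards option 1).
after = check("FREE")

print(before, after)
```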
-- 
Peter Bishop 
pgb at adelard.com
pgb at csr.city.ac.uk





