tuning 0.10.1.1

Sun Jan 26 20:04:36 CET 2003

At 01:41 PM 1/26/03, Greg Louis wrote:

>This is a very informal report of a preliminary session tuning
>bogofilter 0.10.1.1 with Robinson's f(w) calculation and Fisher's
>method for combining probabilities (aka Robinson-Fisher).

Greg,

Bogofilter's regression tests include small spam and ham mailboxes for
building a known test database, plus some messages that are scored
against that database.  Using that environment, I wonder what the
optimum tuning parameters are.  Can I talk you into doing the work?
With your considerable experience and the small size of the corpus, it
shouldn't take you more than a few seconds :-)

>With the mime processing, I'm getting about 60% more tokens in the
>training db's spamlist than were present in the 0.8.0 training db.
>This has an unfortunate downside: lookup times are extremely long. The
>first 150,000 tokens entered into an empty list took 31 seconds to
>process; but the first 500,000 tokens entered into a separate list,
>also starting from empty, took 13 minutes and 13 seconds.  Classifying
>an individual email with the 500,000-token spamlist and the
>150,000-token goodlist can take several hundred milliseconds, and
>registering new spam messages on top of what's there now takes around
>700ms each (new nonspams are taking about 25 ms each to register).
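
To put the load times quoted above on a common scale, here is a small
Python calculation of the average cost per token.  It uses only the
figures from the report; the helper name ms_per_token is made up for
this illustration, and the results are averages over each whole load,
not measurements of any single lookup.

```python
# Figures quoted in the report above.
small_tokens, small_seconds = 150_000, 31
large_tokens, large_seconds = 500_000, 13 * 60 + 13  # 13 min 13 s = 793 s


def ms_per_token(tokens, seconds):
    """Average milliseconds spent per token over a whole load."""
    return 1000.0 * seconds / tokens


small_cost = ms_per_token(small_tokens, small_seconds)
large_cost = ms_per_token(large_tokens, large_seconds)
slowdown = large_cost / small_cost

print(f"{small_cost:.2f} ms/token, {large_cost:.2f} ms/token, {slowdown:.1f}x")
# prints "0.21 ms/token, 1.59 ms/token, 7.7x"
```

So about 3.3 times as many tokens took roughly 25 times as long to
load, i.e. the average per-token cost rose nearly eightfold.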

Sounds like the database becomes inefficient for large token sets.
Recently it has been suggested that bogofilter use a single wordlist
containing both the spam and ham counts for each token.  I imagine that
this would help classification (by halving the number of db accesses)
and hurt registration (every update, spam or ham, would write to the one
large db).  If you think it's worth testing, I'll do the coding...
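
A rough sketch of the idea in Python, to show where the halving comes
from.  This is not bogofilter code: CountingStore, SplitWordlists,
CombinedWordlist and lookup_reads are hypothetical names, and a plain
dict stands in for the on-disk database, so it only counts accesses and
says nothing about real timings.

```python
class CountingStore:
    """Dict-backed stand-in for an on-disk wordlist; counts accesses."""

    def __init__(self):
        self.data = {}
        self.reads = 0
        self.writes = 0

    def get(self, key, default):
        self.reads += 1
        return self.data.get(key, default)

    def put(self, key, value):
        self.writes += 1
        self.data[key] = value


class SplitWordlists:
    """Two lists: one holds spam counts, the other ham counts."""

    def __init__(self):
        self.spam = CountingStore()
        self.good = CountingStore()

    def register(self, tokens, is_spam):
        store = self.spam if is_spam else self.good
        for tok in tokens:
            store.put(tok, store.get(tok, 0) + 1)

    def counts(self, token):
        # Classification needs both counts: two lookups per token.
        return self.spam.get(token, 0), self.good.get(token, 0)

    def total_reads(self):
        return self.spam.reads + self.good.reads


class CombinedWordlist:
    """One list: each token maps to a (spam_count, ham_count) pair."""

    def __init__(self):
        self.db = CountingStore()

    def register(self, tokens, is_spam):
        for tok in tokens:
            spam, good = self.db.get(tok, (0, 0))
            self.db.put(tok, (spam + 1, good) if is_spam else (spam, good + 1))

    def counts(self, token):
        # One lookup returns both counts.
        return self.db.get(token, (0, 0))

    def total_reads(self):
        return self.db.reads


def lookup_reads(wordlist, tokens):
    """Number of store reads needed to fetch counts for the tokens."""
    before = wordlist.total_reads()
    for tok in tokens:
        wordlist.counts(tok)
    return wordlist.total_reads() - before


split, combined = SplitWordlists(), CombinedWordlist()
for wl in (split, combined):
    wl.register(["viagra", "free", "offer"], is_spam=True)
    wl.register(["meeting", "free", "lunch"], is_spam=False)

message = ["free", "offer", "lunch"]
print(lookup_reads(split, message), lookup_reads(combined, message))
# prints "6 3"
```

The flip side shows up in register(): with the combined list every
update, spam or ham, goes through the one big store.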