New (?) idea to optimize database

David Relson relson at osagesoftware.com
Sat Mar 18 20:20:46 CET 2006


On Sat, 18 Mar 2006 16:04:54 +0100
Boris 'pi' Piwinger wrote:

> Hi!
> 
> We had lengthy discussions about how to optimize (=minimize)
> the database to get the best performance. This is why I
> created bogominitrain. Now clearly, this will also collect
> useless tokens. Here is an idea to improve on that:
> 
> Do bogominitrain, remove all tokens that show up only once
> in the training body (to do so, full training is needed in
> a separate body). Also prevent those tokens from being added
> again and do bogominitrain again. Repeat until it converges.
> 
> Clearly this is extremely expensive, and I have no real idea
> how to implement it, but it should give a really powerful
> database.
> 
> pi

Hi pi,

Interesting idea!

Clearly the "prevent" part is a problem: it implies a "prevent"
database, and bogofilter has no such concept, so it couldn't use
prevent.db even if it existed.

A prevent database (as you describe it) is simply the hapax list
(tokens that appear only once) taken from the current wordlist.  It
could be built fairly easily with:

   for FILE in ham/*  ; do bogofilter -n < "$FILE" ; done
   for FILE in spam/* ; do bogofilter -s < "$FILE" ; done
   bogoutil -d wordlist.db | egrep " (0 1|1 0) " > prevent.txt
   bogoutil -l prevent.db < prevent.txt
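
As a quick sanity check (plain shell, nothing bogofilter-specific),
you can see how many hapaxes were captured and spot-check a few
entries before loading them:

   wc -l < prevent.txt     # number of hapax tokens found
   head prevent.txt        # spot-check a few entries of the hapax list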

Using the "prevent" tokens is, however, harder, since they form an
exclusion list that has to be applied to tokens as they are entered
into the database.

Bogofilter _could_ be modified fairly easily to write the database
update info (from the "-n" and "-s" flags) to a file; that file could
be filtered against the prevent list, and the surviving tokens could
then be loaded into the wordlist using bogoutil.
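
For illustration, here is a rough sketch of that per-message flow.
It assumes bogofilter had been modified as described to write the
would-be updates for one message to a file (tokens.txt here, a
hypothetical name, in the same format as a bogoutil dump); awk is
used rather than grep so that only the token column is compared:

   # drop tokens that appear in the prevent list, load the rest
   awk 'NR==FNR { skip[$1] = 1; next } !($1 in skip)' \
       prevent.txt tokens.txt | bogoutil -l wordlist.db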

As this implies running bogofilter, then the filter step, then
bogoutil for each message, it would be very slow.

A better way would be to modify bogofilter to open the prevent database
and check each new token against it before adding the token to the real
database.

If somebody wants to work with Pi to do this and then see the effect,
I'd be interested in hearing the results.

Regards,

David

P.S.  On a related note: in the past, Greg and I experimented with
removing hapaxes from the database.  Doing this had a noticeable
effect on bogofilter's results -- accuracy went down.
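
For anyone who wants to repeat that experiment, one way to prune the
hapaxes (just a sketch, using only bogoutil's dump and load modes and
the same egrep pattern as above) is:

   bogoutil -d wordlist.db | egrep -v " (0 1|1 0) " | bogoutil -l pruned.db

The pruned database can then be scored against a test corpus and
compared with the original.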



