New (?) idea to optimize database

Boris 'pi' Piwinger 3.14 at piology.org
Sun Mar 19 10:27:24 CET 2006


David Relson <relson at osagesoftware.com> wrote:

>> Do bogominitrain, remove all tokens which show up only once
>> in the training body (to do so, full training is needed in
>> a separate body). Also prevent those tokens from being added
>> again and do bogominitrain again. Repeat until it has converged.

Clearly, this could be improved by first identifying those
tokens which show up only once in the mail collection and
then doing bogominitrain with prevent.db.
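
As a rough illustration of that first step, here is a minimal Python
sketch that collects the tokens occurring in exactly one message of a
collection. It is not bogofilter code: the regex tokenizer is a crude
stand-in for bogofilter's real lexer, "once" is assumed to mean "in
exactly one message", and the names tokenize and build_prevent_list
are made up for this example.

```python
import re
from collections import Counter

# Hypothetical tokenizer; bogofilter's real lexer is far more elaborate.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'_-]+")


def tokenize(message):
    """Return the set of distinct lower-cased tokens in one message."""
    return {tok.lower() for tok in TOKEN_RE.findall(message)}


def build_prevent_list(messages):
    """Return the tokens that occur in exactly one message (hapaxes)."""
    doc_freq = Counter()
    for message in messages:
        # Each message contributes at most 1 per token.
        doc_freq.update(tokenize(message))
    return {tok for tok, n in doc_freq.items() if n == 1}


if __name__ == "__main__":
    mails = ["cheap pills now", "cheap meeting agenda", "meeting now"]
    print(sorted(build_prevent_list(mails)))
```

The resulting set plays the role of prevent.db; writing it to an actual
database file is left out here.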

>Clearly the "prevent" part is a problem as it implies a "prevent"
>database and bogofilter lacks such a concept hence couldn't use
>prevent.db even if it existed.

Right.

>A prevent database (as you describe it) is simply the hapax list
>(tokens that appear once) from the current list.  It could be built
>fairly easily with

This is what I had in mind.

>Bogofilter _could_ be modified fairly easily to write the database
>update info (from the "-n" and "-s" flags) to a file which could be
>filtered with the prevent list and the resulting tokens could be loaded
>into the wordlist using bogoutil.
>
>As this implies running bogofilter, then grep, then bogoutil for each
>message, it would be very slow.

Indeed, no chance.
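
For comparison, the filtering step itself is cheap when it happens
inside one process instead of three programs per message. The sketch
below only illustrates that idea: a plain dict stands in for the
wordlist, and register_message and its parameters are hypothetical
names, not anything bogofilter or bogoutil provides.

```python
def register_message(wordlist, tokens, prevent, is_spam):
    """Count a message's tokens in the wordlist, skipping prevented ones.

    wordlist maps token -> {"spam": int, "good": int}; it is an in-memory
    stand-in for the real database.
    """
    key = "spam" if is_spam else "good"
    # Count each token once per message and drop anything on the prevent list.
    for tok in set(tokens) - set(prevent):
        counts = wordlist.setdefault(tok, {"spam": 0, "good": 0})
        counts[key] += 1
    return wordlist


if __name__ == "__main__":
    wl = {}
    register_message(wl, ["cheap", "pills"], {"pills"}, is_spam=True)
    register_message(wl, ["cheap", "agenda"], {"pills"}, is_spam=False)
    print(wl)
```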

>P.S.  On a related note, in the past Greg and I have experimented with
>removing hapaxes from the database.
>Doing this has a noticeable effect on bogofilter's results -- accuracy
>goes down.

I'd claim this is because you cannot tell good from bad
hapaxes. In contrast, bogominitrain leaves every token with
a very small count. My idea is that the significance of the
remaining tokens will increase if we get rid of those which
are used (during training) to classify exactly one message
and presumably have no future use.

One might then argue that new messages will change this.
This will happen once in a while. So my concept would have
bogominitrain recreate prevent.db in each training
session.
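
To make the effect of new messages concrete, here is a small Python
sketch in which the prevent list is recomputed from per-token message
counts whenever it is asked for, so a token drops off the list as soon
as a second message contains it. The class PreventList and its methods
are invented for this illustration and assume messages are already
tokenized.

```python
from collections import Counter


class PreventList:
    """Tracks in how many messages each token occurs (hypothetical helper)."""

    def __init__(self):
        self.doc_freq = Counter()

    def add_message(self, tokens):
        # A token counts once per message, however often it repeats.
        self.doc_freq.update(set(tokens))

    def current(self):
        """Tokens seen in exactly one message so far."""
        return {tok for tok, n in self.doc_freq.items() if n == 1}


if __name__ == "__main__":
    prevent = PreventList()
    prevent.add_message(["offer", "viagra"])
    prevent.add_message(["offer", "today"])
    print(sorted(prevent.current()))   # both "today" and "viagra" are hapaxes
    prevent.add_message(["today", "lunch"])
    print(sorted(prevent.current()))   # "today" has left the list
```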

Tests would be interesting, but pretty slow. Actually, I
rebuilt my database yesterday with 120,000+ messages. It
took about 10 hours.

pi


