Method of training

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Tue Sep 9 08:52:39 CEST 2003


jxz <jxz at uol.com.br> wrote:

>| >With this method, I should only train bogofilter on its errors, and
>| >there will be no need to save the whole cruft of spam, only the unsures,
>| >in case of db corruption or db schema updates.
>| 
>| I would not suggest this. Every time you train with new
>| messages, the rating of all previously seen messages
>| changes. I described an example where this can happen to the
>| unexpected direction. So what does that mean for you? If you
>| save only those messages which have been unsure (or
>| failures) when they were seen for the first time, you will
>| lose significant information when you retrain. So my advice
>| is to keep all those messages.
>
>These messages that are being saved (the unsures and the f.n. and
>f.p) are the _only_ messages that are registered in the bogofilter
>database. 

Right. That's dangerous. The other messages might be rated
totally differently at a later point in time.

>If I erase the database and reclassify those messages,
>the database and the accuracy will remain the same as before the
>deletions. In other words, backing up those messages or the dump of
>wordlist.db is the same thing. So I did not understand your point :(

Say you have a message rated as ham in the first place. So
you delete it. Now you add some messages. Now this first
message's rating has changed to spam, but you don't have it
anymore. Similar messages might now generate false
positives. Or vice versa.
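
To make the effect concrete, here is a minimal sketch in Python. It is
a toy token-count filter, not bogofilter's actual algorithm; the class
ToyFilter, the sample messages and the 0.5 dividing line are all made
up for illustration. It only shows that the score of an untouched
message can move from the ham side to the spam side once other
messages are registered.

```python
import math
from collections import Counter


class ToyFilter:
    """Toy token-count filter (hypothetical; not bogofilter's algorithm)."""

    def __init__(self):
        self.counts = {"ham": Counter(), "spam": Counter()}
        self.msgs = {"ham": 0, "spam": 0}

    def train(self, text, label):
        # Count each distinct token once per registered message.
        self.counts[label].update(set(text.lower().split()))
        self.msgs[label] += 1

    def score(self, text):
        # Combine smoothed per-token spam/ham ratios into a value in (0, 1).
        log_odds = 0.0
        for tok in set(text.lower().split()):
            p_spam = (self.counts["spam"][tok] + 1) / (self.msgs["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.msgs["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


f = ToyFilter()
f.train("meeting schedule project", "ham")
f.train("cheap pills offer", "spam")

first = "special offer on project meeting"
before = f.score(first)  # below 0.5: looks like ham, so it gets deleted

# Later registrations share tokens with the deleted message.
f.train("special offer on meeting software", "spam")
f.train("project offer special", "spam")
after = f.score(first)   # above 0.5: the same text now looks like spam

print(f"before={before:.2f} after={after:.2f}")
```

If the first message had been kept, a retrain would notice the flip;
once it is deleted, nothing in the saved corpus records it.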

>I'm tired of maintaining tons of emails, spams 

Zip them.

>and do backups of
>large bayesian classifier databases, so I'm trying to do as
>fine-grained a train-on-error as possible. 

Then you could use bogominitrain.pl and save just those used
for training, but again, I strongly suggest *not* doing it.

>When I fetch my emails via POP3 (in batches of ~40), the spams are
>saved in the spam mbox (in the future, spams classified as 1.00 will
>be erased and I will only save the headers via procmail for statistical
>purposes). 

Anyhow, you should check the logfile for errors.

>I have been doing this for a few days, and it's _much_ too soon to come to a
>conclusion. But the idea is: have as small a database as possible,
>with only errors 

Again this is bogominitrain.pl.

>| >Now I ask: does the train-on-error method work well? 
>| 
>| It works excellently. And you can even do with fewer messages;
>| you actually add messages which will at the time of adding
>| be rated correctly already. See the FAQ for details on the
>| training methods.
>
>Yes, when I have, say, 5 unsure messages classified on
>receipt, I register the first, and always run bogofilter -v on the
>next unsure to see if the last registration was enough for this message;
>sometimes it doesn't need to be registered, and so on for the other
>unsure messages.

Again, bogominitrain.pl does that for you.
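
The procedure described in the quote (re-score each message against
the current database, and register it only if it is still not rated
correctly) can be sketched in Python as below. This is an illustration
under assumptions, not the code of bogominitrain.pl and not
bogofilter's scoring: TinyFilter, train_on_error and the cutoffs 0.6
and 0.4 are hypothetical names and values chosen for the example.

```python
import math
from collections import Counter


class TinyFilter:
    """Toy stand-in for the real filter (hypothetical scoring)."""

    def __init__(self):
        self.counts = {"ham": Counter(), "spam": Counter()}
        self.msgs = {"ham": 0, "spam": 0}

    def train(self, text, label):
        self.counts[label].update(set(text.lower().split()))
        self.msgs[label] += 1

    def score(self, text):
        log_odds = 0.0
        for tok in set(text.lower().split()):
            p_spam = (self.counts["spam"][tok] + 1) / (self.msgs["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.msgs["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


def train_on_error(filt, messages, spam_cutoff=0.6, ham_cutoff=0.4):
    """Register only messages the filter does not yet rate correctly.

    messages is a list of (text, true_label) pairs. Each message is
    scored against the database as it stands at that moment, so an
    earlier registration can make a later one unnecessary.
    """
    registered = []
    for text, label in messages:
        s = filt.score(text)
        rated_ok = (label == "spam" and s >= spam_cutoff) or (
            label == "ham" and s <= ham_cutoff
        )
        if not rated_ok:
            filt.train(text, label)
            registered.append((text, label))
    return registered


batch = [
    ("cheap pills offer", "spam"),
    ("cheap pills offer today", "spam"),
    ("cheap meeting offer", "ham"),
    ("meeting offer tomorrow", "ham"),
]
filt = TinyFilter()
registered = train_on_error(filt, batch)
print(f"registered {len(registered)} of {len(batch)} messages")
```

In this made-up batch the second and fourth messages are already
rated correctly by the time they are reached, so only the first and
third end up in the database.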

pi



