Method of training

jxz jxz at uol.com.br
Tue Sep 9 03:50:21 CEST 2003


| jxz <jxz at uol.com.br> wrote:
| 
| >Afterwards, I would classify the messages from this temp mbox manually,
| >send them to the train.ham.35 and train.spam.35 mboxes, and train bogofilter.
| >
| >With this method, I would only train bogofilter on its errors, and
| >there would be no need to save the whole cruft of spam, only the unsures,
| >in case of db corruption or db schema updates.
| 
| I would not suggest this. Every time you train with new
| messages, the rating of all previously seen messages
| changes. I described an example where this can happen in
| an unexpected direction. So what does that mean for you? If you
| save only those messages which have been unsure (or
| failures) when they were seen for the first time, you will
| lose significant information when you retrain. So my advice
| is to keep all those messages.

These messages that are being saved (the unsures and the false
negatives and false positives) are the _only_ messages registered in
the bogofilter database. If I erase the database and reclassify those
messages, the database and the accuracy will be the same as before the
deletion. In other words, backing up those messages or a dump of
wordlist.db amounts to the same thing. So I did not understand your point :(
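
Put differently, if I ever lose the database I can rebuild it from
those two mboxes. A rough sketch of what I mean (assuming a bogofilter
that accepts -M for mbox input; otherwise the messages can be fed one
by one, e.g. with formail -s):

    # rebuild the wordlist from the saved training mboxes
    rm -f ~/.bogofilter/wordlist.db
    bogofilter -s -M < train.spam.35    # register the saved spams
    bogofilter -n -M < train.ham.35     # register the saved hams

    # or keep a plain-text dump around instead of (or besides) the mboxes
    bogoutil -d ~/.bogofilter/wordlist.db > wordlist.dump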

I'm tired of maintaining tons of emails and spams and doing backups of
large Bayesian classifier databases, so I'm trying to do as
fine-grained a train-on-error as possible. My initial idea was killed
by experience: I took all the email of the last 2 months and
put it all in one mbox, sorted by date. In the beginning, I was
classifying the messages by hand _one by one_, and after ~ 6 weeks,
the number of unsures had dropped and I could train on blocks of 5, 10,
50 emails and classify only the unsures. Now, from a corpus of ~ 2000
emails, my wordlist.db contains 146 spams and 77 hams.
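
The block training is basically this (a rough sketch; block.mbox and
unsure.mbox are just example names, and it assumes spam_cutoff and
ham_cutoff are set so that bogofilter exits with status 2 for unsures):

    # score each message of the block; keep only the unsures for hand review
    formail -s sh -c '
        t=$(mktemp) && cat > "$t"
        bogofilter < "$t"                     # exit 0 = spam, 1 = ham, 2 = unsure
        [ $? -eq 2 ] && formail < "$t" >> unsure.mbox
        rm -f "$t"
    ' < block.mbox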

When I fetch my emails via POP3 (in batches of ~40), the spams are
saved in the spam mbox (in the future, spams classified as 1.00 will
be erased and I will only save their headers, via procmail, for
statistical purposes). And the few errors and unsures (which mutt
shows me in different colors) are classified in bogofilter via key
macros, and a copy is sent to the spam and ham unsures backup.
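
Roughly, the plan on the procmail side looks like this (the exact
X-Bogosity text depends on the bogofilter version, and the file names
are only examples):

    # score every incoming message in passthrough mode
    :0fw
    | bogofilter -p -e

    # future plan: keep only the headers of sure spam (spamicity 1.00)
    :0 h:
    * ^X-Bogosity:.*spamicity=1\.00
    spam-headers

    # the rest of the spam still goes to the spam mbox
    :0:
    * ^X-Bogosity: (Yes|Spam)
    spam

And the mutt macros are something like this (piping the message to
bogofilter and saving a copy to the backup mboxes):

    macro index S "<pipe-message>bogofilter -s<enter><save-message>=train.spam.35<enter>"
    macro index H "<pipe-message>bogofilter -n<enter><save-message>=train.ham.35<enter>"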

I have been doing this for a few days, and it's much too soon to come
to a conclusion. But the idea is: have as small a database as possible,
with only the errors (and yes, carefully inspected by me, because some
unsure messages that are not spam tend to come only once, so
there is no need to register them in the db).

This is a test that will take some months, until I can really trust
bogofilter, forget those %*@##%%#& spams, and stop playing with the
filters.

My only concern is that the database will become old, and purging
terms will hurt the accuracy a lot.

| >Now I ask: does the train-on-error method work well?
| 
| It works excellently. And you can even do it with fewer messages,
| since you avoid adding messages which, at the time of adding,
| would already be rated correctly. See the FAQ for details on the
| training methods.

Yes. When I have, say, 5 unsure messages after fetching, I register
the first one, then run bogofilter -v on the next unsure to see
whether that last registration already fixed this message; sometimes
it no longer needs to be registered. And so on for the rest of the
unsure messages.
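
In script form the idea is basically this (a rough sketch; it assumes
the reviewed unsure spams were saved one per file under unsure-spam/,
which is just an example path):

    for msg in unsure-spam/*; do
        if ! bogofilter -v < "$msg"; then   # exit 0 = spam; ham/unsure means it still scores wrong
            bogofilter -s < "$msg"          # so register this one as spam
        fi
    done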

Of course the classifications of the other, older messages will change,
but what is important is the classification of the _new_ messages.

All the best!


-- 
jxz at uol.com.br
http://jxz.dontexist.org/