Training ham seems difficult

Andreas Pardeike andreas at pardeike.net
Tue Jan 13 17:47:43 CET 2004


On 2004-01-13, at 16.24, Eric Wood wrote:

> Initially I had a few power users set up on imap so I could nab their ham.
> After getting bogofilter up and running, they have since switched to pop.

That's pretty much my situation. I can easily train bogofilter by hand at
first, until it starts becoming useful. That's not the problem here.

> In procmail, I have an account which gets a copy of everyone's email, which
> is mostly ham because bogofilter already nabbed the spam into a different
> box.

OK, the "mostly" in that answer is exactly what I am trying to talk about.
Assuming that some spam isn't detected, it will be trained as ham. My users
will then move it by hand into a mailbox where I can collect it and train it
as spam. But every message there was already trained as ham, so my question
is: must I "undo" the ham training on those messages together with my spam
training?

> So with two boxes growing with spam and ham I can retrain with the same
> number of messages on both sides.

Not really. If you do as described I would see the following picture:

State                 Trained as spam       Trained as ham
----------------------------------------------------------
Initial training      500                   500
(i.e. you start even)

After receiving a     500                   501
ham message

After receiving a     501                   501
detected spam msg

After receiving an    501                   502
undetected spam msg
(procmail part)

After receiving an    502                   502
undetected spam msg
(user corrects msg
as spam)

So you got 1 new ham and 2 spams, yet both counters went up by two: you
end up with one ham trained too many (that one message is trained as
both ham and spam). Statistically the two trainings might cancel each
other out, so undetected spam that is later corrected by the user will
not make your stats any better.
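
The table above can be replayed as a small counter simulation. This is only
a sketch of the bookkeeping, not of bogofilter's internals: the event names
and the replay() function are made up for illustration, and the starting
counts of 500/500 are the ones from the table.

```python
def replay(events, spam=500, ham=500):
    """Replay mail events; return the (spam, ham) training counts after each."""
    history = [(spam, ham)]
    for event in events:
        if event == "ham":
            ham += 1            # real ham, trained as ham via the procmail copy
        elif event == "detected_spam":
            spam += 1           # bogofilter caught it, trained as spam
        elif event == "undetected_spam":
            ham += 1            # missed spam lands in the "mostly ham" copy
        elif event == "user_correction":
            spam += 1           # same message, later trained as spam as well
        else:
            raise ValueError(f"unknown event: {event}")
        history.append((spam, ham))
    return history


events = ["ham", "detected_spam", "undetected_spam", "user_correction"]
history = replay(events)
for state, (spam, ham) in zip(["initial"] + events, history):
    print(f"{state:<18} spam={spam} ham={ham}")
```

Three distinct messages arrived (1 ham, 2 spam), but four trainings were
recorded, because the missed spam was counted once on each side.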

My question here is whether user-detected spam should first be
untrained as ham and then trained as spam.
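
For what it's worth, here is a minimal sketch of what "untrain, then train"
could look like when driven from a script. It assumes that the installed
bogofilter accepts -N (unregister as non-spam) together with -s (register
as spam) in one invocation, as the man page describes; check your version
before relying on it. The function names and the message path argument are
hypothetical.

```python
import subprocess


def correction_command(bogofilter="bogofilter"):
    """Build the command that moves one message from ham to spam.

    Assumption: -N undoes the earlier ham registration and -s registers
    the same message as spam.
    """
    return [bogofilter, "-N", "-s"]


def retrain_as_spam(message_path, bogofilter="bogofilter"):
    """Feed a single message file (hypothetical path) to bogofilter on stdin."""
    with open(message_path, "rb") as message:
        return subprocess.run(
            correction_command(bogofilter), stdin=message, check=True
        )
```

Running retrain_as_spam() once per message in the user-corrected mailbox
would take each one out of the ham counts before adding it to the spam
counts, so it is no longer trained on both sides.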

Or am I missing the whole point here?

Regards,
Andreas Pardeike
-- If no symptoms manifest, does a problem exist?




