Training ham seems difficult
Andreas Pardeike
andreas at pardeike.net
Tue Jan 13 17:47:43 CET 2004
On 2004-01-13, at 16.24, Eric Wood wrote:
> Initially I had a few power users set up on imap so I could nab their ham.
> After getting bogofilter up and running, they have since switched to pop.
That's pretty much my situation. I can easily train bogofilter by hand
the first time, until it starts becoming useful. That's not the problem
here.
> In procmail, I have an account which gets a copy of everyone's email,
> which is mostly ham because bogofilter already nabbed the spam into a
> different box.
OK, the "mostly" in that answer is exactly what I am trying to talk
about. Assuming that some spam isn't detected, it will be trained as ham,
but my users will move it by hand into a mailbox where I can collect it
and then train it as spam. But every message there was already trained
as ham, so my question is: must I "undo" the ham training on those
messages together with my spam training?
> So with two boxes growing with spam and ham I can retrain with the
> same number of messages on both sides.
Not really. If you do as described I would see the following picture:

State                             Trained as spam   Trained as ham
------------------------------------------------------------------
Initial training                        500               500
  (i.e. you start even)
After receiving a ham message           500               501
After receiving a detected              501               501
  spam msg
After receiving an undetected           501               502
  spam msg (procmail part)
After the user corrects that            502               502
  undetected spam msg as spam
So you got 1 new ham and 2 spams, yet you end up with one ham
trained too many (that one message is trained as both ham and spam).
Statistically the two trainings might cancel each other out, so any
undetected spam that is later corrected by the user will not
make your stats better.
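
To make the bookkeeping concrete, here is a minimal Python sketch that
replays the events from the table. The names Counts and replay are made
up for illustration, and the sketch only tallies registrations; it does
not model bogofilter's actual token statistics.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    spam: int = 500  # messages registered as spam after initial training
    ham: int = 500   # messages registered as ham after initial training


def replay(untrain_first: bool) -> Counts:
    """Replay the events from the table and return the final tallies."""
    c = Counts()
    c.ham += 1       # a ham message arrives and is trained as ham
    c.spam += 1      # a detected spam arrives and is trained as spam
    c.ham += 1       # an undetected spam arrives and is wrongly trained as ham
    if untrain_first:
        c.ham -= 1   # undo the wrong ham registration before correcting
    c.spam += 1      # the user's correction trains that same message as spam
    return c


if __name__ == "__main__":
    print(replay(untrain_first=False))  # Counts(spam=502, ham=502)
    print(replay(untrain_first=True))   # Counts(spam=502, ham=501)
```

Without the untraining step the tallies end at 502/502 although only one
real ham arrived; with it they end at 502 spam and 501 ham, which matches
the mail that was actually received.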
My question here is whether the user-detected spam should be
untrained as ham first and then trained as spam.
Or am I missing the whole point here?
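
The "untrain first, then train as spam" step I am asking about could
look like the sketch below. It assumes bogofilter's -N (unregister as
non-spam) and -s (register as spam) options as described in its man
page, with one message read from stdin. The helpers correction_argv and
correct_message are hypothetical names, and whether this is the right
workflow is exactly the open question.

```python
import subprocess

# Assumed bogofilter options: -S / -N unregister a message from the
# spam / non-spam list, -s / -n register it as spam / non-spam.
UNREGISTER = {"ham": "-N", "spam": "-S"}
REGISTER = {"ham": "-n", "spam": "-s"}


def correction_argv(was_trained_as: str, should_be: str) -> list[str]:
    """Build a command line that moves one message from one list to the other."""
    if was_trained_as == should_be:
        raise ValueError("nothing to correct")
    return ["bogofilter", UNREGISTER[was_trained_as], REGISTER[should_be]]


def correct_message(path: str) -> None:
    """Hypothetical helper: re-train one user-corrected spam message."""
    with open(path, "rb") as msg:
        # The message is fed on stdin; check=True raises if bogofilter fails.
        subprocess.run(correction_argv("ham", "spam"), stdin=msg, check=True)
```

For a message sitting in the users' "missed spam" mailbox,
correction_argv("ham", "spam") yields the argument list
["bogofilter", "-N", "-s"].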
Regards,
Andreas Pardeike
-- If no symptoms manifest, does a problem exist?