New script to train bogofilter

Greg Louis glouis at dynamicro.on.ca
Fri Jul 4 01:11:25 CEST 2003


On 20030702 (Wed) at 0913:13 +0200, Boris 'pi' Piwinger wrote:
> Boris 'pi' Piwinger wrote:
> 
> > I wrote a perl script which trains bogofilter on error. It
> > produces very small databases. We'll have to see how well
> > that works. Any comments are warmly welcome.
> 
> I reran my script until I got no errors. It was still
> extremely small: 352 spam and 291 ham
> 
> Then I started to use it. This is 24 hours ago now. I just
> had one false negative (with over 100 spam messages
> correctly classified) and no false positive.
> 
> So my first estimate: This works perfectly, we need far
> fewer messages in the database than we thought before. There
> seems to be no practical reason to avoid multiple
> classification of the same message.

I guarantee you will be less happy a month from now.  This technique
(of training till the errors disappear) is totally bogus, statistically
invalid, and will lead to rotten performance down the road.  I've done
this once or twice myself, with false positives that really seemed to
need correcting, and it is a short-term gain for long-term pain.  I don't
have time to write a detailed explanation of why this is so, but I will
try to do so within a few days.  This is about not overworking the
data, an area of statistics that is not at all intuitive but
nonetheless critical.
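
As a toy illustration of the concern (a sketch only, not bogofilter's
algorithm): the Python below assumes a made-up scorer that sums integer
per-token weights, and hypothetical names is_spam, train_on_error, TRAIN
and HELD_OUT.  It repeats train-on-error passes until a pass produces no
errors, then counts errors on the training messages and on two messages
the loop never saw.

```python
from collections import defaultdict

def is_spam(weights, tokens):
    # Toy score (assumption): sum of per-token weights; positive means spam.
    return sum(weights[t] for t in tokens) > 0

def train_on_error(messages, max_passes=100):
    # Register a message only when it is currently misclassified, and
    # repeat whole passes until one pass produces no errors.
    weights = defaultdict(int)
    for passes in range(1, max_passes + 1):
        errors = 0
        for tokens, spam in messages:
            if is_spam(weights, tokens) != spam:
                errors += 1
                for t in tokens:
                    weights[t] += 1 if spam else -1
        if errors == 0:
            return weights, passes
    return weights, max_passes  # gave up without a clean pass

# Hypothetical messages as (tokens, is_spam) pairs.
TRAIN = [
    (["cheap", "pills", "now"], True),
    (["meeting", "agenda", "now"], False),
    (["cheap", "flights", "agenda"], False),
    (["pills", "offer", "meeting"], True),
]
HELD_OUT = [
    (["cheap", "flights", "now"], False),
    (["offer", "pills"], True),
]

weights, passes = train_on_error(TRAIN)
train_errors = sum(is_spam(weights, t) != s for t, s in TRAIN)
held_out_errors = sum(is_spam(weights, t) != s for t, s in HELD_OUT)
print(passes, train_errors, held_out_errors)  # prints: 2 0 1
```

The zero on the training messages is guaranteed by the stopping rule, so
it is not evidence about future mail; only messages the loop never saw
can give that.  How badly a real bogofilter database is affected is not
something this toy can show.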

Sorry you had trouble with bogotune.  I can't take time now to look at
that either.  Just back from a trip and I have to catch up with the
stuff I get paid for.

-- 
| G r e g  L o u i s          | gpg public key: finger     |
|   http://www.bgl.nu/~glouis |   glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |

Header information for this message:
Subject: Re: New script to train bogofilter
     To: bogofilter at aotto.com
   From: Greg Louis <glouis at dynamicro.on.ca>