garbage removal and 'outsiders noise'

Greg Louis glouis at dynamicro.on.ca
Fri Apr 18 13:00:42 CEST 2003


On 20030417 (Thu) at 2159:06 -0400, Jim Correia wrote:
> On Thursday, April 17, 2003, at 07:43  PM, Greg Louis wrote:
> 
> >I don't run with -u, but train manually: copy all mail to a single mbox
> >file, and periodically use bogofilter to break it in 3: spam, nonspam,
> >unsure.
> 
> Is there a reason you do it this way?
> 
Short answer: http://www.bgl.nu/training2.html

Longer answer: I consider -u harmful in that it introduces cumulative
errors into the training database: every message bogofilter
misclassifies gets registered under the wrong label.  One must find and
correct these manually at regular intervals, or discrimination keeps
degrading.  Between manual fixes it degrades as well, though that may
not be a very large effect.

The method I described has the advantage that, human fallibility
excepted, no wrong classifications enter the training database.  As
for training on unsures and errors only, that has been shown (once the
training db reaches a reasonable size, around 5000 each of spam and
nonspam messages) to be as effective as, or only marginally less
effective than, training on every message (which amounts to telling
bogofilter not only what it needs to learn, but also what it already
knows).
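
For anyone who wants to try the split step, here is a rough sketch in
Python, not the script I actually use.  It assumes bogofilter is on the
PATH and that its exit codes are 0 for spam, 1 for nonspam and 2 for
unsure, as the man page describes; check that against your installed
version.  The names split_mbox and bogofilter_classify and the output
file names are made up for illustration.

```python
import mailbox
import os
import subprocess

# Exit codes as described in the bogofilter man page (an assumption about
# the installed version): 0 = spam, 1 = nonspam, 2 = unsure.
EXIT_TO_LABEL = {0: "spam", 1: "nonspam", 2: "unsure"}


def bogofilter_classify(raw_message: bytes) -> str:
    """Classify one message by piping it to bogofilter (no -u, so nothing is trained)."""
    result = subprocess.run(["bogofilter"], input=raw_message)
    # Any other exit code (for example an I/O error) is filed as unsure,
    # so that a human looks at the message.
    return EXIT_TO_LABEL.get(result.returncode, "unsure")


def split_mbox(src_path, out_dir, classify=bogofilter_classify):
    """Copy every message in src_path into spam, nonspam or unsure mbox files."""
    os.makedirs(out_dir, exist_ok=True)
    outputs = {
        label: mailbox.mbox(os.path.join(out_dir, label + ".mbox"))
        for label in ("spam", "nonspam", "unsure")
    }
    counts = {label: 0 for label in outputs}
    # create=False: fail loudly if the source mbox does not exist.
    source = mailbox.mbox(src_path, create=False)
    try:
        for message in source:
            label = classify(message.as_bytes())
            outputs[label].add(message)
            counts[label] += 1
    finally:
        source.close()
        for box in outputs.values():
            box.flush()
            box.close()
    return counts
```

Nothing is registered at this stage.  After going through unsure.mbox
by hand and pulling any mistakes out of the other two files, only those
messages would be fed back with bogofilter's -s (spam) and -n (nonspam)
options.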

-- 
| G r e g  L o u i s          | gpg public key: finger     |
|   http://www.bgl.nu/~glouis |   glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |




More information about the Bogofilter mailing list