breaking the training db

Sun Sep 21 18:42:21 CEST 2003

On 20030921 (Sun) at 0941:37 -0400, David Relson wrote:

> You raise an interesting point.  I'm one of those "-u'sers" you refer to
> and I've not seen the problem you refer to.  Possibly it's due to my
> having a more comprehensive set of tokens in my wordlist because I _do_
> use '-u'.  Could this be an indicator of the weak point of
> train-on-error?

Train-on-error is definitely vulnerable if the user does it from
scratch, yes.  If, as I am continually recommending, people train on
all of the first ten thousand spams and nonspams and then switch to
on-error, it's not an issue.  (My wordlist.db has 968,643 tokens in it
not counting the .ROBX and such.)
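
A rough sketch of that policy, for illustration only.  ToyFilter,
process() and the crude token-count scoring are hypothetical stand-ins,
not bogofilter's code or algorithm, and I am assuming the ten thousand
applies to each class separately:

```python
from collections import Counter

# Assumption: the threshold is counted per class (spam and nonspam).
FULL_TRAINING_TARGET = 10_000


class ToyFilter:
    """Hypothetical stand-in for a token-counting spam filter."""

    def __init__(self):
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.trained = {"spam": 0, "ham": 0}

    def classify(self, text):
        # Crude score: which class has seen these tokens more often.
        words = text.lower().split()
        spam = sum(self.tokens["spam"][w] for w in words)
        ham = sum(self.tokens["ham"][w] for w in words)
        if spam > ham:
            return "spam"
        if ham > spam:
            return "ham"
        return "unsure"

    def train(self, text, label):
        self.tokens[label].update(text.lower().split())
        self.trained[label] += 1


def process(filt, text, label, target=FULL_TRAINING_TARGET):
    """Train on every message until `target` messages of this class
    have been trained, then only on errors (including "unsure").
    Returns True if the message was used for training."""
    verdict = filt.classify(text)
    in_full_phase = filt.trained[label] < target
    if in_full_phase or verdict != label:
        filt.train(text, label)
        return True
    return False
```

Starting with target=0 would be train-on-error from scratch, which is
the vulnerable case: the wordlist only ever learns from mistakes.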

> The change I _have_ seen was the 16 messages in my Spam-Unsure folder at
> 04:30 Friday morning.  Nearly all of those were the latest Microsoft
> worm.  Having trained on them as spam, bogofilter is improving in its
> recognition of them.  FWIW, I've seen more of this worm (approx 300)
> than of any other worm _ever_. 

I catch those in the malware filter that precedes the spam filter;
bogofilter is, for me, quite effective in dealing with the few that
elude the malware filter.  The 29 fp I got with the new parsing were
all short messages from lists that permit nonmembers to post; real spam
does appear there, and the list's header info does get counted among spam
tokens.

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |