breaking the training db
Greg Louis
glouis at dynamicro.on.ca
Sun Sep 21 18:42:21 CEST 2003
On 20030921 (Sun) at 0941:37 -0400, David Relson wrote:
> You raise an interesting point. I'm one of those "-u'sers" you refer to
> and I've not seen the problem you refer to. Possibly it's due to my
> having a more comprehensive set of tokens in my wordlist because I _do_
> use '-u'. Could this be an indicator of the weak point of
> train-on-error?
Train-on-error is definitely vulnerable if the user does it from
scratch, yes. If, as I am continually recommending, people train on
all of the first ten thousand spams and nonspams and then switch to
on-error, it's not an issue. (My wordlist.db has 968,643 tokens in it
not counting the .ROBX and such.)
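
A minimal sketch of that schedule in Python, for illustration only. It is
not bogofilter code: ToyFilter is a hypothetical stand-in classifier,
train_schedule and its warmup parameter are names invented here, and
counting the warm-up over all messages (rather than per class) is an
assumption.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


class ToyFilter:
    """Hypothetical stand-in for a real filter: raw token counts only."""

    def __init__(self) -> None:
        self.spam: Counter = Counter()
        self.ham: Counter = Counter()

    def train(self, text: str, is_spam: bool) -> None:
        # Register every token of the message under the given class.
        target = self.spam if is_spam else self.ham
        target.update(text.lower().split())

    def classify(self, text: str) -> bool:
        # True means "spam"; ties (including unseen tokens) fall to nonspam.
        tokens = text.lower().split()
        spam_score = sum(self.spam[t] for t in tokens)
        ham_score = sum(self.ham[t] for t in tokens)
        return spam_score > ham_score


def train_schedule(
    messages: Iterable[Tuple[str, bool]],
    classify: Callable[[str], bool],
    train: Callable[[str, bool], None],
    warmup: int = 10_000,
) -> int:
    """Train on everything during warm-up, then only on errors.

    Returns the number of messages actually used for training.
    """
    trained = 0
    for seen, (text, is_spam) in enumerate(messages):
        # Full training for the first `warmup` messages; afterwards
        # train only when the filter's verdict disagrees with the label.
        if seen < warmup or classify(text) != is_spam:
            train(text, is_spam)
            trained += 1
    return trained
```

With warmup=0 this degenerates to train-on-error from scratch, the
vulnerable case: the wordlist then holds only tokens from messages that
were misclassified.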
> The change I _have_ seen was the 16 messages in my Spam-Unsure folder at
> 04:30 Friday morning. Nearly all of those were the latest Microsoft
> worm. Having trained on them as spam, bogofilter is improving in its
> recognition of them. FWIW, I've seen more of this worm (approx 300)
> than of any other worm _ever_.
I catch those in the malware filter that precedes the spam filter;
bogofilter is, for me, quite effective in dealing with the few that
elude the malware filter. The 29 false positives I got with the new
parsing were all short messages from lists that permit nonmembers to
post; real spam does appear there, and the list's header info does get
counted among spam tokens.
--
| G r e g L o u i s | gpg public key: 0x400B1AA86D9E3E64 |
| http://www.bgl.nu/~glouis | (on my website or any keyserver) |
| http://wecanstopspam.org in signatures helps fight junk email. |