Unsures [was: How to deal with extremely high spam levels]
David Relson
relson at osagesoftware.com
Wed Jun 23 15:02:50 CEST 2004
On 23 Jun 2004 08:41:41 -0400
Tom Anderson wrote:
...[snip]...
> The load is minimal. Usually it takes under 1s to process an email.
> I just watched some email coming in using "top"... I saw procmail for
> a split second, and spamitarium didn't even register... it was either
> too fast or too far down the list (sorted by CPU load, 1s intervals).
> And I'm running on a K6. Dude, 1000 emails is nothing, and C isn't
> necessarily faster than Perl. On a Linux system, a great deal of
> things are running on Perl. I just ran spamitarium 1000 times, and it
> (plus the bash loop) used 71.6 cpu seconds, on a K6.
Out of curiosity, how long does bogofilter take for the same set of
messages?
> > I've been doing that. Most of my "unsure" spam is still scoring
> > very near 0.5.
>
> Doing exhaustive training should move hams and spams out away from
> 0.5.
In this detail, we differ. Bogotune found that the best parameters for
my mail were:
robx=0.549138
robs=0.0178
min_dev=0.435
ham_cutoff=0.376
spam_cutoff=0.501
My wordlist currently has 1,379,082 tokens, 62,983 spam messages, and
75,386 ham messages.
With these values, my Unsures are still clustered around 0.500.
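To show why a score of 0.500 lands in Unsure with these cutoffs, here is a minimal sketch of the three-way split. It assumes the usual reading of the two parameters (ham at or below ham_cutoff, spam at or above spam_cutoff, Unsure in between). The function name classify is made up for illustration and is not bogofilter code.

```python
# Cutoffs taken from the bogotune results quoted above.
HAM_CUTOFF = 0.376
SPAM_CUTOFF = 0.501


def classify(score, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Map a spamicity score in [0, 1] to Ham, Unsure, or Spam.

    Assumption: the boundaries are inclusive on both sides.
    """
    if score >= spam_cutoff:
        return "Spam"
    if score <= ham_cutoff:
        return "Ham"
    return "Unsure"


if __name__ == "__main__":
    for s in (0.0, 0.376, 0.5, 0.501, 1.0):
        print(f"{s:.6f} -> {classify(s)}")
```

With spam_cutoff=0.501, a message scoring 0.500 misses the spam class by a hair, which is consistent with Unsures piling up right at that value.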
> > Haven't done *any* initial training. Just training on error. Like
> > I said, I had an unfortunate accident which wiped out my email spool
> > (and my carefully trained bogofilter database) and I'm having to
> > start over from scratch.
>
> I didn't do any initial training either. I use -u and register every
> error. Works fine.
Ditto. '-u' works fine here and I deal with every unsure and every
error.
The Unsures I hate are the mailing lists' requests for moderator action
on obvious spam. Those messages have list headers (many hammish tokens)
and spam content. As I _do_ need to see those messages, I don't want
them classified as spam. I also don't really consider them to be ham.
I'm inclined to make an exception and skip training with them.
Note: I noticed that a lot of my messages were obviously ham (scoring
0.000000) or spam (scoring 1.000000). I also noticed that my wordlist
grows quite rapidly with '-u'. I decided to see what happens if '-u'
_didn't_ update when the scores were obvious, and implemented a
configuration parameter "thresh_update", i.e. update threshold, to
control the behavior.
Result: with "thresh_update=0.01", there has been no noticeable effect
on accuracy and the wordlist grows much less rapidly.
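
A rough sketch of the idea, not the actual bogofilter code: the update is skipped when the score is within thresh_update of 0.0 or 1.0. The function name should_update and the exact boundary handling are assumptions made for illustration.

```python
def should_update(score, thresh_update=0.01):
    """Return True if an auto-update ('-u') should register this message.

    Assumption: a score within thresh_update of 0.0 or 1.0 counts as
    "obvious" and is skipped; a threshold of 0 or less disables the
    check, so every message is registered.
    """
    if thresh_update <= 0.0:
        return True
    obvious_ham = score <= thresh_update
    obvious_spam = score >= 1.0 - thresh_update
    return not (obvious_ham or obvious_spam)


if __name__ == "__main__":
    for s in (0.0, 0.005, 0.5, 0.995, 1.0):
        action = "update" if should_update(s) else "skip"
        print(f"{s:.6f} -> {action}")
```

Messages scoring 0.000000 or 1.000000 add nothing new to the wordlist under this rule, while everything nearer the cutoffs is still registered.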
Regards,
David