Unsures [was: How to deal with extremely high spam levels]

Wed Jun 23 15:02:50 CEST 2004

On 23 Jun 2004 08:41:41 -0400
Tom Anderson wrote:

...[snip]...

> The load is minimal.  Usually it takes under 1s to process an email. 
> I just watched some email coming in using "top"... I saw procmail for
> a split second, and spamitarium didn't even register... it was either
> too fast or too far down the list (sorted by CPU load, 1s intervals). 
> And I'm running on a K6.  Dude, 1000 emails is nothing, and C isn't
> necessarily faster than Perl.  On a Linux system, a great deal of
> things are running on Perl.  I just ran spamitarium 1000 times, and it
> (plus the bash loop) used 71.6 cpu seconds, on a K6.

Out of curiosity, how long does bogofilter take for the same set of
messages?

> > I've been doing that.  Most of my "unsure" spam is still scoring
> > very near 0.5.
> 
> Doing exhaustive training should move hams and spams out away from
> 0.5.

In this detail, we differ.  Bogotune found that the best parameters for
my mail were:

  robx=0.549138
  robs=0.0178
  min_dev=0.435

  ham_cutoff=0.376
  spam_cutoff=0.501

My wordlist currently has 1,379,082 tokens, 62,983 spam messages, and
75,386 ham messages.

With these values, my Unsures are still clustered around 0.500.
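
To make it concrete why a score near 0.500 lands in Unsure with these
cutoffs, here is a rough sketch of the three-way split.  It is
illustrative Python, not bogofilter's actual code; the function name
and the handling of scores exactly at a cutoff are my assumptions.

```python
# Illustrative sketch only, not bogofilter's implementation.
# The cutoff values are the bogotune results quoted above.
HAM_CUTOFF = 0.376
SPAM_CUTOFF = 0.501


def classify(score, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Map a spamicity score in [0, 1] to 'ham', 'unsure', or 'spam'.

    Assumption: a score exactly at a cutoff counts as ham or spam
    rather than unsure.
    """
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"


if __name__ == "__main__":
    for s in (0.0, 0.376, 0.5, 0.501, 1.0):
        print(f"{s:.3f} -> {classify(s)}")
```

The Unsure band here is only 0.125 wide, and 0.500 sits just under its
upper edge, so a message scoring 0.500 is Unsure while one scoring
0.501 is spam.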

> > Haven't done *any* initial training.  Just training on error.  Like
> > I said, I had an unfortunate accident which wiped out my email spool
> > (and my carefully trained bogofilter database) and I'm having to
> > start over from scratch.
> 
> I didn't do any initial training either.  I use -u and register every
> error.  Works fine.

Ditto.  '-u' works fine here and I deal with every unsure and every
error.  

The Unsures I hate are the mailing lists' requests for moderator action
on obvious spam.  Those messages have list headers (many hammish tokens)
and spam content.  As I _do_ need to see those messages, I don't want
them classified as spam.  I also don't really consider them to be ham. 
I'm inclined to make an exception and skip training with them.

Note:  I noticed that a lot of my messages were obviously ham
(scoring 0.000000) or spam (1.000000).  I also noticed that my wordlist
grows quite rapidly with "-u".  I decided to see what happens if '-u'
_didn't_ update when the scores were obvious, and implemented a
configuration parameter "thresh_update", i.e. update threshold, to
control the behavior.  

Result:  with "thresh_update=0.01", there has been no noticeable effect
on accuracy and the wordlist grows much less rapidly.
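
The idea behind the update threshold can be sketched roughly like
this.  It is illustrative Python rather than the actual change; the
function name and the treatment of scores exactly at the threshold are
assumptions.

```python
# Illustrative sketch only, not the actual bogofilter change.
def should_update(score, thresh_update=0.01):
    """Decide whether '-u' should register a message in the wordlist.

    Messages scoring within thresh_update of 0.0 or 1.0 are treated
    as obvious and skipped; everything in between is registered.
    Assumption: a score exactly at the threshold is still registered.
    """
    return thresh_update <= score <= 1.0 - thresh_update


if __name__ == "__main__":
    for s in (0.000000, 0.005, 0.5, 0.995, 1.000000):
        action = "update" if should_update(s) else "skip"
        print(f"{s:.6f} -> {action}")
```

With thresh_update=0.01, the obvious 0.000000 and 1.000000 messages
are skipped, while anything near 0.5 still updates the wordlist.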

Regards,

David