Unsures [was: How to deal with extremely high spam levels]

Tom Anderson tanderso at oac-design.com
Thu Jun 24 14:13:43 CEST 2004


On Wed, 2004-06-23 at 09:02, David Relson wrote:
> > things are running on Perl.  I just ran spamitarium 1000 times, and it
> > (plus the bash loop) used 71.6 cpu seconds, on a K6.
> 
> Out of curiosity, how long does bogofilter take for the same set of
> messages?

Alone, "bogofilter -v" on the same message as above took 63.2 cpu
seconds for 1000 intervals.  Together, piping the output of spamitarium
to bogofilter took 110.7 cpu seconds, so spamitarium actually reduces
the processing time of bogofilter by roughly one-third, the combined
time being only 47.5 seconds longer than bogofilter alone.  I haven't
done any optimizations to spamitarium yet, though a few changes in
regexes may cut several seconds from that time as well... a future
project perhaps.  In any event, running spamitarium isn't going to
produce any undue stress on anybody's mail server when the combined
processing time for both is only 0.11s/email on a K6.
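
For anyone who wants to reproduce the numbers, a bash loop along these
lines will do (the message filename and spamitarium path here are
hypothetical; adjust for your setup):

    # time 1000 runs of spamitarium alone
    time ( for i in $(seq 1 1000); do
        ./spamitarium < test.msg > /dev/null
    done )

    # time 1000 runs of the spamitarium -> bogofilter pipeline
    time ( for i in $(seq 1 1000); do
        ./spamitarium < test.msg | bogofilter -v > /dev/null
    done )

Add the reported "user" and "sys" figures to get the cpu seconds
quoted above.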

> > > I've been doing that.  Most of my "unsure" spam is still scoring
> > > very near 0.5.
> > 
> > Doing exhaustive training should move hams and spams out away from
> > 0.5.
> 
> In this detail, we differ.  Bogotune found that the best parameters for
> my mail were:
> 
>   robx=0.549138
>   robs=0.0178
>   min_dev=0.435		
> 
>   ham_cutoff=0.376	
>   spam_cutoff=0.501	
> 
> My wordlist currently has 1,379,082 tokens, 62,983 spam messages, and
> 75,386 ham messages.
> 
> With these values, my Unsures are still clustered around 0.500

My point was that exhaustive training will make more of the tokens
meaningful, and thus move a lot more messages out of the unsure range. 
I can't see anyone arguing that repeatedly registering a single message
won't make its tokens more significant in later scoring.  Since the
question here is what to do when you have very little ham, my answer is
that in order to keep ham from scoring neutrally, you should strongly
bias hammy tokens with exhaustive training.
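
To make "exhaustive" concrete: keep registering the message until the
classifier actually gets it right.  A minimal sketch, assuming
bogofilter's usual exit codes (0 = spam, 1 = ham, 2 = unsure, the
latter requiring a nonzero ham_cutoff) and a hypothetical unsure.msg:

    # re-register a neutral-scoring ham until bogofilter calls it ham,
    # capped so a hopeless message can't loop forever
    for i in $(seq 1 10); do
        bogofilter < unsure.msg
        [ $? -eq 1 ] && break        # status 1 means "ham"; we're done
        bogofilter -n < unsure.msg   # still spam/unsure: register as ham
    done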

> Note:  I noticed that a lot of my messages scored as obvious ham
> (0.000000) or obvious spam (1.000000).  I also noticed that my wordlist
> grows quite rapidly with "-u".  I decided to see what happens if '-u'
> _didn't_ update when the scores were obvious, and implemented a
> configuration parameter "thresh_update", i.e. update threshold, to
> control the behavior.  
> 
> Result:  with "thresh_update=0.01", there has been no noticeable effect
> on accuracy and the wordlist grows much less rapidly.
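
If I follow, the idea amounts to a guard like this around the update
step (my shell emulation, not David's code; the filename is
hypothetical, the spamicity parsing is approximate, and this scores
the message twice where bogofilter proper would use a single pass):

    # skip the -u wordlist update when the score is already obvious
    score=$(bogofilter -v < msg.txt |
            sed 's/.*spamicity=\([0-9.]*\).*/\1/')
    if [ "$(echo "$score > 0.01 && $score < 0.99" | bc)" -eq 1 ]; then
        bogofilter -u < msg.txt > /dev/null
    fi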

While I'm sure this feature is great, I haven't upgraded to that
version yet.  Even so, I find that my wordlist is growing more slowly
over time.  It's currently 37M; two months ago it was 30M.  I started
this wordlist about 8 months ago, so the average growth has been
around 4.6M/month, while the current growth is around 3.5M/month.
That's a deceleration of about 0.14M/month per month.  If that rate
continued, growth would reach zero in about 25 months, with the
wordlist somewhere in the 80-100M range.  I believe the slowdown comes
from the fact that most incoming tokens are already in the wordlist,
so they're mostly just getting their counts incremented.  I'd imagine
that growth will eventually become asymptotic to some upper limit,
which can be lowered by trimming hapaxes, reordering the database, and
whatnot.  With careful pruning, the upper limit may be around 50M;
with more aggressive pruning, perhaps much lower.
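
For anyone checking my arithmetic, the figures above are just:

    echo "scale=3; 37 / 8" | bc           # average growth: 4.625 M/month
    echo "scale=3; (37 - 30) / 2" | bc    # recent growth: 3.500 M/month
    echo "scale=3; (4.6 - 3.5) / 8" | bc  # deceleration: ~0.14 M/month/month
    echo "scale=1; 3.5 / 0.14" | bc       # months to zero growth: 25.0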

Tom




