Exclusion Intervals

Tom Anderson tanderso at oac-design.com
Thu Jul 1 14:03:12 CEST 2004


On Wed, 2004-06-30 at 18:00, David Relson wrote:
> Remember there have been discussions about the value of "balance", i.e.
> having reasonably balanced counts of spam and ham messages, say within a factor
> of 2 or 3.  At 28::1, you're way out of balance.

Yes, I recall.  Until recently, I noticed no effect from the imbalance. 
Or rather, the effect was visible, but not identifiable.  But how to
balance it?  This is the actual ratio of email that I receive.  I'm not
going to pad it with someone else's hams.  Is there an easy way to
expire lots of the spam tokens to decrease the spam .MSG_COUNT?  Would moving
the center point of token scoring be an effective counter-measure?
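
For reference, here is how I understand the counts entering a token's
score.  This is only a sketch of the frequency ratio, not the actual
prob.c code, and the numbers are made up for illustration:

#include <stdio.h>

/* Sketch of the Graham/Robinson-style per-token frequency ratio as I
 * understand it.  The corpus sizes enter as denominators, which is
 * why a 28::1 spam::ham imbalance shifts every token's score. */
static double token_prob(double spam_hits, double spam_msgs,
                         double ham_hits,  double ham_msgs)
{
    double badfreq  = spam_hits / spam_msgs;  /* fraction of spams with the token */
    double goodfreq = ham_hits  / ham_msgs;   /* fraction of hams with the token  */
    return badfreq / (badfreq + goodfreq);
}

int main(void)
{
    /* a hypothetical token seen 100 times in each corpus */
    printf("balanced   1000::1000  -> %.3f\n", token_prob(100, 1000,  100, 1000));
    printf("unbalanced 28000::1000 -> %.3f\n", token_prob(100, 28000, 100, 1000));
    return 0;
}

The same token drops from 0.500 to about 0.034 just because the spam
corpus is 28 times larger, which is the effect I'm trying to counter.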

> To take the flip side of the argument, if "X" occurs 88 times in your
> ham, i.e. 1%, and 1000 times in spam, i.e. 0.005%, do you really want to
> consider X as a spam indicator?  I wouldn't.

I would.  The fact that it appears over 10x more often in spam than in
ham is important.  At the very least it should be neutral, not hammy.
Just because I have lots of spam about "Y" in my wordlist doesn't mean
that "X" is less spammy.  Suppose I didn't have the spams about "Y":
suddenly "X" would jump to 10% or 50%.  Spams about "Y" should not have
this effect on the spamminess of "X".
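
To put numbers on your example (taking the quoted frequencies at face
value; the raw-count version below is what I'd find more intuitive,
not what bogofilter currently does):

#include <stdio.h>

int main(void)
{
    /* "X" in 1% of hams but only 0.005% of spams, even though the raw
     * spam count (1000) dwarfs the ham count (88). */
    double ham_hits  = 88.0,   ham_freq  = 0.01;
    double spam_hits = 1000.0, spam_freq = 0.00005;

    /* frequency ratio, roughly what the current scoring does */
    double p_freq  = spam_freq / (spam_freq + ham_freq);

    /* raw-count ratio, ignoring corpus size */
    double p_count = spam_hits / (spam_hits + ham_hits);

    printf("frequency-based: %.3f (strongly hammy)\n", p_freq);
    printf("raw-count based: %.3f (strongly spammy)\n", p_count);
    return 0;
}

That's 0.005 versus 0.919 for the same token, which is the gap we're
arguing over.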

For a more realistic example, consider the token "viagra".  If I've
received 100 spams and 50 of them were about viagra, viagra is a very
high spam indicator.  Now, let's say spammers later send me 500 spams
about "cialis", and 500 about "porn", and 500 about "software".  Now
"viagra" is no longer spammy?  How did that happen?  These other spams
have nothing to do with the spamminess of "viagra".  If I receive a spam
about viagra after this, I still want it to score as it would have
before these 1500 new spams.  Why do tokens have to vie for the top spam
spot?  I think that many tokens can be equally high spam indicators.
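
Putting numbers on that (I have to assume some small ham count for
"viagra", say 5 of 1000 hams, or the ratio never moves at all; the
exact figures are made up):

#include <stdio.h>

/* same frequency-ratio sketch as above */
static double token_prob(double spam_hits, double spam_msgs,
                         double ham_hits,  double ham_msgs)
{
    double badfreq  = spam_hits / spam_msgs;
    double goodfreq = ham_hits  / ham_msgs;
    return badfreq / (badfreq + goodfreq);
}

int main(void)
{
    double ham_hits = 5.0, ham_msgs = 1000.0;   /* assumed ham side */

    /* before: 50 of 100 spams mention viagra */
    double before = token_prob(50.0,  100.0, ham_hits, ham_msgs);

    /* after: 1500 more spams arrive, about cialis/porn/software;
     * no new evidence about "viagra", yet its score falls */
    double after  = token_prob(50.0, 1600.0, ham_hits, ham_msgs);

    printf("before the 1500 other spams: %.3f\n", before);
    printf("after  the 1500 other spams: %.3f\n", after);
    return 0;
}

"viagra" slides from about 0.990 to about 0.862 without a single new
viagra message arriving in either corpus.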

This seems to be a fundamental flaw in the scoring.  As more spam is
added to the wordlist, every existing token's spamminess is diluted.
Due to that dilution, it would be possible to end up with a wordlist in
which not a single token exceeds 0.5+min_dev.  The result is that the
cutoffs have to keep moving down to compensate.
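
If I understand min_dev correctly, a diluted token eventually drops
out of the scoring altogether once it lands inside the neutral band.
A quick sketch, with an illustrative min_dev rather than the real
default:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double min_dev = 0.1;   /* illustrative, not the actual default */

    /* the same token at increasing levels of dilution */
    double probs[] = { 0.92, 0.65, 0.57, 0.52 };

    for (int i = 0; i < 4; i++) {
        int counted = fabs(probs[i] - 0.5) >= min_dev;
        printf("p = %.2f -> %s\n", probs[i],
               counted ? "contributes to the score"
                       : "ignored (inside 0.5 +/- min_dev)");
    }
    return 0;
}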

> If you want to experiment with (or use) raw counts, without adjustment,
> file src/prob.c has function calc_prob() and that's where the relevant
> arithmetic is.
> 
> Remember bogofilter is an implementation of the algorithms in Graham and
> Robinson.  If you want other calculations, you'll need to modify it.
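
For anyone following along, my reading of the Robinson side of that
arithmetic (a rough sketch of the f(w) smoothing, not the actual
calc_prob() source) is:

#include <stdio.h>

/* Gary Robinson's f(w) as I understand it: p is the raw frequency
 * ratio, n is how many messages the token was seen in, and robs/robx
 * pull rarely seen tokens toward a neutral prior. */
static double robinson_fw(double p, double n, double robs, double robx)
{
    return (robs * robx + n * p) / (robs + n);
}

int main(void)
{
    /* robs/robx values here are illustrative, not bogofilter's defaults */
    double robs = 1.0, robx = 0.5;

    printf("rare token   (p=0.99, n=1):   f(w)=%.3f\n", robinson_fw(0.99,   1.0, robs, robx));
    printf("common token (p=0.99, n=100): f(w)=%.3f\n", robinson_fw(0.99, 100.0, robs, robx));
    return 0;
}

The smoothing helps with sparse tokens, but it doesn't address the
dilution described above, because that happens inside p itself as the
spam message count grows.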

What I want is to discuss what the appropriate solution to this problem
might be.  I'm not alone in having an unbalanced wordlist.  To some
extent, this affects almost everyone unless you are very strict about
how you train.  Anyone using autoupdate and train-on-error must contend
with this.

Tom




