Exclusion Intervals

David Relson relson at osagesoftware.com
Thu Jul 1 00:00:16 CEST 2004


On Wed, 30 Jun 2004 17:46:20 -0400
Tom Anderson wrote:

> From: "David Relson" <relson at osagesoftware.com>
> > If I have 1000 spam and 10 ham and _every_ message has "relson" in
> > it, does that mean a message with "relson" is 100 times more likely
> > to be spam than ham?  I don't think so.
> 
> No, that's true, it should be neutral.  However, if you have 1000 spam
> and 10 ham, and 90 spam messages have "relson" in them, and only 1 ham
> message has "relson" in it, does that mean "relson" is hammy?  It does
> according to bogofilter, and maybe it makes sense statistically, but
> it is intuitively false.  All it takes is a few legitimate discussions
> of otherwise spammy topics to screw up all of the spam scores. 
> Previously I hadn't noticed such an effect, but my wordlist must
> really be getting out of balance now....MSG_COUNT is showing 247017
> spam and 8841 ham... that's about 28:1.  Every token which has been
> seen in both ham and spam is automatically 28x more hammy?  This just
> doesn't make sense.

Remember there have been discussions about the value of "balance", i.e.
having reasonably balanced counts of spam and ham messages, say within a
factor of 2 or 3.  At 28:1, you're way out of balance.

To take the flip side of the argument, if "X" occurs 88 times in your
ham, i.e. 1%, and 1000 times in spam, i.e. 0.4%, do you really want to
consider X as a spam indicator?  I wouldn't.
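
To make the arithmetic concrete, here is a rough sketch in Python.  It
is not the code in bogofilter; the function names raw_ratio() and
scaled_ratio() are made up for illustration, and the scaling shown is
only the basic Graham-style idea of dividing each count by the size of
its own corpus before comparing.  The message totals are the MSG_COUNT
figures quoted above.

```python
def raw_ratio(spam_hits, ham_hits):
    """Spamminess from raw counts, ignoring how big each corpus is.

    Assumes the token was seen at least once (hypothetical helper).
    """
    return spam_hits / (spam_hits + ham_hits)


def scaled_ratio(spam_hits, ham_hits, spam_msgs, ham_msgs):
    """Spamminess after scaling each count by its corpus size.

    Assumes both corpora are non-empty and the token was seen at least
    once (hypothetical helper, not bogofilter's calc_prob()).
    """
    spam_freq = spam_hits / spam_msgs
    ham_freq = ham_hits / ham_msgs
    return spam_freq / (spam_freq + ham_freq)


# "X": 1000 hits in 247017 spam, 88 hits in 8841 ham
print(round(raw_ratio(1000, 88), 3))
print(round(scaled_ratio(1000, 88, 247017, 8841), 3))

# Token present in every message of a 1000 spam / 10 ham corpus
print(scaled_ratio(1000, 10, 1000, 10))

# 90 of 1000 spam and 1 of 10 ham contain the token
print(round(scaled_ratio(90, 1, 1000, 10), 3))
```

With raw counts "X" comes out at about 0.92, i.e. strongly spammy.  With
scaled counts it comes out at about 0.29, i.e. hammy.  The token that is
in every message scores exactly 0.5, and the 90-versus-1 case lands just
below 0.5.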

...[snip]...

> No, I'm saying that's the probability that a _token_ is spam, not the
> whole message.  Do the chi square thing with these token probabilities
> for the full email score.  I can see how that would fail too when
> every message contains a given token though.
> 
> While I see your point, do you see mine?

I think so, though who can be sure?  

If you want to experiment with (or use) raw counts, without adjustment,
the relevant arithmetic is in function calc_prob() in file src/prob.c.

Remember bogofilter is an implementation of the algorithms in Graham and
Robinson.  If you want other calculations, you'll need to modify it.
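
For reference, the Robinson adjustment pulls a token's score toward a
neutral prior when the token has been seen only a few times.  The sketch
below is in Python and is only an illustration of the published formula
f(w) = (s*x + n*p) / (s + n); it is not a copy of calc_prob().  The
function name robinson_f() and the values s=1.0 and x=0.5 are
assumptions chosen for the example, not bogofilter's settings.

```python
def robinson_f(p, n, s=1.0, x=0.5):
    """Robinson's smoothed token score (illustrative sketch).

    p -- unsmoothed spamminess of the token, between 0 and 1
    n -- number of messages in which the token has been seen
    s -- strength given to the prior (assumed value, for illustration)
    x -- score assumed for a never-seen token (assumed value)
    """
    return (s * x + n * p) / (s + n)


# A token seen once, only in ham, is not trusted very much ...
print(robinson_f(0.0, 1))

# ... but after 100 sightings the evidence dominates the prior.
print(round(robinson_f(0.0, 100), 4))

# A token never seen before gets the prior x.
print(robinson_f(0.9, 0))
```

The first call gives 0.25, the second about 0.005, and the third gives
0.5, so rare tokens stay close to neutral until more data accumulates.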

David


