Exclusion Intervals

Tom Anderson tanderso at oac-design.com
Wed Jun 30 20:35:48 CEST 2004


From: "David Relson" <relson at osagesoftware.com>
> You're forgetting the importance of .MSG_COUNT.  Suppose you have 10
> spam and 100 ham and Jan has counts of 10/10.  Since it's in 100% of the
> spam and 10% of the ham, its spam score should be up around 90%.

Ok, I wasn't aware that ALL emails were taken into account... I thought the
probability of any given token being spam was the number of spam emails out
of the total number of emails in which that token had occurred.  You're
saying that the probability is the fraction of spams containing that token
weighed against the fraction of hams containing it?  I guess that answers a
lot of my questions.

      spams w/token              [   spams w/token        hams w/token   ]
  -----------------------   vs.  |  ----------------  &  --------------  |
   all emails with token         [     all spams            all hams     ]
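
To make sure I'm following, here's a quick sketch (Python, using the counts
from your Jan example) of the two calculations as I understand them; the
second function is my reading of the .MSG_COUNT method, not necessarily
bogofilter's exact code:

    # Two ways to turn token counts into a spam probability.

    def p_token_only(spam_with_token, ham_with_token):
        # What I had assumed: spams containing the token out of all
        # messages containing the token.
        return spam_with_token / (spam_with_token + ham_with_token)

    def p_class_normalized(spam_with_token, ham_with_token,
                           total_spam, total_ham):
        # Per-class frequencies, normalized by the .MSG_COUNT totals.
        spam_freq = spam_with_token / total_spam
        ham_freq = ham_with_token / total_ham
        return spam_freq / (spam_freq + ham_freq)

    # 10 spam, 100 ham, "Jan" seen in 10 of each:
    print(p_token_only(10, 10))                 # 0.50
    print(p_class_normalized(10, 10, 10, 100))  # 0.91, "up around 90%"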

Doesn't the way I thought it was working actually produce a more accurate
value for any given token?  It seems like the .MSG_COUNT method dilutes
token scores as more emails are added.  So if I saw "viagra" just once, it
would become a weaker and weaker spam indicator the more non-viagra spams I
registered.  In reality, "viagra" may be just as spammy a year later as it
was the first time I saw it.  Having more kinds of spam shouldn't make
existing kinds of spam less spammy.  Maybe this is why terms like
"Microsoft", "Office", "Windows", etc. are so hammy even though the vast
majority of the time they show up in spam and not ham: I just receive less
ham than spam.  That doesn't seem right.
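
Here's what I mean with some made-up counts for a token like "office", say
seen in 50 spams and 5 hams (the token counts and message totals below are
invented for illustration, not pulled from my actual database):

    # Hypothetical counts: "office" seen in 50 spams and 5 hams.
    spam_with, ham_with = 50, 5

    def per_class_score(total_spam, total_ham):
        spam_freq = spam_with / total_spam
        ham_freq = ham_with / total_ham
        return spam_freq / (spam_freq + ham_freq)

    print(spam_with / (spam_with + ham_with))  # 0.91: raw 10:1 ratio looks spammy
    print(per_class_score(1000, 50))           # 0.33: per-class, it already leans ham
    print(per_class_score(2000, 50))           # 0.20: more registered spam dilutes it further

So the score drifts toward ham purely because I registered more spam that
doesn't happen to contain the token.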

> > As you can see, my 0.2 min_dev is keeping all of the tokens between
> > 0.3 and 0.7 from contributing to the final score.  However, those
> > tokens in the 0.1-0.3 range are not very hammy (eg: professional,
> > office, software, $60, etc), while the ones between 0.5 and 0.7 are
> > actually quite spammy (eg: Adobe, Photoshop, etc).  This email would
> > probably score appropriately if the min_dev range was centered between
> > my cutoffs near 0.3.

Ok, so the problem isn't that the min_dev is wrong (although changing it
still might fix the problem); it's that I've received a few hams with spammy
tokens, but the spamminess gets diluted over time by the influx of other
kinds of spam, while the hamminess sticks because I receive fewer hams than
spams.  So the more spams I register to get tokens like "software" and "$60"
into a spammy zone, the more I'll be diluting other spam tokens out of it.
This is a problem.  Since my set of spammy tokens is much larger than my set
of hammy tokens, this must be why every token in my database that is ever
used in ham ends up biased pretty strongly toward ham.

So, if this is in fact the source of the problem, it may still be solved by
changing the center of the min_dev exclusion range.  That would anti-bias
it.  Maybe that isn't the real solution, though.  The real solution would be
to use a different way of calculating the probability so that spammy tokens
cannot be usurped into contributing to a hammy score just because one gets
more spam than ham.
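
Just to be concrete about what re-centering would do (the parameter names
and the 0.65/0.25 scores below are only placeholders for illustration):

    # A token contributes to the final score only if it falls outside the
    # exclusion interval [center - magnitude, center + magnitude].
    # center = 0.5 is the current min_dev behaviour.

    def contributes(score, center=0.5, magnitude=0.2):
        return score < center - magnitude or score > center + magnitude

    # Centered at 0.5, a 0.65 token (e.g. "Adobe") is thrown away...
    print(contributes(0.65, center=0.5, magnitude=0.2))  # False
    # ...but centered near my cutoffs at 0.3, it counts, while a 0.25
    # token (e.g. "software") becomes the one excluded.
    print(contributes(0.65, center=0.3, magnitude=0.2))  # True
    print(contributes(0.25, center=0.3, magnitude=0.2))  # False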

> I'll see about adding a "excl_center" and "excl_magnitude" parameters to
> create an exclusion interval from
>
>   excl_center-excl_magnitude to excl_center+excl_magnitude

That would be nice.  But now I'm more concerned about why the probabilities
don't really reflect what's in the database.  If token X occurs in 100 spams
and only 1 ham, then X is spammy no matter how many spams are counted in my
.MSG_COUNT, even if the spam-to-ham message ratio there is 1000:1.
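
Running the numbers on that example with the per-class formula from above
(the 10-ham and 1-ham message totals are hypothetical):

    # Token X: seen in 100 spams, 1 ham.

    def p_class_normalized(spam_with_token, ham_with_token,
                           total_spam, total_ham):
        spam_freq = spam_with_token / total_spam
        ham_freq = ham_with_token / total_ham
        return spam_freq / (spam_freq + ham_freq)

    # 1000 spams and 10 hams registered: X comes out dead neutral.
    print(p_class_normalized(100, 1, 1000, 10))  # 0.50
    # At a 1000:1 spam-to-ham message ratio, X actually looks hammy.
    print(p_class_normalized(100, 1, 1000, 1))   # ~0.09

A token that the database says is 100:1 spam ends up looking neutral or even
hammy.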

Tom



