[bogofilter] ESF and redundancy
tanderso at oac-design.com
Wed May 12 08:58:05 EDT 2004
From: "David Relson" <relson at osagesoftware.com>
> The bayesian principle on which bogofilter is based, assigns scores to
> each token and computes a final score (via inverse chi-square test) from
> the number of tokens and their scores. "your document is attached" may
> have a score of 1.000000, but bayesian doesn't consider that phrase any
> more important than any other token.
Right, and this is what we're addressing now. The fact is that tokens in
emails are not independent, despite our assumptions. When there is a
relationship between nearby tokens, this information is useful in filtering.
Bogofilter is currently ignoring that.
> What counts is the preponderance of evidence. Conceivably one could
> assign a 4 word phrase an importance of 4 and use it 4 times in the
> computation. Undoubtedly there are a zillion other ways to change the
> importance of a token (or phrase).
Among independent tokens, perceived relationships should be random and
fleeting, while persistent dependence is strongly indicative of a given
message being conveyed. From what I've read of "Markovian discrimination",
a known interrelation between increasing numbers of tokens should receive
superincreasing weight. Therefore, a 4 word phrase should not only receive
4x the weight, but 2^(2*(4-1)) = 64x. This makes the individual tokens'
scores irrelevant, which is precisely what we would want.
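To spell out the arithmetic, here is a minimal sketch of that weighting. The name phrase_weight is made up for illustration, and the formula 2^(2*(n-1)) is only my reading of the Markovian scheme, not something taken from bogofilter's code.

```python
def phrase_weight(n_tokens: int) -> int:
    """Weight for a phrase of n_tokens consecutive tokens.

    Assumed formula: 2 ** (2 * (n - 1)), i.e. 1, 4, 16, 64, ...
    Each weight is larger than the sum of all smaller ones, which is
    what "superincreasing" means here.
    """
    if n_tokens < 1:
        raise ValueError("a phrase needs at least one token")
    return 2 ** (2 * (n_tokens - 1))


# Print the weight for phrases of 1 to 4 tokens.
for n in range(1, 5):
    print(n, phrase_weight(n))
```

With these weights a 4-token phrase (64) outweighs the 1-, 2- and 3-token features combined (1 + 4 + 16 = 21).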
With the weighting you suggested above (4x), a 4-word spammy phrase made up
of hammy tokens would come out as inv_chi_square(0.000000 x 4,
1.000000 x 4) = ~0.5, whereas markovian(0.000000 x 4, 1.000000 x 64) = ~1.0.
Clearly the latter is the better result for a known spammy phrase.
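The comparison can be tried out with a rough sketch like the one below. It assumes a Robinson-Fisher style chi-square combination and models a weight of N as N repetitions of the same score. The names chi2q, combine and weighted, and the clamp EPS = 1e-6, are assumptions for illustration only; none of this is bogofilter's actual code, and the weighted result depends noticeably on the clamp chosen.

```python
import math

EPS = 1e-6  # assumed clamp so that log(0) never occurs


def chi2q(x2, df):
    """Probability that a chi-square variable with even df exceeds x2."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(scores):
    """Combine per-token spam scores into one score in [0, 1]."""
    clamped = [min(max(p, EPS), 1.0 - EPS) for p in scores]
    n = len(clamped)
    spam_ev = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in clamped), 2 * n)
    ham_ev = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in clamped), 2 * n)
    return (1.0 + spam_ev - ham_ev) / 2.0


def weighted(groups):
    """groups is a list of (score, weight); weight = number of repetitions."""
    expanded = [score for score, weight in groups for _ in range(weight)]
    return combine(expanded)


# Four hammy tokens plus the spammy phrase counted 4 times, then 64 times.
even = weighted([(0.0, 4), (1.0, 4)])
markov = weighted([(0.0, 4), (1.0, 64)])
print(even, markov)
```

With equal weights the two kinds of evidence cancel and the score sits at about 0.5; with the 64x weight the phrase dominates and the score moves close to 1.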
Also consider these cases (where @ is a separator, remember):
user1@yahoo.com, user2@hotmail.com, user2@yahoo.com, user1@hotmail.com. Now
assume that the first and second addresses belong to friends who email you
often, and the third and fourth are spammers. Individually, each of the
tokens could be roughly neutral if the frequency of emails is about equal,
however the combinations of these are highly indicative of ham or spam (4x
more than a single token). Thus, although "user1" and "yahoo.com" are both
present in quite a few spams, "user1" followed by "yahoo.com" is only (or
mostly) present in hams and deserves a hammy score.
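A small sketch of this example, with made-up training data (one message per address) and a naive spam fraction instead of bogofilter's real scoring. The names training, single, pairs and spam_fraction are hypothetical.

```python
from collections import Counter

# Hypothetical training set: one message from each address.
training = [
    ("user1@yahoo.com", "ham"),
    ("user2@hotmail.com", "ham"),
    ("user2@yahoo.com", "spam"),
    ("user1@hotmail.com", "spam"),
]

single = {"ham": Counter(), "spam": Counter()}  # counts of individual tokens
pairs = {"ham": Counter(), "spam": Counter()}   # counts of (user, domain) pairs

for address, label in training:
    user, domain = address.split("@")  # "@" acts as the token separator
    single[label].update([user, domain])
    pairs[label][(user, domain)] += 1


def spam_fraction(counts, key):
    """Naive score: share of the sightings of key that were in spam."""
    spam, ham = counts["spam"][key], counts["ham"][key]
    return spam / (spam + ham)


print(spam_fraction(single, "user1"))               # 0.5, neutral
print(spam_fraction(single, "yahoo.com"))           # 0.5, neutral
print(spam_fraction(pairs, ("user1", "yahoo.com"))) # 0.0, hammy
print(spam_fraction(pairs, ("user2", "yahoo.com"))) # 1.0, spammy
```

Each single token is seen once in ham and once in spam, so it tells us nothing, while every pair falls cleanly on one side.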