[bogofilter] ESF and redundancy

Tom Anderson tanderso at oac-design.com
Thu May 13 13:35:18 CEST 2004


On Wed, 2004-05-12 at 22:30, michael at optusnet.com.au wrote:
> My point here is that the word 'document' will likely wind up
> with (say) a 0.3 ham rating as a result of appearing in both
> ham and spam. But the longer phrase is likely to appear
> in spam only and will wind up at 0.98 (say). The normal algorithm
> will place more weight on the longer phrase automagically (as it's
> a more extreme value).

Assuming you're right, and that 'your', 'is', and 'attached' also have a
score of 0.3, is a 0.98 score for the whole phrase sufficient to
outweigh the hamminess of the four component tokens?  I'd think not.  My
point is that the four-word phrase is not only a little bit more
indicative of spam, it's tremendously so.  The more words a phrase
contains, the more likely it is communicating a message rather than
occurring by random chance.  The weight of such a phrase should
therefore be exponentially higher than that of a single token or of any
shorter phrase (such as "attached document is", which may be slightly
hammy as well).
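
To put rough numbers on this, here is a minimal sketch of the
arithmetic.  It assumes a plain naive-Bayes style combination (summing
log-odds), which is NOT bogofilter's actual Robinson-Fisher
calculation, and the 2**(n-1) weighting is only a hypothetical stand-in
for "exponentially higher".  The scores are the ones from the example
above: four single tokens at 0.3 and one four-word phrase at 0.98.

```python
import math

def logit(p):
    # Log-odds of a spam probability p (0 < p < 1).
    return math.log(p / (1.0 - p))

def combine(scored_tokens, weight=lambda n_words: 1.0):
    # scored_tokens: list of (spam_probability, number_of_words).
    # Sums weighted log-odds, then maps the total back to a probability.
    # Assumption: simple naive-Bayes combining, not Robinson-Fisher.
    total = sum(weight(n) * logit(p) for p, n in scored_tokens)
    return 1.0 / (1.0 + math.exp(-total))

# Four single-word tokens at 0.3 plus one four-word phrase at 0.98.
evidence = [(0.3, 1)] * 4 + [(0.98, 4)]

plain = combine(evidence)
# Hypothetical weighting: weight doubles with each extra word.
boosted = combine(evidence, weight=lambda n: 2.0 ** (n - 1))

print(round(plain, 3))   # 0.623
print(boosted > 0.999)   # True
```

Under these assumptions the unweighted result leans only mildly toward
spam, nowhere near a confident verdict, while the weighted one is
decisive.  A different combining method would give different numbers.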

Note also that this functionality would be particularly useful against
spam messages crafted by non-native speakers, as they tend to use
improper grammar which is unlikely to appear in hams.  It would also
tend to counteract the "gibberish" spams which add spacing and other
separators where they don't belong, such that "Va.l.ium",
"Phen.ter.mine", and "Vi.agra" may actually become exponentially more
spammy rather than just a little bit more.
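
A small sketch of how such an obfuscated word would turn into a
multi-token phrase.  Both functions are hypothetical illustrations: the
tokenizer simply splits on every non-alphanumeric character, which is
an assumption and not how bogofilter's lexer necessarily behaves.

```python
import re

def tokens(text):
    # Hypothetical tokenizer: split on any run of non-alphanumerics.
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]

def phrases(toks, max_len):
    # Every run of 1..max_len consecutive tokens, joined by a space.
    out = []
    for n in range(1, max_len + 1):
        for i in range(len(toks) - n + 1):
            out.append(" ".join(toks[i:i + n]))
    return out

print(phrases(tokens("Va.l.ium"), 3))
# ['Va', 'l', 'ium', 'Va l', 'l ium', 'Va l ium']
```

The fragments "Va", "l", and "ium" say little on their own, but the
three-token phrase "Va l ium" is exactly the kind of sequence that
would rarely, if ever, show up in ham.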

Tom




