[bogofilter] ESF and redundancy

David Relson relson at osagesoftware.com
Thu May 13 14:00:12 CEST 2004


On 13 May 2004 07:35:18 -0400
Tom Anderson wrote:

> On Wed, 2004-05-12 at 22:30, michael at optusnet.com.au wrote:
> > My point here is that the word 'document' will likely wind up
> > with (say) a 0.3 ham rating as a result of appearing in both
> > ham and spam. But the longer phrase is likely to appear
> > in spam only and will wind up at 0.98 (say). The normal algorithm
> > will place more weight on the longer phrase automagically (as it's
> > a more extreme value).
> 
> Assuming you're right, and that 'your', 'is', and 'attached' also have
> a score of 0.3, is a 0.98 score for the whole phrase sufficient to
> outweigh the hamminess of the four component tokens?  I'd think not. 
> My point is that the four-word phrase is not only a little bit more
> indicative of spam, it's tremendously so.  The more words you have in
> a phrase, the more likely that it is communicating a message, and not
> just random or coincidental.  Therefore, the weight of such a phrase
> should be exponentially higher than a single token, or any shorter
> phrase (such as "attached document is", which may be slightly hammy as
> well).

Hi Tom,

As I understand Markovian chaining, "your document is attached" will
generate many tokens, i.e.

    your
    document
    your:document
    is
    document:is
    your:document:is
    attached
    is:attached
    document:is:attached
    your:document:is:attached
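
For concreteness, here's a minimal sketch of that chaining in Python
(purely illustrative -- bogofilter's real lexer is C, and this ignores
case folding, headers, and the like):

    def chain_tokens(words, max_len=None):
        """Emit every contiguous word sequence ending at each position,
        joined by ':'.  With max_len=None this reproduces the list
        above."""
        tokens = []
        for end in range(1, len(words) + 1):
            start_min = 0 if max_len is None else max(0, end - max_len)
            for start in range(end - 1, start_min - 1, -1):
                tokens.append(":".join(words[start:end]))
        return tokens

    print(chain_tokens("your document is attached".split()))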

This chaining builds in redundancy.  I'd guess that long spammy
phrases will contain shorter spammy phrases and that, since longer
phrases are more likely to be unique, their scores will be more
extreme than those of shorter ones.  Using all the tokens may make it
unnecessary to build special length factors into the calculation.
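
As a back-of-the-envelope check on your question, here's what simple
Graham-style naive-Bayes combining gives (bogofilter actually uses the
Robinson/Fisher calculation, so these numbers are only suggestive):

    import math

    def combine(probs):
        """Graham-style: P = prod(p) / (prod(p) + prod(1 - p))."""
        p = math.prod(probs)
        q = math.prod(1 - x for x in probs)
        return p / (p + q)

    print(combine([0.3, 0.3, 0.3, 0.3]))        # ~0.03 -- solidly hammy
    print(combine([0.3, 0.3, 0.3, 0.3, 0.98]))  # ~0.62 -- past neutral

So a single 0.98 phrase token does drag the message past neutral, but
not "tremendously" -- with chaining, the real push would come from the
shorter chains ("document:is", "is:attached", ...) going spammy too.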

> Note also that this functionality would be particularly useful against
> spam messages crafted by non-native speakers, as they tend to use
> improper grammar which is unlikely to appear in hams.  It would also
> tend to counteract the "gibberish" spams which add spacing and other
> separators where they don't belong, such that "Va.l.ium",
> "Phen.ter.mine", and "Vi.agra" may actually become exponentially more
> spammy rather than just a little bit more.

We've already seen that deliberate misspellings like V1agra are red
flags for bogofilter and that the wordlist holds multiple tokens for
the many spellings of viagra.  Using token chains will cause the
wordlist to grow exponentially.  Such fun!
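
To put a rough number on it (hypothetical, assuming the chaining
sketched above): a message of n words emits n*(n+1)/2 tokens with
unbounded chains, or about k*n with chains capped at k words -- and
since nearly every long chain is unique, most of those become new
wordlist entries:

    def tokens_per_message(n_words, max_len=None):
        """Tokens for one message: n*(n+1)/2 unbounded, or
        k*n - k*(k-1)/2 when chains are capped at max_len words."""
        if max_len is None or max_len >= n_words:
            return n_words * (n_words + 1) // 2
        return max_len * n_words - max_len * (max_len - 1) // 2

    print(tokens_per_message(500))             # 125250 -- unbounded
    print(tokens_per_message(500, max_len=4))  # 1994 -- ~4x the words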

Anyhow, 'tis an opportunity for someone to do some coding and testing
:-)

Regards,

David


