Filters that Fight Back

Matthias Andree matthias.andree at gmx.de
Mon Aug 11 19:05:51 CEST 2003


"Peter Bishop" <pgb at adelard.com> writes:

> On 11 Aug 2003 at 15:16, Matthias Andree wrote:
>
>> So should we drop the "minimum token size" limit to deal with " B R O K
>> E N   U P " tokens?
>> 
>
> Or should the tokeniser treat a sequence of space-separated single
> letters as a single token? e.g.:
>
> B R O K E N   U P
>
> is tokenised as:
>
> B-R-O-K-E-N
> U-P

I don't think so. As Paul Graham suggests: how many of the ham mails
you receive contain

B, E, K, N, O, P, R, U as "words"?

If we took these into account, spammers breaking up words would be doing
us a favour, because they'd deliver 8 spammy tokens (rather than two).

V I A G R A would deliver A, G, I, R, V. The "I" might be hammish or
unsure, and the "A" is a border case that depends on case folding;
without folding, it might be indeterminate. G, R and V would be spammy
tokens.
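The alternative being discussed, dropping the minimum token size so every letter stands alone, can be sketched like this (the function name is illustrative, not bogofilter's real tokeniser):

```python
def tokenise_no_min_length(text):
    """Sketch of a tokeniser with no minimum-token-size limit:
    each whitespace-separated piece, even a single letter,
    becomes its own token."""
    return text.split()  # no length filter applied

tokens = tokenise_no_min_length('B R O K E N   U P')
print(len(tokens), sorted(set(tokens)))
# eight single-letter tokens rather than the two words
```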

-- 
Matthias Andree




More information about the Bogofilter mailing list