Filters that Fight Back
Matthias Andree
matthias.andree at gmx.de
Mon Aug 11 19:05:51 CEST 2003
"Peter Bishop" <pgb at adelard.com> writes:
> On 11 Aug 2003 at 15:16, Matthias Andree wrote:
>
>> So should we drop the "minimum token size" limit to deal with " B R O K
>> E N U P " tokens?
>>
>
> Or should the tokeniser treat a sequence of space-separated single letters
> as a single token? e.g.:
>
> B R O K E N U P
>
> is tokenised as:
>
> B-R-O-K-E-N
> U-P
I don't think so. As Paul Graham suggests: how many of the ham mails you
receive contain B, E, K, N, O, P, R or U as "words"?
If we took these into account, spammers breaking up words would be doing
us a favour, because they'd deliver 8 spammy tokens (rather than two).
"V I A G R A" would deliver A, G, I, R and V: I might be "hammish" or
unsure, A is a borderline case that depends on case folding (without
folding it might be indeterminate), and G, R and V would be spammish
tokens.
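To make the point concrete, here is a minimal sketch (not bogofilter's
actual tokeniser; the regex, the MIN_TOKEN_LEN constant and the function
name are hypothetical) of what dropping the minimum token size would
mean: each single letter in a broken-up word becomes its own token, so
"B R O K E N U P" yields eight tokens the filter can score.

```python
import re

# Hypothetical: a minimum token length of 1 means single letters count.
MIN_TOKEN_LEN = 1

def tokenize(text, fold_case=True):
    """Split text into alphabetic word tokens.

    With MIN_TOKEN_LEN = 1, each letter of a broken-up spam word
    becomes a separate token that can accumulate a spam score.
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    if fold_case:
        tokens = [t.lower() for t in tokens]
    return [t for t in tokens if len(t) >= MIN_TOKEN_LEN]

# "B R O K E N U P" delivers 8 single-letter tokens instead of 2 words.
print(len(tokenize("B R O K E N U P")))            # 8
# With case folding, "V I A G R A" folds to the letters a, g, i, r, v.
print(sorted(set(tokenize("V I A G R A"))))        # ['a', 'g', 'i', 'r', 'v']
```

Without case folding the second call would instead return the uppercase
letters, which is why A's hamminess depends on whether folding is on.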
--
Matthias Andree
More information about the Bogofilter mailing list