pipe chars and the lexer
David Relson
relson at osagesoftware.com
Sun Feb 22 21:31:22 CET 2004
On Sun, 22 Feb 2004 20:46:51 +0100 (CET)
Pavel Kankovsky wrote:
> On Sat, 21 Feb 2004, David Relson wrote:
>
> > Vertical bars are a special character - one of the many that
> > bogofilter uses to delimit tokens. They can easily be included (see
> > patch below). Question: do people want "|" included in tokens ???
>
> There are many other characters spammers use to break tokens:
> apostrophes, backticks, colons, dots, even spaces (e.g. b l a h),
> plus tiny HTML text, bogus HTML tags, digits and characters like @, |,
> and * replacing similar looking letters.
>
> What we really need is some generic method to handle as many ways to
> obscure the token as possible.
>
> Crazy idea: would it make any difference if the lower limit on token
> length was removed? Artificially fragmented words would generate many
> single- or double-character tokens and I'd bet many of them would be
> pretty rare in ham.
pi has tried that and likes it. I tried it and didn't see any
difference.
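For illustration only, here is a minimal Python sketch of the idea under
discussion. It is not bogofilter's lexer: the delimiter set, the
tokenize() function and its min_len parameter are all assumptions made up
for this example. It only shows how an artificially fragmented word turns
into short tokens once the lower length limit is dropped.

```python
import re

# Hypothetical delimiter set: whitespace plus a few of the characters
# mentioned in this thread. bogofilter's real lexer rules differ.
DELIMITERS = re.compile(r"[\s|'`:.*@]+")

def tokenize(text, min_len=3):
    """Split text on DELIMITERS and keep tokens of at least min_len chars."""
    return [t for t in DELIMITERS.split(text.lower()) if len(t) >= min_len]

fragmented = "b l a h v|a|g|r|a cheap"

# With a lower limit of 3, every fragment is discarded.
with_limit = tokenize(fragmented, min_len=3)

# With the limit removed, each fragment becomes its own token.
without_limit = tokenize(fragmented, min_len=1)

print(with_limit)
print(without_limit)
```

Whether the extra short tokens help classification is an empirical
question; the sketch says nothing about that either way.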
FWIW, the number of 1 or 2 character tokens is strictly limited. There
are fewer than 224 possible single-character tokens (i.e. 256 less 32
control characters) and there are fewer than 224*224 possible
double-character tokens.
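The arithmetic behind those bounds, as a small Python sketch. It assumes
8-bit characters with only the 32 control codes excluded; the real counts
are lower, since delimiter characters cannot appear inside a token either.

```python
# Assumption: 8-bit characters, with the 32 control codes (0x00-0x1F) excluded.
CHARSET_SIZE = 256
CONTROL_CHARS = 32

single = CHARSET_SIZE - CONTROL_CHARS  # upper bound on 1-character tokens
double = single * single               # upper bound on 2-character tokens

print(single, double)  # 224 50176
```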