pipe chars and the lexer

David Relson relson at osagesoftware.com
Sun Feb 22 21:31:22 CET 2004


On Sun, 22 Feb 2004 20:46:51 +0100 (CET)
Pavel Kankovsky wrote:

> On Sat, 21 Feb 2004, David Relson wrote:
> 
> > Vertical bars are a special character - one of the many that
> > bogofilter uses to delimit tokens.  They can easily be included (see
> > patch below). Question:  do people want "|" included in tokens ???
> 
> There are many other characters spammers use to break tokens:
> apostrophes, backticks, colons, dots, even spaces (e.g. b l a h),
> plus tiny HTML text, bogus HTML tags, digits and characters like @, |,
> and * replacing similar looking letters.
> 
> What we really need is some generic method to handle as many ways to 
> obscure the token as possible. 
> 
> Crazy idea: would it make any difference if the lower limit on token 
> length was removed? Artificially fragmented words would generate many 
> single- or double- character tokens and I'd bet many of them would be 
> pretty rare in ham.

pi has tried that and likes it.  I tried it and didn't see any
difference.
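
For anyone who wants to see what the experiment looks like, here is a
rough sketch in Python.  It is not bogofilter's lexer; the delimiter
set, the tokenize() helper, and the default minimum length of 3 are
all assumptions made up for illustration.

```python
import re

# Hypothetical delimiter set for illustration only; bogofilter's real
# lexer is more involved and is not reproduced here.
DELIMITERS = re.compile(r"[\s|'`:.*@]+")

def tokenize(text, min_len=3):
    """Split text on the delimiters and drop tokens shorter than min_len."""
    return [tok for tok in DELIMITERS.split(text) if len(tok) >= min_len]

sample = "buy v|a|g|r|a now"
print(tokenize(sample, min_len=3))  # ['buy', 'now']
print(tokenize(sample, min_len=1))  # ['buy', 'v', 'a', 'g', 'r', 'a', 'now']
```

With the lower limit in place the fragmented word vanishes entirely;
with it removed, the fragments show up as single-character tokens.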

FWIW, the number of 1 or 2 character tokens is strictly limited.  There
are fewer than 224 possible single character tokens (i.e. 256 less 32
control characters) and there are fewer than 224*224 possible double
character tokens.
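
Spelling out the arithmetic, as a quick Python check (the variable
names are just for illustration; these are upper bounds, since
delimiter characters never appear inside tokens):

```python
BYTE_VALUES = 256    # possible values of one 8-bit character
CONTROL_CHARS = 32   # control characters excluded from tokens

single = BYTE_VALUES - CONTROL_CHARS  # bound on 1-character tokens
double = single * single              # bound on 2-character tokens

print(single)           # 224
print(double)           # 50176
print(single + double)  # 50400
```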




More information about the bogofilter-dev mailing list