pipe chars and the lexer

Pavel Kankovsky peak at argo.troja.mff.cuni.cz
Sun Feb 22 20:46:51 CET 2004


On Sat, 21 Feb 2004, David Relson wrote:

> Vertical bars are a special character - one of the many that bogofilter
> uses to delimit tokens.  They can easily be included (see patch below).
> Question:  do people want "|" included in tokens ???

There are many other characters spammers use to break tokens:
apostrophes, backticks, colons, dots, even spaces (e.g. b l a h),
plus tiny HTML text, bogus HTML tags, and digits and characters like @, |,
and * replacing similar-looking letters.

What we really need is some generic method to handle as many ways to 
obscure the token as possible. 

Crazy idea: would it make any difference if the lower limit on token 
length were removed? Artificially fragmented words would generate many 
single- or double-character tokens and I'd bet many of them would be 
pretty rare in ham.
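
To make the idea concrete, here is a rough Python sketch. It is not
bogofilter's lexer: the regex, the tokenize() and short_token_ratio()
names, and the default limit of 3 are assumptions made up for
illustration. It only shows how a fragmented word disappears when short
tokens are dropped and turns into a burst of one-character tokens when
they are kept.

```python
import re

# Assumption: a token is any run of ASCII letters or digits;
# every other character (|, ., space, ...) acts as a delimiter.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def tokenize(text, min_len=3):
    """Split text on delimiters and drop tokens shorter than min_len.

    min_len=3 is a hypothetical default chosen for this sketch.
    """
    return [t.lower() for t in TOKEN_RE.findall(text) if len(t) >= min_len]


def short_token_ratio(text, short_len=2):
    """Fraction of tokens with at most short_len characters, no lower limit applied."""
    tokens = tokenize(text, min_len=1)
    if not tokens:
        return 0.0
    short = [t for t in tokens if len(t) <= short_len]
    return len(short) / len(tokens)


if __name__ == "__main__":
    obfuscated = "buy v|a|g|r|a now, c.h.e.a.p p i l l s"
    plain = "Please send the report by Friday"

    # With the lower limit, the fragmented words vanish entirely.
    print(tokenize(obfuscated))
    # Without it, every fragment becomes a token of its own.
    print(tokenize(obfuscated, min_len=1))
    # Share of very short tokens in each message.
    print(short_token_ratio(obfuscated), short_token_ratio(plain))
```

Whether those short tokens really are rare in ham is exactly the open
question; the sketch only demonstrates the mechanics, and it would take
a run against a real corpus to tell.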

--Pavel Kankovsky aka Peak  [ Boycott Microsoft--http://www.vcnet.com/bms ]
"Resistance is futile. Open your source code and prepare for assimilation."





More information about the bogofilter-dev mailing list