pipe chars and the lexer
Pavel Kankovsky
peak at argo.troja.mff.cuni.cz
Sun Feb 22 20:46:51 CET 2004
On Sat, 21 Feb 2004, David Relson wrote:
> Vertical bars are a special character - one of the many that bogofilter
> uses to delimit tokens. They can easily be included (see patch below).
> Question: do people want "|" included in tokens ???
There are many other characters spammers use to break tokens:
apostrophes, backticks, colons, dots, even spaces (e.g. b l a h),
plus tiny HTML text, bogus HTML tags, and digits and characters like @, |,
and * replacing similar-looking letters.
What we really need is some generic method to handle as many ways to
obscure the token as possible.
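
To make the idea concrete, here is a rough sketch of what one such
normalizing pass might look like. It is not bogofilter code: the
lookalike table, the filler set and the normalize() helper are all
assumptions made up for illustration, and it covers only two of the
tricks listed above (spaced-out letters and lookalike characters).

```python
import re

# Hypothetical lookalike table; a real one would be larger and ambiguous
# in places (e.g. "|" can stand for "l" or "i", or just be a separator).
LOOKALIKES = str.maketrans({"@": "a", "|": "l", "0": "o", "$": "s"})

# Assumed set of filler characters placed between single letters.
FILLER = r"[ .'`:]"

# Three or more single word characters, each separated by one filler.
SPACED = re.compile(r"\b(?:\w" + FILLER + r"){2,}\w\b")


def normalize(text):
    """Collapse spaced-out words, then map lookalike characters to letters."""
    collapsed = SPACED.sub(lambda m: re.sub(FILLER, "", m.group(0)), text)
    return collapsed.translate(LOOKALIKES).lower()


print(normalize("b l a h"))
print(normalize("v.i.a.g.r.a"))
print(normalize("c@sh and m0ney"))
```

Every rule added this way is one more thing spammers can route around,
which is the weakness of the table-driven approach.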
Crazy idea: would it make any difference if the lower limit on token
length were removed? Artificially fragmented words would generate many
single- or double-character tokens and I'd bet many of them would be
pretty rare in ham.
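
A toy illustration of the effect, not the real lexer: the delimiter
set, the tokenize() helper and the default minimum length of 3 are
assumptions chosen for the example.

```python
import re

# Assumed delimiter rule: anything that is not a letter or digit splits tokens.
DELIMITERS = re.compile(r"[^A-Za-z0-9]+")


def tokenize(text, min_len=3):
    """Split on delimiters and keep tokens of at least min_len characters."""
    return [t.lower() for t in DELIMITERS.split(text) if len(t) >= min_len]


obfuscated = "cheap v|a|g|r|a and b l a h today"

with_limit = tokenize(obfuscated, min_len=3)  # fragments are discarded
no_limit = tokenize(obfuscated, min_len=1)    # fragments survive as tokens

print(with_limit)
print(no_limit)
```

With the limit in place the fragmented words vanish from the token
stream entirely; without it they show up as a burst of one-character
tokens, which is the signal the classifier could then learn from.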
--Pavel Kankovsky aka Peak [ Boycott Microsoft--http://www.vcnet.com/bms ]
"Resistance is futile. Open your source code and prepare for assimilation."
More information about the bogofilter-dev mailing list