pipe chars and the lexer

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Sun Feb 22 21:25:59 CET 2004


"Pavel Kankovsky" <peak at argo.troja.mff.cuni.cz> wrote:

>> Vertical bars are a special character - one of the many that bogofilter
>> uses to delimit tokens.  They can easily be included (see patch below).
>> Question:  do people want "|" included in tokens ???
>
>There are many other characters spammers use to break tokens:
>apostrophes, backticks, colons, dots, even spaces (e.g. b l a h),
>plus tiny HTML text, bogus HTML tags, digits and characters like @, |,
>and * replacing similar looking letters.

Right, that causes no problem.

>What we really need is some generic method to handle as many ways to 
>obscure the token as possible. 

The bad thing about the standard lexer is that it ignores
one-letter tokens. Modified versions of the lexer accept
them. I have been using these successfully for quite some time:
http://piology.org/bogofilter/

>Crazy idea: would it make any difference if the lower limit on token 
>length was removed? 

That is not crazy and has been discussed before. I believe
the reason we have this lower limit is the way the token
expression is built. It wants something different at the
beginning, in the middle and at the end. The lexers above
reduce that significantly and accept short tokens. Try them
if you like.
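
To illustrate the point about the token expression, here is
a rough sketch in Python (not the actual flex rules). The
character classes and the names FRONT, MID, BACK, THREE_PART
and SIMPLE are made up for this example. It only shows why a
pattern that wants separate beginning, middle and end parts
implies a minimum token length, while a single repeated
class does not.

```python
import re

# Hypothetical character classes, NOT bogofilter's real definitions.
FRONT = r"[A-Za-z$]"          # what a token may start with
MID = r"[A-Za-z0-9$'.\-]"     # what may appear inside a token
BACK = r"[A-Za-z0-9]"         # what a token may end with

# Front + middle* + back: every match is at least two characters
# long, so one-letter tokens can never be produced.
THREE_PART = re.compile(FRONT + MID + "*" + BACK)

# One class repeated one or more times: one-letter tokens are accepted.
SIMPLE = re.compile(r"[A-Za-z0-9$]+")


def tokens(pattern, text):
    """Return all non-overlapping matches of pattern in text."""
    return pattern.findall(text)


if __name__ == "__main__":
    for sample in ("b l a h", "buy v|agra"):
        print(repr(sample))
        print("  three-part:", tokens(THREE_PART, sample))
        print("  simple:    ", tokens(SIMPLE, sample))
```

With the three-part pattern the spaced-out "b l a h" yields
no tokens at all, and the "v" in "v|agra" is lost. The
simple pattern keeps the single letters, so they are at
least available for scoring.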

pi




More information about the bogofilter-dev mailing list