Ways to trick the lexer

Andreas Pardeike andreas at pardeike.net
Fri Jun 8 22:21:01 CEST 2007


Hi,

I am getting hundreds of spams with variations of the subject
"Sexually explicit". They create tokens like

subj:SEIX8UALLY-E8XPLICITI

in the database, and since they differ from each other in at least
one letter, they all get counts of 1. As a result, none of those
seemingly random letter strings will get a high spam score.

Is this behaviour intended? Wouldn't splitting on more boundaries
give higher word counts, resulting in e.g.

subj:UALLY
...

or at least

subj:SEIX8UALLY
subj:E8XPLICITI

?
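
To illustrate the idea, here is a minimal Python sketch. It is not
bogofilter's actual lexer: the function names are made up, and only the
first subject is a real one from my database; the other two are
hypothetical variants of the kind I am seeing. It compares keeping the
hyphenated subject as one token with also splitting on non-alphanumeric
characters.

```python
import re
from collections import Counter

# Only the first subject is real; the other two are hypothetical variants.
subjects = [
    "SEIX8UALLY-E8XPLICITI",
    "SEIX8UALLY-EXPLI5CIT",
    "SEXU7ALLY-E8XPLICITI",
]


def whole_tokens(subject):
    """Keep the hyphenated subject as a single token (what I observe now)."""
    return ["subj:" + subject]


def split_tokens(subject):
    """Assumed alternative: split on every non-alphanumeric character."""
    parts = re.split(r"[^A-Za-z0-9]+", subject)
    return ["subj:" + part for part in parts if part]


def count(tokenizer):
    """Count how often each token occurs across all sample subjects."""
    counts = Counter()
    for subject in subjects:
        counts.update(tokenizer(subject))
    return counts


whole = count(whole_tokens)
split = count(split_tokens)

print("whole:", dict(whole))
print("split:", dict(split))
```

With these sample subjects every whole token is seen only once, while
subj:SEIX8UALLY and subj:E8XPLICITI are each seen twice after
splitting. Whether that would actually improve the scoring is exactly
what I am asking.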

Regards,
Andreas Pardeike


