Ways to trick the lexer
Andreas Pardeike
andreas at pardeike.net
Fri Jun 8 22:21:01 CEST 2007
Hi,
I am getting hundreds of spams with subject "Sexually explicit"
variations. They create tokens like
subj:SEIX8UALLY-E8XPLICITI
in the database, and since they differ from each other by at least
one letter, they all get counts of 1. As a result, none of those
seemingly random letter combinations will get a high spam score.
Is this behaviour intended? Wouldn't splitting on more boundaries
give higher word counts, e.g. for tokens like
subj:UALLY
...
or at least
subj:SEIX8UALLY
subj:E8XPLICITI
?
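
To illustrate what I mean, here is a minimal Python sketch. It is not
bogofilter's lexer, only a toy tokenizer; the second subject line and
all function and variable names are made up for the example. It counts
tokens for two obfuscated subjects under three assumed splitting rules.

```python
import re
from collections import Counter

def tokenize(subject, pattern):
    """Split a subject on the separator pattern and prefix each piece with 'subj:'."""
    return ["subj:" + piece for piece in re.split(pattern, subject) if piece]

# The first subject is the one quoted above; the second is a hypothetical variant.
subjects = ["SEIX8UALLY-E8XPLICITI", "SEX4UALLY-EXPLI3CIT"]

# Assumed splitting rules, from coarse to fine (not bogofilter's actual rules).
WHOLE = r"\s+"        # whitespace only: each obfuscated subject stays one token
HYPHEN = r"[\s-]+"    # also split on hyphens
DIGITS = r"[\s\d-]+"  # also split on digits

def count_tokens(pattern):
    """Count how often each token occurs across all subjects."""
    counts = Counter()
    for subject in subjects:
        counts.update(tokenize(subject, pattern))
    return counts

whole_counts = count_tokens(WHOLE)
hyphen_counts = count_tokens(HYPHEN)
digit_counts = count_tokens(DIGITS)

for name, counts in [("whole", whole_counts),
                     ("hyphen", hyphen_counts),
                     ("digits", digit_counts)]:
    print(name, dict(counts))
```

With the coarse rules every token is seen only once, while splitting on
digits as well lets subj:UALLY occur in both subjects. The price is a
number of very short fragments, which may or may not be useful as tokens.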
Regards,
Andreas Pardeike