Ways to trick the lexer
John G Walker
johngwalker at tiscali.co.uk
Fri Jun 8 23:08:24 CEST 2007
On Fri, 8 Jun 2007 22:21:01 +0200 Andreas Pardeike
<andreas at pardeike.net> wrote:
> Hi,
>
> I am getting hundreds of spams with subject "Sexually explicit"
> variations. They create tokens like
>
> subj:SEIX8UALLY-E8XPLICITI
>
> in the database and since they vary in at least one letter from
> each other, they all get counts of 1. As a result, none of those
> seemingly random letters will get high spam scores.
>
> Is this behaviour intended? Wouldn't splitting on more boundaries,
> and thus a higher word count, result in e.g.
>
> subj:UALLY
> ...
>
> or at least
>
> subj:SEIX8UALLY
> subj:E8XPLICITI
>
> ?
>
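The splitting the poster suggests could be sketched roughly like this (a hypothetical illustration only, not bogofilter's actual lexer: here digits and punctuation are treated as token boundaries, so the stable fragments are shared across spam variants):

```python
import re

def split_token(token):
    """Split a mangled subject token on any non-letter character,
    dropping empty pieces, so fragments like 'UALLY' recur across
    randomly obfuscated variants."""
    return [part for part in re.split(r"[^A-Za-z]+", token) if part]

print(split_token("SEIX8UALLY-E8XPLICITI"))
# ['SEIX', 'UALLY', 'E', 'XPLICITI']
```

With digits as boundaries, the fragment "UALLY" would be counted the same way in every variant, instead of each whole mangled token getting a count of 1.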
If you're training them as spam, then they'll get treated as spam once
enough of them have come your way.
None of this sort of spam ever gets through my filters, and hasn't
done since the first few days after I started using bogofilter. These
messages come in their hundreds, and bogofilter learns fast!
--
All the best,
John