Ways to trick the lexer
John G Walker
johngwalker at tiscali.co.uk
Fri Jun 8 23:08:24 CEST 2007
On Fri, 8 Jun 2007 22:21:01 +0200 Andreas Pardeike
<andreas at pardeike.net> wrote:
> Hi,
>
> I am getting hundreds of spams with subject "Sexually explicit"
> variations. They create tokens like
>
> subj:SEIX8UALLY-E8XPLICITI
>
> in the database and since they vary in at least one letter from
> each other, they all get counts of 1. As a result, none of those
> seemingly random letters will get high spam scores.
>
> Is this behaviour intended? Wouldn't splitting on more boundaries,
> and thus a higher word count, result in e.g.
>
> subj:UALLY
> ...
>
> or at least
>
> subj:SEIX8UALLY
> subj:E8XPLICITI
>
> ?
>
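The splitting the poster suggests could be sketched roughly like this (a hypothetical illustration only, not bogofilter's actual lexer: here digits and punctuation are treated as token boundaries, so the stable fragments are shared across spam variants):

```python
import re

def split_token(token):
    """Split a mangled subject token on any non-letter character,
    dropping empty pieces, so fragments like 'UALLY' recur across
    randomly obfuscated variants."""
    return [part for part in re.split(r"[^A-Za-z]+", token) if part]

print(split_token("SEIX8UALLY-E8XPLICITI"))
# ['SEIX', 'UALLY', 'E', 'XPLICITI']
```

With digits as boundaries, the fragment "UALLY" would be counted the same way in every variant, instead of each whole mangled token getting a count of 1.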
If you're training them as spam, then they'll get treated as spam once
enough of them have come your way.
None of this sort of spam ever gets through my filters, and hasn't
done since the first few days after I started using bogofilter. These
messages come in their hundreds, and bogofilter learns fast!
--
All the best,
John