spaced out spam words

David Relson relson at osagesoftware.com
Fri Jun 9 04:21:24 CEST 2006


On Thu, 8 Jun 2006 20:44:34 -0400
Jeff Kinz wrote:

> On Fri, Jun 09, 2006 at 01:51:37AM +0200, Matthias Andree wrote:
> > Jeff Kinz <jkinz at kinz.org> writes:
> > > My local bf install is having trouble with spam words that have
> > > been spaced out.  Which means that the words are rendered with a
> > > space between each L E T T E R.  <== like that.
> > >
> > > They are not getting flagged.  Is anyone else having trouble
> > > with content like this not getting flagged by bf?
> > 
> > Bogofilter does not consider words shorter than three characters to
> > be tokens and ignores them. Some of them have been slipping through
> > mine as well. On the other hand, messages like these can be smashed
> > out with a maildrop or procmail rule as well.
> 
> I have found that 
>          egrep "[^ ] [^ ] [^ ] [^ ]"  <file> 
> 
> seems to detect them, but since I worry about that being too simplistic
> a test, I have been transforming the egrepped lines and then running
> those lines through bf to see if they score as spam.
> 
> sed -e 's/[<>][^<>]*[<>]//g' -e 's/ //g' | bogofilter -TT -o
> 0.49999 | sed -e 's/[.]//' -e 's/\([01][0-9][0-9]\)\(.*$\)/\1/'
> 
> 
> The problem with this approach is that if they start putting the
> spaced words on the same line as other words, (spaced or not),
> stripping out the spaces will run them all together producing
> something "non-spammy".
> 
> Then - even though this works right now - if BF won't notice any token
> shorter than three characters, it does me no good to run the original
> text through BF, and running the transformed text won't help identify
> this type of spam either?  Is that correct?
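A somewhat tighter detector than the quoted egrep might look like the
sketch below (not part of bogofilter; it assumes grep -E and restricts
the run to letters, so stray punctuation and digits are less likely to
trigger it than with "[^ ] [^ ] [^ ] [^ ]"):

```shell
#!/bin/sh
# Sketch: flag lines containing a run of at least four spaced-out letters,
# e.g. "V I A G R A".  Restricting the class to letters makes this a bit
# less trigger-happy than matching any non-space characters.
spaced_pattern='([[:alpha:]] ){3}[[:alpha:]]'

echo 'B U Y V I A G R A' | grep -qE "$spaced_pattern" && echo "spaced run found"
echo 'ordinary text, no spaced words' | grep -qE "$spaced_pattern" || echo "clean"
# prints:
#   spaced run found
#   clean
```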

Boris 'pi' Piwinger has been running a customized version of bogofilter
that includes 1- and 2-character tokens in the wordlist and the
calculations.  I've got a patch (somewhere) that allows setting both
minimum and maximum token lengths, and could likely find it if you're
interested.  I've thought of that patch as a step toward the ability to
build multi-word tokens (with '*' separators).  For*example here*are
several*double-word tokens.
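As a rough illustration of the token shape that idea would produce (a
sketch only - real integration would happen inside bogofilter's lexer,
not in a pipeline), adjacent words can be paired up with '*' separators:

```shell
#!/bin/sh
# Sketch: emit overlapping "word1*word2" pairs for a line of text.
# This only illustrates what double-word tokens would look like.
echo "buy cheap pills now" |
awk '{ for (i = 1; i < NF; i++) print $i "*" $(i + 1) }'
# prints:
#   buy*cheap
#   cheap*pills
#   pills*now
```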

Another approach would be to do some special processing for runs of
single-character tokens.  I don't have code that would do that, and I
don't know if it's worth testing.
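One form that special processing might take, as a rough sketch (a
hypothetical pre-filter, not existing bogofilter code): collapse any run
of three or more single-letter "words" into one token before scoring.
Because only long runs are joined, it avoids the problem Jeff describes
of gluing ordinary neighboring words together:

```shell
#!/bin/sh
# Sketch: rejoin runs of three or more single-character "words"
# (e.g. "V I A G R A") into one token, leaving the rest of the line
# untouched, so bogofilter sees a scoreable word.
collapse_spaced() {
  awk '
  function flush(   j) {
    if (n >= 3) { for (j = 1; j <= n; j++) out = out run[j]; out = out " " }
    else          for (j = 1; j <= n; j++) out = out run[j] " "
    n = 0
  }
  {
    out = ""; n = 0
    for (i = 1; i <= NF; i++) {
      if (length($i) == 1) run[++n] = $i      # buffer single-letter words
      else { flush(); out = out $i " " }      # normal word: flush buffer
    }
    flush()
    sub(/ +$/, "", out)
    print out
  }'
}

echo "Buy V I A G R A now at a low price" | collapse_spaced
# prints: Buy VIAGRA now at a low price
```

Note that short runs (one or two single letters, like "a" or "I") are
left alone, so ordinary English text passes through unchanged.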

HTH,

David


