spaced out spam words

Jeff Kinz jkinz at kinz.org
Fri Jun 9 02:44:34 CEST 2006


On Fri, Jun 09, 2006 at 01:51:37AM +0200, Matthias Andree wrote:
> Jeff Kinz <jkinz at kinz.org> writes:
> > My local bf install is having trouble with spam words that have
> > been spaced out.  Which means that the words are rendered with a space
> > between each L E T T E R.  <== like that.
> >
> > They are not getting flagged.  Is anyone else having trouble
> > with content like this not getting flagged by bf?
> 
> Bogofilter does not consider words shorter than three characters a token
> and ignores them. Some of them have been slipping through mine as
> well. On the other hand, messages like these can be smashed out with a
> maildrop or procmail rule as well.

I have found that 
         egrep "[^ ] [^ ] [^ ] [^ ]"  <file> 

seems to detect them, but since I worry about that being too simplistic
a test, I have been transforming the egrepped lines and then running
those lines through bf to see if they score as spam.
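
To show why the plain pattern may be too simplistic: it also matches
ordinary text such as "plan a b test" (via "n a b t"), because nothing
anchors the single characters to word boundaries. A minimal sketch of a
stricter test follows, assuming POSIX "grep -E" in place of egrep. The
function name spaced_lines is hypothetical, and the threshold of four
spaced characters is only carried over from the pattern above.

```shell
#!/bin/sh
# Hypothetical helper: print only lines containing a run of at least four
# single alphanumeric characters, each separated by exactly one space.
# The run must start at line start or after a non-alphanumeric character,
# and end at line end or before one, so the tail of one ordinary word and
# the head of the next cannot be counted as part of the run.
spaced_lines() {
    grep -E '(^|[^[:alnum:]])[[:alnum:]]( [[:alnum:]]){3,}([^[:alnum:]]|$)'
}

# The spaced word is reported; the ordinary line is not.
printf 'buy V I A G R A now\nplan a b test\n' | spaced_lines
```

This still flags legitimate lines made of four or more single letters,
so it is a pre-filter rather than a verdict.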

sed -e 's/[<>][^<>]*[<>]//g' -e 's/ //g' | bogofilter -TT -o 0.49999 |
sed -e 's/[.]//' -e 's/\([01][0-9][0-9]\)\(.*$\)/\1/'


The problem with this approach is that if they start putting the spaced
words on the same line as other words (spaced or not), stripping out
the spaces will run them all together, producing something "non-spammy".
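
One possible way around the run-together problem is to join only runs
of single characters and leave ordinary words alone. The following is a
minimal sketch, not a tested filter: the function name unspace is
hypothetical, and it assumes that spaced-out words are separated from
each other by more than one space, or by a normal word.

```shell
#!/bin/sh
# Hypothetical helper: join consecutive single-character fields that are
# separated by exactly one space; everything else keeps its spacing.
# FS is the regex "[ ]", so every single space is a separator and a
# double space yields an empty field, which breaks a run.
unspace() {
    awk -F'[ ]' '{
        out = ""
        prev_single = 0
        for (i = 1; i <= NF; i++) {
            single = (length($i) == 1)
            # Drop the space only between two single-character fields.
            if (i > 1 && !(single && prev_single))
                out = out " "
            out = out $i
            prev_single = single
        }
        print out
    }'
}

echo "buy V I A G R A now" | unspace    # prints "buy VIAGRA now"
```

The output could then be piped into the same bogofilter -TT command as
above. Two spaced words separated by only a single space would still be
merged into one token, and genuine neighbouring one-letter words would
be joined as well.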

Then, even though this works right now, if BF won't notice any token
shorter than three chars, it does me no good to run the original text
through BF, and running the transformed text won't help it identify this
type of spam either.  Is that correct?


-- 
Jeff Kinz, Emergent Research, Hudson, MA.
Speech Recognition Technology was used to create this e-mail



