ideas, what to do? [was: image-only spam ... ]

Thu Dec 14 20:06:06 CET 2006

On Dec 14, 2006, at 7:34 AM, David Relson wrote:

>
> Bogofilter's newest capability, multi-word tokens, was initially
> implemented by an ISP and found effective.  For example, using double
> word tokens the phrase "big difference" becomes 3 tokens, i.e. "big",
> "difference", and "big*difference".  Word combinations provide a
> measure of meaning and context within the message that you don't
> have with single word tokens.  Using double word tokens roughly doubles
> the number of tokens in a message and has a comparable effect on the
> wordlist and processing time.  If you want to go wild with this
> capability, it supports "n" word tokens, i.e. you can set the multiple
> as high as you want, i.e. 2, 3, 5, 10, ...
>
> Personally I'm still using single word tokens.  My incoming spam load
> has increased by 185% since this spring and bogofilter is still doing
> well, though image spam has increased the number of unsures.  This
> month, I'm seeing 7 unsures per 1000 spam with most of them being
> offers of "Office 2007 for $79".  Last month it was a different
> subject causing trouble.

I've been playing with dspam, which uses two methods that were new
to me: chained Bayes (which I think is another name for double-word
tokens) and something called Sparse Binary Polynomial Hashing (SBPH),
which takes a slightly different approach to deterministic probability.
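
For what it's worth, here is a minimal Python sketch of the double-word
(and, more generally, n-word) tokens David describes above.  It only
illustrates the idea: the function name, the argument names and the
output order are my own assumptions, not bogofilter's or dspam's actual
code.  The "*" joiner is taken from his "big*difference" example.

```python
def multiword_tokens(words, n=2, sep="*"):
    """Return single-word tokens plus every run of 2..n adjacent words.

    Hypothetical helper for illustration; not taken from any filter's source.
    """
    tokens = []
    for i, word in enumerate(words):
        # The plain single-word token.
        tokens.append(word)
        # Runs of k adjacent words that end at the current word.
        for k in range(2, n + 1):
            if i - k + 1 >= 0:
                tokens.append(sep.join(words[i - k + 1 : i + 1]))
    return tokens


print(multiword_tokens(["big", "difference"]))
# -> ['big', 'difference', 'big*difference']
```

With n=2, a message of N words yields N + (N - 1) tokens, which matches
the "roughly doubles" remark.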

I found SBPH and the chained rules to be extremely effective in
filtering out the spam.  One of the more effective aspects (I think)
is that they examine a much larger window of tokens than just two.
One thing I've found common in the graphic spam is that the text
consists of random sentences.  They are full sentences, but the tail
of one has nothing to do with the head of the next.  Because the words
within a given proximity (say 5 words) are so unrelated, the filter
discriminated much better.
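
A rough Python sketch of the SBPH-style windowing, as I understand it:
slide a window (5 words here) over the text and, for each new word,
emit every combination of the earlier words in the window, kept or
skipped, followed by the new word.  The function name, the "<skip>"
placeholder, and returning strings instead of hashes are assumptions
for illustration; this is not dspam's implementation.

```python
from itertools import product


def sbph_features(words, window=5, skip="<skip>"):
    """Generate SBPH-like features from a sliding window of words.

    Simplified sketch: real implementations hash each feature; here the
    features are left as readable strings.
    """
    features = []
    for i, newest in enumerate(words):
        # Up to window-1 words preceding the newest word.
        earlier = words[max(0, i - window + 1) : i]
        # Every keep/skip pattern over the earlier words; the newest
        # word is always kept.
        for mask in product((False, True), repeat=len(earlier)):
            parts = [w if keep else skip for w, keep in zip(earlier, mask)]
            parts.append(newest)
            features.append(" ".join(parts))
    return features


feats = sbph_features("office 2007 for only 79".split())
print(len(feats))  # 1 + 2 + 4 + 8 + 16 = 31
```

Each word in a full 5-word window produces 2^4 = 16 features, versus 2
for double-word tokens, which goes some way to explaining the extra
processing time.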

I was (roughly) picking out this graphics spam with >95% overall
success after 1000 total emails had been examined.

The downside of SBPH is that the process is computationally intensive.
My filters went from <1 sec to >10 sec processing time on an extremely
small box.  I'm sure if I had something slightly more capable I
wouldn't see these problems.  I use an Epia 533MHz 512MB-RAM Mini-ITX
box for my spam filtering.  It also has a half-speed FPU, so it's
really going to suffer under heavy computation.

But I thought it worth sharing.  Not sure what else the community is  
doing or what other options are worth pursuing.