SPAN style="DISPLAY: none" spams

David Relson relson at osagesoftware.com
Mon Jul 18 23:32:31 CEST 2005


On Mon, 18 Jul 2005 10:29:29 -0700
Chris Fortune wrote:

> been noticing well formed MIME mails that are defeating bogofilter lately, with scores around .3 :
> 
> 
> 
> text section:
> lots of innocent text
> 
> html section:
> <SPAN style=3D"DISPLAY: none">lots of innocent text</span>
>  some spam<SPAN style=3D"DISPLAY: none">lots of innocent text</span>-ish text.
> 
> 
> 
> 
> mostly innocent text in these mails.  5kb to deliver a one line "cheap meds" spam

Hi Chris,

It's not really anything new, at least as far as bogofilter is concerned.

Since bogofilter extracts the tokens and ignores most of what's within
html angle brackets, the above is effectively the same as:

 text section:
  lots of innocent text (part 1)
  lots of innocent text (part 2)
  lots of innocent text (part 3)

 html section:
  some spam

Whether there's one big chunk of innocent text or three little ones
doesn't matter.  If you want to see which tokens bogofilter is using to
score the message, run

  bogofilter -vvv < msg | grep "+$"
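
To make the angle-bracket point concrete, here is a rough Python sketch.
It is not bogofilter's actual lexer, only an illustration under simple
assumptions: every tag is replaced by a space, and a token is a run of
letters, digits, apostrophes and hyphens.  The names visible_tokens, TAG
and WORD are made up for this example, and the sample uses the decoded
form of the quoted-printable "=3D".

```python
import re

# Assumption: anything between "<" and ">" is markup and carries no tokens.
TAG = re.compile(r"<[^>]*>")
# Assumption: a token starts with a letter; bogofilter's real rules differ.
WORD = re.compile(r"[A-Za-z][A-Za-z0-9'-]*")


def visible_tokens(html):
    """Drop the tags themselves, then split the remaining text into tokens."""
    text = TAG.sub(" ", html)
    return [word.lower() for word in WORD.findall(text)]


# The html section from the message above, after quoted-printable decoding.
sample = (
    '<SPAN style="DISPLAY: none">lots of innocent text</span>\n'
    ' some spam<SPAN style="DISPLAY: none">lots of innocent text</span>-ish text.'
)

tokens = visible_tokens(sample)
print(tokens)
```

The style attribute never reaches the token list, but the hidden
"innocent" words do, which is why they still pull the score down.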

Bogofilter's parser could be modified to disregard everything between
<span ...> and </span>.  If you're interested in experimenting, I've
created a patch that _should_ do what you want.  It definitely compiles,
but I can't say for sure that it behaves as intended.
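
For anyone who wants to try the idea outside bogofilter first, the
following Python sketch drops text that sits inside a hidden span.  It
is not the attached patch and not bogofilter code; it is only a minimal
illustration using the standard html.parser module.  The class name
VisibleText, the helper visible_text and the "display: none" regular
expression are assumptions made for this example.

```python
import re
from html.parser import HTMLParser

# Assumption: a span counts as hidden when its style says "display: none".
HIDDEN = re.compile(r"display\s*:\s*none", re.IGNORECASE)


class VisibleText(HTMLParser):
    """Collect text, skipping anything nested inside a hidden <span>."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one bool per open <span>: True if it is hidden
        self.chunks = []  # text found outside hidden spans

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "span":
            style = dict(attrs).get("style") or ""
            self.stack.append(bool(HIDDEN.search(style)))

    def handle_endtag(self, tag):
        if tag == "span" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Keep the text only if no enclosing span is hidden.
        if not any(self.stack):
            self.chunks.append(data)


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end
    return " ".join("".join(parser.chunks).split())


sample = (
    '<SPAN style="DISPLAY: none">lots of innocent text</span>\n'
    ' some spam<SPAN style="DISPLAY: none">lots of innocent text</span>-ish text.'
)
print(visible_text(sample))  # prints: some spam-ish text.
```

Note that this sketch only recognises inline style attributes.  Spam
that hides text through a stylesheet class, a tiny font size or
matching foreground and background colours would pass straight through.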

Enjoy your experimentation ;->

HTH,

David

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: patch.0718.lexer_v3.l
URL: <http://www.bogofilter.org/pipermail/bogofilter/attachments/20050718/c3823bdb/attachment.ksh>

