SPAN style="DISPLAY: none" spams

David Relson relson at osagesoftware.com
Mon Jul 18 23:32:31 CEST 2005


On Mon, 18 Jul 2005 10:29:29 -0700
Chris Fortune wrote:

> been noticing well formed MIME mails that are defeating bogofilter lately, with scores around .3 :
> 
> 
> 
> text section:
> lots of innocent text
> 
> html section:
> <SPAN style=3D"DISPLAY: none">lots of innocent text</span>
>  some spam<SPAN style=3D"DISPLAY: none">lots of innocent text</span>-ish text.
> 
> 
> 
> 
> mostly innocent text in these mails.  5kb to deliver a one line "cheap meds" spam

Hi Chris,

It's not really anything new, at least as far as bogofilter is concerned.

Since bogofilter extracts the tokens and ignores most of what's within
html angle brackets, the above is effectively the same as:

 text section:
  lots of innocent text (part 1)
  lots of innocent text (part 2)
  lots of innocent text (part 3)

 html section:
  some spam

Whether there's 1 big chunk of innocent text or 3 little ones doesn't
matter.  If you want to see which tokens bogofilter is using to score
the message, run

  bogofilter -vvv < msg | grep "+$"
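
For illustration only, here is a minimal Python sketch of the effect
described above: whatever sits between angle brackets is dropped, and
everything else, including the "hidden" span text, is still tokenized.
This is not bogofilter's lexer (that is the flex grammar in
src/lexer_v3.l); the name visible_tokens and both regexes are
assumptions made up for this sketch.

```python
import re

# Hypothetical stand-ins for the real lexer rules (assumptions, not bogofilter code).
TAG = re.compile(r"<[^>]*>")      # anything between angle brackets
WORD = re.compile(r"[A-Za-z]+")   # a deliberately crude notion of a "token"


def visible_tokens(html):
    """Drop the innards of tags, then split what remains into words."""
    return WORD.findall(TAG.sub(" ", html))


html = ('<SPAN style="DISPLAY: none">lots of innocent text</span>'
        ' some spam<SPAN style="DISPLAY: none">lots of innocent text</span>-ish text.')

# The style attribute disappears with the tag, but the hidden text survives.
print(visible_tokens(html))
```

The span's style attribute never reaches the token list, so the hidden
text counts exactly as if it had been visible.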

Bogofilter's parser could be modified to disregard <span...> ...
</span>.  If you're interested in experimenting, I've created a patch
that _should_ do what you want.  It definitely compiles, but I can't
say for sure that it behaves as intended.
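
As a rough picture of what such a change amounts to, the Python sketch
below throws away everything from an opening span tag up to the next
closing one before tokenizing the rest.  It is only an approximation
under stated assumptions: the function name tokens_skipping_spans, the
regexes, and the case-insensitive matching are hypothetical and are not
taken from bogofilter.

```python
import re

# Hypothetical patterns for this sketch only (assumptions, not bogofilter code).
SPAN_BLOCK = re.compile(r"<span\b.*?</span>", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]*>")      # any remaining tag
WORD = re.compile(r"[A-Za-z]+")   # crude token pattern


def tokens_skipping_spans(html):
    """Discard whole <span>...</span> blocks, then drop other tags and tokenize."""
    without_spans = SPAN_BLOCK.sub(" ", html)
    return WORD.findall(TAG.sub(" ", without_spans))


html = ('<SPAN style="DISPLAY: none">lots of innocent text</span>'
        ' some spam<SPAN style="DISPLAY: none">lots of innocent text</span>-ish text.')

# Only the text outside the spans is left to be scored.
print(tokens_skipping_spans(html))
```

Note that this approach discards every span, not just those styled
"DISPLAY: none", and a nested span ends at the first closing tag, so
legitimate mail that wraps visible text in spans would lose tokens too.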

Enjoy your experimentation ;->

HTH,

David

-------------- next part --------------
Index: lexer_v3.l
===================================================================
RCS file: /cvsroot/bogofilter/bogofilter/src/lexer_v3.l,v
retrieving revision 1.162
diff -u -r1.162 lexer_v3.l
--- src/lexer_v3.l	27 Jun 2005 00:40:48 -0000	1.162
+++ src/lexer_v3.l	18 Jul 2005 21:22:49 -0000
@@ -213,7 +213,7 @@
 VERP		{TOKEN}-{VERPID}-{TOKEN}={TOKEN}@{TOKEN}
 
 %s TEXT HTML BOGO_LEX
-%s HTOKEN HDISCARD SCOMMENT LCOMMENT
+%s HTOKEN HDISCARD SCOMMENT LCOMMENT PCOMMENT
 %s PGP_HEAD PGP_BODY
 
 %%
@@ -297,14 +297,16 @@
 
 <HTML>"<!--"					{ BEGIN SCOMMENT; }
 <HTML>"<!"					{ BEGIN LCOMMENT; }
+<HTML>"<span"					{ BEGIN PCOMMENT; }	/* beginning of "span" comment */
 <HTML>"<"(a|img|font){WHITESPACE}		{ BEGIN HTOKEN; }
 <HTML>"<"					{ BEGIN HDISCARD; }	/* unknown tag */
 
 <HTOKEN>{TOKEN}					{ return TOKEN; }
-<HDISCARD,LCOMMENT,SCOMMENT>{TOKEN}		{ /* discard innards of html tokens and comments */ }
+<HDISCARD,LCOMMENT,SCOMMENT,PCOMMENT>{TOKEN}	{ /* discard innards of html tokens and comments */ }
 
 <HTOKEN,HDISCARD,LCOMMENT>">"			{ BEGIN HTML; }	/* end of tag, loose comment; return to normal html processing */
 <SCOMMENT>"-->"					{ BEGIN HTML; }	/* end of strict comment; return to normal html processing */
+<PCOMMENT>"</span>"				{ BEGIN HTML; }	/* end of "span" comment; return to normal html processing */
 "<"\!DOCTYPE\ HTML\ PUBLIC\ .*">" 		{ BEGIN HTML; }
 
 {IPADDR}					{ return IPADDR;}

