SPAN style="DISPLAY: none" spams

David Relson relson at osagesoftware.com
Mon Jul 18 23:58:55 CEST 2005


On Mon, 18 Jul 2005 14:51:41 -0700
Chris Fortune wrote:

> > Since bogofilter extracts the tokens and ignores most of what's within
> > html angle brackets, the above is effectively the same as:
> >
> >  text section:
> >   lots of innocent text (part 1)
> >   lots of innocent text (part 2)
> >   lots of innocent text (part 3)
> >
> >  html section:
> >   some spam
> 
> 
> In HTML emails, maybe it makes sense to ignore plain text sections?  Any
> ham HTML email will contain identical content in its plain and html
> sections, right?  So testing only the HTML would produce no more and no
> fewer classification errors.  Whereas spam often has different text and
> html content, so ignoring the text section would result in fewer
> classification errors, just as ignoring the content of html comment tags
> does.  The only potential for false positives is ham HTML email where the
> text section is intended to convey the message and the HTML section
> exists but doesn't have a correct copy of the message content, in other
> words, a broken mail client.  What do you think?

One idea that was bandied about long ago was creating a token list for
each MIME part, selecting one such part (for example, the spammiest
one), merging its tokens with those of the message header, then scoring
the resulting set of tokens.
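
To make the idea concrete, here is a rough Python sketch of "tokenize each
part, pick the spammiest, merge with the header, score". It is not
bogofilter code: the regex tokenizer, the "head:" prefix, the `spam_prob`
table and the mean-probability score are all assumptions made up for
illustration, and it ignores nested messages, odd encodings and everything
else a real lexer has to handle.

```python
import email
import re

# Assumption: a crude stand-in for a real lexer (words of 3+ alphanumerics).
TOKEN_RE = re.compile(r"[A-Za-z0-9]{3,}")


def tokenize(text):
    """Return the set of lowercase word-like tokens in text."""
    return {t.lower() for t in TOKEN_RE.findall(text)}


def part_tokens(msg):
    """Return one token set per text/* leaf part, in message order."""
    result = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skips multipart containers and attachments
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "latin-1"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:  # unknown charset name
            text = payload.decode("latin-1")
        if part.get_content_subtype() == "html":
            # Drop everything inside angle brackets, keep the visible text.
            text = re.sub(r"<[^>]*>", " ", text)
        result.append(tokenize(text))
    return result


def header_tokens(msg):
    """Tokens from all header values, tagged with a hypothetical prefix."""
    return {"head:" + t for _, value in msg.items() for t in tokenize(str(value))}


def spamminess(tokens, spam_prob):
    """Hypothetical score: mean per-token probability, 0.5 if unknown."""
    if not tokens:
        return 0.5
    return sum(spam_prob.get(t, 0.5) for t in tokens) / len(tokens)


def score_spammiest_part(raw, spam_prob):
    """Score the header tokens merged with the spammiest body part."""
    msg = email.message_from_string(raw)
    parts = part_tokens(msg) or [set()]
    worst = max(parts, key=lambda toks: spamminess(toks, spam_prob))
    return spamminess(worst | header_tokens(msg), spam_prob)
```

With a multipart/alternative message whose text part is innocent and whose
html part is spam, `score_spammiest_part` scores only the html tokens plus
the header, so the innocent text no longer dilutes the result.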

Unfortunately, it's not as easy as it might seem :-<

Regards,

David

P.S.  The patch I sent out earlier spells "span" as "scan" in two places :-<


