SPAN style="DISPLAY: none" spams

Chris Fortune cfortune at telus.net
Mon Jul 18 23:51:41 CEST 2005


> Since bogofilter extracts the tokens and ignores most of what's within
> html angle brackets, the above is effectively the same as:
>
>  text section:
>   lots of innocent text (part 1)
>   lots of innocent text (part 2)
>   lots of innocent text (part 3)
>
>  html section:
>   some spam


In HTML emails, maybe it makes sense to ignore the plain text sections?  Any ham HTML email will contain identical content in its
plain and html sections, right?  So testing only the HTML would produce no more and no fewer classification errors.  Spam, on the
other hand, often has different text and html content, so ignoring the text section would result in fewer classification errors,
just as ignoring the content of html comment tags does.  The only potential for false positives is ham HTML email where the text
section is intended to convey the message and the HTML section exists but doesn't carry a correct copy of the message content; in
other words, the output of a broken mail client.
What do you think?
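
To make the proposal concrete, here is a rough sketch of the selection rule in Python, using only the standard library's email package. It is not how bogofilter works (bogofilter is written in C and has its own MIME handling); the function name texts_to_score and the SAMPLE message are made up for illustration. The rule: if a message has any text/html part, hand only those parts to the tokenizer; otherwise fall back to the text/plain parts.

```python
from email import message_from_string

# Hypothetical sample: a multipart/alternative spam whose plain part is
# innocent padding and whose html part carries the real payload.
SAMPLE = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: text/plain\n"
    "\n"
    "lots of innocent text\n"
    "--XX\n"
    "Content-Type: text/html\n"
    "\n"
    "<html><body>some spam</body></html>\n"
    "--XX--\n"
)


def _decoded(part):
    """Decode one leaf part to str, tolerating unknown or missing charsets."""
    raw = part.get_payload(decode=True) or b""
    charset = part.get_content_charset() or "utf-8"
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:  # charset name Python does not recognise
        return raw.decode("utf-8", errors="replace")


def texts_to_score(msg):
    """Return the bodies a filter would tokenize under the proposed rule.

    Assumption for this sketch: when at least one text/html part exists,
    every text/plain part is ignored; otherwise the plain parts are used.
    """
    html, plain = [], []
    for part in msg.walk():  # walk() also yields a non-multipart message itself
        ctype = part.get_content_type()
        if ctype == "text/html":
            html.append(_decoded(part))
        elif ctype == "text/plain":
            plain.append(_decoded(part))
    return html if html else plain


if __name__ == "__main__":
    for body in texts_to_score(message_from_string(SAMPLE)):
        print(body)
```

With SAMPLE, only the html body reaches the tokenizer, so the innocent padding in the plain part no longer dilutes the score. The broken-mail-client case mentioned above is exactly where this rule would go wrong: the real message sits in the plain part and gets dropped.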



