HTML again

David Relson relson at osagesoftware.com
Thu May 8 15:15:30 CEST 2003


At 09:05 AM 5/8/03, Jeff Kinz wrote:

>On Thu, May 08, 2003 at 08:11:32AM -0400, David Relson wrote:
> > At 05:15 AM 5/8/03, Boris 'pi' Piwinger wrote:
> > >Today I received several mails in "HTML" which were not
> > >detected. bogolexer shows why. I attach a ZIP file so that
> > >your filter does not see it.
> > Yuck!  The message is full of invalid html tags.  Bogofilter treats them as
> > <br>, while galeon (mozilla) discards them.  Guess it's time to extend the
> > processing of html tags so bogofilter's parsing matches mozilla's.
>
>Is there any possibility that the configuration of invalid HTML tags would be
>valid data for Bogo to do scoring on?
>
>Come to think of it - What about valid HTML? Wouldn't certain patterns of
>those also be good markers for spam/not-spam?

Very possibly so.  The problem with invalid tags is that there are so many 
of them.  A spammer could use a different one every time :-(

The newest versions of bogofilter can be directed to return the "innards" 
of HTML tags as tokens for scoring.  The code is not quite right yet: 
"prob<junk>lem" returns 3 tokens when 2 would be correct.
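
To make the difference concrete, here is a rough sketch in Python.  It is
not bogofilter's lexer (that is written in C), and the function names and
regular expressions are made up for illustration.  It also assumes that the
2 correct tokens for "prob<junk>lem" are "problem" and "junk", which the
message above does not actually spell out.

```python
import re

# Hypothetical patterns for this sketch only; bogofilter's real lexer differs.
TAG_RE = re.compile(r"<([^<>]*)>")      # anything between angle brackets
WORD_RE = re.compile(r"[A-Za-z0-9]+")   # a very simple notion of a "word"


def tokens_tag_as_break(text):
    """Treat every tag like <br>: it becomes whitespace and splits words."""
    return WORD_RE.findall(TAG_RE.sub(" ", text))


def tokens_tag_discarded(text):
    """Drop tags entirely, so the text on either side joins up."""
    return WORD_RE.findall(TAG_RE.sub("", text))


def tokens_with_innards(text):
    """Discard tags from the text, then append the words found inside them."""
    innards = []
    for match in TAG_RE.finditer(text):
        innards.extend(WORD_RE.findall(match.group(1)))
    return tokens_tag_discarded(text) + innards


if __name__ == "__main__":
    sample = "prob<junk>lem"
    print(tokens_tag_as_break(sample))
    print(tokens_tag_discarded(sample))
    print(tokens_with_innards(sample))
```

With the tag treated as a break, the word is split in two, which is how an
invalid tag can hide "problem" from the filter.  Discarding the tag rejoins
the word, and returning the innards as well adds the tag's contents as a
separate token for scoring.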




