HTML again
David Relson
relson at osagesoftware.com
Thu May 8 15:15:30 CEST 2003
At 09:05 AM 5/8/03, Jeff Kinz wrote:
>On Thu, May 08, 2003 at 08:11:32AM -0400, David Relson wrote:
> > At 05:15 AM 5/8/03, Boris 'pi' Piwinger wrote:
> > >Today I received several mails in "HTML" which were not
> > >detected. bogolexer shows why. I attach a ZIP file so that
> > >your filter does not see it.
> > Yuck! The message is full of invalid html tags. Bogofilter treats them as
> > <br>, while galeon (mozilla) discards them. Guess it's time to extend the
> > processing of html tags so bogofilter's parsing matches mozilla's.
>
>Is there any possibility that the configuration of invalid HTML tags would be
>valid data for Bogo to do scoring on?
>
>Come to think of it - What about valid HTML? Wouldn't certain patterns of
>those also be good markers for spam/not-spam?

Very possibly so. The problem with invalid tags is that there are so many
of them. A spammer could use a different one every time :-(

The newest versions of bogofilter can be directed to return the "innards"
of html tags for scoring. The code is not quite right yet, as
"prob<junk>lem" returns 3 tokens when 2 would be correct.