It's getting worse

David Relson relson at osagesoftware.com
Fri Mar 28 20:09:40 CET 2003


At 01:24 PM 3/28/03, Boris 'pi' Piwinger wrote:

>David Relson <relson at osagesoftware.com> wrote:
>
> >I see the problem.  The message is filled with invalid html tags, like
> ><!6502> and <!21722>, and bogofilter treats them as spaces.
>
>Exactly.
>
> >The results would likely be much improved if bogofilter distinguished
> >between correct html comments, i.e. "<!--" ... "-->", and bogus html
> >comments that begin "<!" but don't have the "--".
>
>Don't you mean "not distinguish", i.e., simply drop
><!$WHATEVER>?

pi,

That just reopens the discussion of how bogofilter should handle html tags
and comments.

In a properly formed comment, most anything can appear between the pairs of
dashes - even angle brackets.  So the following is fine "<!-- <normal>
or >reversed< order -->".  If we just handle "<!whatever>", which ends at
the first ">", the parsing is much different.
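
A minimal sketch of that difference, in Python rather than bogofilter's own
lexer.  The regexes and the name strip_comments are assumptions made for
illustration, not anything bogofilter contains.  A proper comment runs to the
first "-->", while a bogus "<!whatever>" runs only to the first ">".

```python
import re

# Well-formed comment: "<!--" up to the first "-->".
# Angle brackets are allowed in between.
PROPER_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

# Bogus comment: "<!" NOT followed by "--", up to the first ">".
# Caveat: this would also swallow declarations such as "<!DOCTYPE ...>".
BOGUS_COMMENT = re.compile(r"<!(?!--)[^>]*>")


def strip_comments(text):
    """Drop both kinds of comment without inserting a space.

    Dropping (rather than replacing with a space) is an assumption here;
    it lets the pieces around a bogus comment rejoin into one word.
    """
    text = PROPER_COMMENT.sub("", text)
    return BOGUS_COMMENT.sub("", text)
```

With this, strip_comments("vi<!6502>ag<!21722>ra") rejoins the fragments into
a single word, and the proper comment above is removed whole instead of being
cut at the ">" of "<normal>".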

There's also the distinction between html tags that cause breaks and those 
that don't.  "long<br>text" should give two tokens, but "long<font...>text" 
should give one.
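
As a rough sketch of that rule (Python, with a hypothetical and certainly
incomplete list of breaking tags; none of these names come from bogofilter),
a breaking tag becomes a space and any other tag simply disappears:

```python
import re

# Hypothetical, incomplete set of tags that separate words when rendered.
BREAKING_TAGS = {"br", "p", "div", "td", "tr", "li", "hr"}

# Opening or closing tag; group 1 captures the tag name.
TAG = re.compile(r"<\s*/?\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def replace_tag(match):
    """Return a space for breaking tags and nothing for all others."""
    name = match.group(1).lower()
    return " " if name in BREAKING_TAGS else ""


def tokens(text):
    """Split text on whitespace after the tags have been replaced."""
    return TAG.sub(replace_tag, text).split()
```

So tokens("long<br>text") yields two tokens, and
tokens('long<font color="red">text') yields one.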

Of course color and "eye space" complicate things.  Consider the following:

<html><body bgcolor="blue">
<font color="white">white</font>
<font color="blue">blue</font>
<font color="red">red</font>
</body></html>

The word "blue" is rendered in the background color, hence is effectively
whitespace.  If bogofilter's goal is to tokenize what the eye sees ("eye
space"), then "blue" is not a token to process.
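
To make the idea concrete, here is a small sketch using Python's standard
html.parser module.  The class VisibleText and the function visible_words
are made up for this example, and it only compares color names as literal
strings ("blue" vs "#0000ff" would not match), so it illustrates the idea
rather than proposing how bogofilter should do it.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect words whose font color differs from the body bgcolor."""

    def __init__(self):
        super().__init__()
        self.bgcolor = None   # value of <body bgcolor=...>, lowercased
        self.colors = []      # stack of font colors currently in effect
        self.words = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "body":
            self.bgcolor = (attrs.get("bgcolor") or "").lower() or None
        elif tag == "font":
            inherited = self.colors[-1] if self.colors else None
            color = (attrs.get("color") or "").lower() or inherited
            self.colors.append(color)

    def handle_endtag(self, tag):
        if tag == "font" and self.colors:
            self.colors.pop()

    def handle_data(self, data):
        current = self.colors[-1] if self.colors else None
        if current is not None and current == self.bgcolor:
            return  # text drawn in the background color: treat as whitespace
        self.words.extend(data.split())


def visible_words(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.words
```

Fed the message above, visible_words keeps "white" and "red" and skips
"blue".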

It's an interesting (difficult) problem just figuring out what _ought_ to be
done.

David





