It's getting worse
David Relson
relson at osagesoftware.com
Fri Mar 28 20:09:40 CET 2003
At 01:24 PM 3/28/03, Boris 'pi' Piwinger wrote:
>David Relson <relson at osagesoftware.com> wrote:
>
> >I see the problem. The message is filled with invalid html tags, like
> ><!6502> and <!21722>, and bogofilter treats them as spaces.
>
>Exactly.
>
> >The results would likely be much improved if bogofilter distinguished
> >between correct html comments, i.e. "<!--" ... "-->", and bogus html
> >comments that begin "<!" but don't have the "--".
>
>Don't you mean "not distinguish", i.e., simply drop
><!$WHATEVER>?
pi,
That just reopens the discussion of how bogofilter should handle html tags
and comments.
In a properly formed comment, most anything can appear between the pairs of
dashes - even angle brackets. So the following is fine "<!-- <normal>
or >reversed< order -->". If we just handle "<!whatever>" the parsing is
much different.
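
A minimal sketch of that distinction in Python, not bogofilter's actual lexer. It assumes both kinds of comment are dropped without leaving a space (as pi suggests for the bogus ones); the names PROPER_COMMENT, BOGUS_COMMENT and strip_comments are made up for illustration.

```python
import re

# Well-formed comment: "<!--" ... "-->". Anything, including angle
# brackets, may appear before the closing "-->".
PROPER_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

# Bogus comment: "<!" NOT followed by "--", ending at the first ">".
BOGUS_COMMENT = re.compile(r"<!(?!--)[^>]*>")


def strip_comments(text):
    """Drop proper and bogus comments (assumption: no space is left behind)."""
    # Proper comments go first, since they need the "-->" terminator
    # rather than the first ">".
    text = PROPER_COMMENT.sub("", text)
    return BOGUS_COMMENT.sub("", text)
```

With this, "vi<!6502>ag<!21722>ra" collapses to "viagra", and "a<!-- <normal> or >reversed< order -->b" collapses to "ab".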
There's also the distinction between html tags that cause breaks and those
that don't. "long<br>text" should give two tokens, but "long<font...>text"
should give one.
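
The break/no-break distinction could be sketched like this in Python. The BREAK_TAGS set is a hypothetical, incomplete list chosen for illustration, and the regular expression is a rough stand-in for real tag parsing, not what bogofilter does.

```python
import re

# Hypothetical, incomplete set of tags assumed to separate words.
BREAK_TAGS = {"br", "p", "div", "td", "tr", "li"}

# Matches an opening or closing tag and captures its name.
TAG = re.compile(r"<\s*/?\s*([A-Za-z][A-Za-z0-9]*)[^>]*>")


def tokens(html):
    """Split on whitespace after replacing tags by a space or by nothing."""
    def replace_tag(match):
        # Breaking tags become a space; all other tags simply vanish,
        # so the text on either side is joined.
        return " " if match.group(1).lower() in BREAK_TAGS else ""
    return TAG.sub(replace_tag, html).split()
```

Here tokens("long<br>text") gives ["long", "text"], while tokens("long<font...>text") gives ["longtext"].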
Of course color and "eye space" complicate things. Consider the following:
<html><body bgcolor="blue">
<font color="white">white</font>
<font color="blue">blue</font>
<font color="red">red</font>
</body></html>
The word "blue" is rendered in bgcolor, hence is effectively
whitespace. If bogofilter's goal is eye-space, then "blue" is not a token
to process.
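
A rough Python sketch of the eye-space idea, using the standard html.parser module. It assumes color comes only from <body bgcolor=...> and <font color=...> and that color names are compared as plain lowercase strings (so "blue" and "#0000ff" would not match); VisibleText and visible_tokens are illustrative names, not part of bogofilter.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect words whose font color differs from the body background."""

    def __init__(self):
        super().__init__()
        self.bgcolor = None   # value of <body bgcolor=...>, lowercased
        self.colors = []      # stack of colors for the open <font> tags
        self.words = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "body":
            self.bgcolor = (attrs.get("bgcolor") or "").lower() or None
        elif tag == "font":
            inherited = self.colors[-1] if self.colors else None
            color = (attrs.get("color") or "").lower() or inherited
            self.colors.append(color)

    def handle_endtag(self, tag):
        if tag == "font" and self.colors:
            self.colors.pop()

    def handle_data(self, data):
        color = self.colors[-1] if self.colors else None
        # Text drawn in the background color is treated as whitespace.
        if color is not None and color == self.bgcolor:
            return
        self.words.extend(data.split())


def visible_tokens(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.words
```

Fed the example above, visible_tokens returns ["white", "red"]; "blue" is skipped because its color equals bgcolor.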
It's an interesting (difficult) problem just figuring out what _ought_ to be done.
David