HTML parsing

Wed Nov 26 16:06:52 CET 2003

Let us compare the possible combinations:

current bogofilter (parse html according to content-type):
- correct client html parsing -- client hides tags & comments, and
bogofilter ignores them -- good
- incorrect client html parsing -- client shows tags & comments, but
bogofilter ignores them -- bad
- correct client text parsing -- client shows tags & comments, and
bogofilter ranks them -- good
- incorrect client text parsing -- client hides tags & comments, but
bogofilter ranks them -- bad

bogofilter always parses html:
- correct client html parsing -- client hides tags & comments, and
bogofilter ignores them -- good
- incorrect client html parsing -- client shows tags & comments, but
bogofilter ignores them -- bad
- correct client text parsing -- client shows tags & comments, but
bogofilter ignores them -- bad
- incorrect client text parsing -- client hides tags & comments, and
bogofilter also ignores them -- good

bogofilter never parses html:
- correct client html parsing -- client hides tags & comments, but
bogofilter ranks them -- bad
- incorrect client html parsing -- client shows tags & comments, and
bogofilter ranks them -- good
- correct client text parsing -- client shows tags & comments, and
bogofilter ranks them -- good
- incorrect client text parsing -- client hides tags & comments, but
bogofilter ranks them -- bad

So each of the three ways for bogofilter to function ends up with the
same number of good and bad scenarios, provided clients are evenly
split in how they parse html and plain-text.  The last two, however,
reward incorrect parsing.  Do enough clients parse incorrectly to make
it worth breaking standards-compliant clients?

Also note that the third scenario is the only one that provides a "good"
result for people who purposefully turn off HTML parsing or use a
text-only client.
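
For anyone who wants to check the tallies above mechanically, here is a
minimal Python sketch.  It assumes a scenario is "good" exactly when
bogofilter and the client agree on whether tags & comments are visible.
The policy names and function names are made up for illustration; none
of this is bogofilter code.

```python
from itertools import product

# Hypothetical labels for the three ways bogofilter could behave.
POLICIES = ("content-type", "always", "never")


def bogofilter_ignores_tags(policy, declared_html):
    """True if bogofilter strips tags & comments under this policy."""
    if policy == "always":
        return True
    if policy == "never":
        return False
    # "content-type": parse html only when the part is declared text/html.
    return declared_html


def is_good(policy, declared_html, client_hides_tags):
    """Good means bogofilter scores what the reader actually sees."""
    return bogofilter_ignores_tags(policy, declared_html) == client_hides_tags


def good_count(policy):
    """Good scenarios out of the four (declared type x client behaviour)."""
    return sum(
        is_good(policy, declared_html, hides)
        for declared_html, hides in product((True, False), repeat=2)
    )


def good_for_correct_clients(policy):
    """A correct client hides tags exactly when the part is text/html."""
    return sum(is_good(policy, declared, declared) for declared in (True, False))


def good_for_text_only(policy):
    """A text-only client shows tags whatever the declared type is."""
    return sum(is_good(policy, declared, False) for declared in (True, False))


if __name__ == "__main__":
    for policy in POLICIES:
        print(
            policy,
            "good:", good_count(policy),
            "good for correct clients:", good_for_correct_clients(policy),
            "good for text-only clients:", good_for_text_only(policy),
        )
```

Under that assumption every policy scores two good scenarios out of
four, only the content-type policy is good for both kinds of correct
client, and only the never-parse policy is good for a text-only client
in both cases.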

If various clients do it differently, then I suggest we go with the
standard (i.e., only parse text/html).  My hunch is that more and more
clients will become standardized as the stricter xhtml and xml start
being used more often.  This is already true in the browsers.
Proprietary interpretations of standards just aren't acceptable
anymore.  We should expect that bogofilter won't work as well on
non-standards-compliant clients.  Let's not break it for the correct
ones just to work around the broken ones.

BTW, Evolution correctly parses plain-text and html.

Tom

On Wed, 2003-11-26 at 07:36, Boris 'pi' Piwinger wrote:
> Boris 'pi' Piwinger wrote:
> 
> > How about always doing HTML parsing?
> 
> I just recall that Eudora by default tries to evaluate every
> HTML it encounters (in text/plain!) and hence does not
> display it. Can someone say what Outlook or Outlook Express do?
> 
> pi