"make check" fails on hp-ux

Sun Nov 24 23:16:01 CET 2002

At 05:05 PM 11/24/02, Matthias Andree wrote:

>On Sun, 24 Nov 2002, David Relson wrote:
>
> > Notice that at byte positions 15 and 66 the character value is 0x92.
> > linux shows this as an apostrophe and lexer.l accepts it as a valid part
> > of a token.  On hp-ux this character is evidently rejected, so the tokens
> > that bogofilter sees are slightly different.
>
>What was the original character set declaration? iso-8859-1, you say,
>when it should have been Windows-1252? We'd catch loads of broken
>mailers of this particular kind if we checked for 0x80 (€ in
>Windows-1252, unprintable in iso-8859-1) or typographic quotes in
>Windows-1252 places when iso-8859-1 is declared. Windows users won't
>notice, because the fault is symmetric and Outlook is clueless (as are
>many Windows users, evidently).

The message is in file tests/t.systest.d/inputs/msg.3.txt and is:

         Content-Type: text/html; charset="us-ascii"
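
A minimal sketch of the check suggested above, in Python rather than in lexer.l: flag bytes in the 0x80-0x9F range when the declared charset is us-ascii or iso-8859-1, since those positions are only printable in Windows-1252. The function name and the sample bytes are made up for illustration; they are not taken from msg.3.txt or from bogofilter.

```python
def suspicious_c1_bytes(body, declared_charset):
    """Return (offset, byte) pairs for bytes in 0x80-0x9F.

    Only reports when the declared charset is one where those byte
    values are not printable (us-ascii, iso-8859-1).
    """
    charset = declared_charset.strip().strip('"').lower()
    if charset not in {"us-ascii", "iso-8859-1"}:
        return []
    return [(i, b) for i, b in enumerate(body) if 0x80 <= b <= 0x9F]


# Hypothetical sample: 0x92 is the typographic apostrophe in Windows-1252.
sample = b"Don\x92t miss it"
hits = suspicious_c1_bytes(sample, '"us-ascii"')
for offset, value in hits:
    print(f"offset {offset}: 0x{value:02x}")
```

A message that trips this check was most likely written as Windows-1252 and mislabeled.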

>Which poses an interesting question: we need "similarity" matching
>rather than exact match. apostrophe, aigu and on some keyboards the
>grave or backtick (' á à ` in that order, I don't have the regular aigu
>here) are used synonymously. As are double-single-quote and
>double-quote: '' and ". As are non-breaking space and regular space. You name some
>more ;-) Seriously, we'd need some simple mechanism to treat these the
>same in so far as we consider them parts of our tokens.
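
For reference, the "simple mechanism" could look roughly like the sketch below: a translation table that folds the lookalike characters into one representative before tokenizing. This is Python for illustration only, it assumes the text has already been decoded to Unicode, and the table contents and the name normalize are hypothetical, not anything bogofilter does.

```python
# Map lookalike characters to one representative each (illustrative list).
QUOTE_LIKE = {
    "\u2019": "'",   # right single quotation mark (0x92 in Windows-1252)
    "\u2018": "'",   # left single quotation mark
    "\u00b4": "'",   # acute accent
    "`": "'",        # grave accent / backtick
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u00a0": " ",   # non-breaking space
}
TABLE = str.maketrans(QUOTE_LIKE)


def normalize(text):
    """Fold lookalikes, then treat two single quotes as one double quote."""
    return text.translate(TABLE).replace("''", '"')


print(normalize("don\u2019t"))
print(normalize("don`t"))
```

Both calls yield the same token, which is the point of the proposal.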

Do we really need similarity?  Over time bogofilter will learn all the 
spelling variations.  I think it needs to have good information as to which 
characters belong in tokens.  Given that, let the wordlists grow as 
bogofilter gets trained.

>I used 0x80, and depending on the input character set, I get an
>unprintable, €, C-cédille (ç) or the Turkish lookalike (not sure about
>its name) or a dash (KOI8-R). It's not as simple as that.
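
The point is easy to reproduce. The sketch below decodes the single byte 0x80 under a few charsets using Python's standard codecs; the list of charsets is just a sample, and the helper name interpretations is made up for this example.

```python
BYTE = bytes([0x80])
CHARSETS = ["ascii", "latin-1", "cp1252", "cp437", "koi8-r"]


def interpretations(raw, charsets):
    """Decode raw under each charset; None where the byte is invalid."""
    result = {}
    for name in charsets:
        try:
            result[name] = raw.decode(name)
        except UnicodeDecodeError:
            result[name] = None
    return result


table = interpretations(BYTE, CHARSETS)
for name, ch in table.items():
    print(f"{name:8} {ch!r}")
```

The same byte comes out as invalid, a control character, the euro sign, a capital C-cedilla or a box-drawing dash, so any fixed per-byte rule in the lexer is only right for one charset.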

Perhaps the temporary solution is to modify the test messages so that they 
don't have the problem characters (0x92, etc).  When we have a reasonable 
charset handler, then we can treat those characters in a better way.