"make check" fails on hp-ux
David Relson
relson at osagesoftware.com
Sun Nov 24 23:16:01 CET 2002
At 05:05 PM 11/24/02, Matthias Andree wrote:
>On Sun, 24 Nov 2002, David Relson wrote:
>
> > Notice that at byte positions 15 and 66 the character value is 0x92.
> > linux shows this as an apostrophe and lexer.l accepts it as a valid part
> > of a token. On hp-ux this character is evidently rejected, so the tokens
> > that bogofilter sees are slightly different.
>
>What was the original character set declaration? iso-8859-1, you say,
>when it should have been Windows-1252? We'd catch loads of broken
>mailers of this particular kind if we checked for 0x80 (€ in
>Windows-1252, unprintable in iso-8859-1) or typographic quotes in
>Windows-1252 places when iso-8859-1 is declared. Windows users won't
>notice, because the fault is symmetric and Outlook is clueless (as are
>many Windows users, evidently).
The message is in file tests/t.systest.d/inputs/msg.3.txt and its declaration is:
Content-Type: text/html; charset="us-ascii"
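
For illustration, here is a minimal sketch of how the offending bytes can be located. It is Python rather than anything in the bogofilter tree, and the names `find_high_bytes` and `looks_like_cp1252` as well as the sample bytes are made up for this example. It reports every byte that cannot be us-ascii together with what Windows-1252 would make of it, and applies the "0x80-0x9F present" heuristic suggested above.

```python
def find_high_bytes(data: bytes):
    """Return (offset, byte value, cp1252 reading) for every byte >= 0x80."""
    hits = []
    for offset, value in enumerate(data):
        if value >= 0x80:
            try:
                as_cp1252 = bytes([value]).decode("cp1252")
            except UnicodeDecodeError:
                # A few values (0x81, for example) are undefined in cp1252.
                as_cp1252 = None
            hits.append((offset, value, as_cp1252))
    return hits


def looks_like_cp1252(data: bytes) -> bool:
    """Heuristic only: bytes 0x80-0x9F are C1 control codes in iso-8859-1,
    so finding them in a text message hints at Windows-1252 content."""
    return any(0x80 <= b <= 0x9F for b in data)


# Hypothetical sample, not the contents of msg.3.txt.
sample = b"Don\x92t miss out"
for offset, value, char in find_high_bytes(sample):
    print(f"offset {offset}: 0x{value:02x} cp1252={char!r}")
print("probably Windows-1252:", looks_like_cp1252(sample))
```

Run against the real message, the same scan should list the 0x92 bytes mentioned earlier, none of which are valid under the declared us-ascii.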
>Which poses an interesting question: we need "similarity" matching
>rather than exact match. apostrophe, aigu and on some keyboards the
>grave or backtick (' á à ` in that order, I don't have the regular aigu
>here) are used synonymously. As are double-single-quote and
>double-quote: '' and ". As are non-breaking space and regular space. You name some
>more ;-) Seriously, we'd need some simple mechanism to treat these the
>same in so far as we consider them parts of our tokens.
Do we really need similarity? Over time bogofilter will learn all the
spelling variations. I think it needs to have good information as to which
characters belong in tokens. Given that, let the wordlists grow as
bogofilter gets trained.
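
To make the point about token characters concrete, here is a small Python sketch. The character class below is an assumption chosen for illustration and is NOT the pattern in lexer.l; `tokenize` and its `extra_bytes` parameter are hypothetical names. It shows how accepting or rejecting 0x92 changes the tokens, which is the kind of difference seen between linux and hp-ux.

```python
import re


def tokenize(data: bytes, extra_bytes: bytes = b"") -> list:
    """Split data into tokens made of ASCII letters, digits, apostrophe,
    underscore and hyphen, plus any bytes listed in extra_bytes."""
    # The extras go first so the trailing hyphen stays a literal hyphen
    # instead of forming a range inside the character class.
    char_class = re.escape(extra_bytes) + b"A-Za-z0-9'_-"
    return re.findall(b"[" + char_class + b"]+", data)


# Hypothetical sample, not taken from the test inputs.
sample = b"Don\x92t miss out"

print(tokenize(sample))                       # 0x92 rejected: it splits the word
print(tokenize(sample, extra_bytes=b"\x92"))  # 0x92 accepted: the word stays whole
```

With 0x92 rejected the word comes out as two tokens (`Don` and `t`); with it accepted there is a single token containing the raw byte. Either choice is learnable over time, as long as it is the same on every platform.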
>I used 0x80, and depending on the input character set, I get an
>unprintable, €, C-cédille (ç) or the Turkish lookalike (not sure about
>its name) or a dash (KOI8-R). It's not as simple as that.
Perhaps the temporary solution is to modify the test messages so that they
don't contain the problem characters (0x92, etc.). Once we have a reasonable
charset handler, we can treat those characters in a better way.
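
A rough sketch of such a cleanup, in Python. The byte-to-ASCII mapping is an assumption (it treats the bytes as Windows-1252 punctuation) and `sanitize` is a hypothetical helper, not an existing script in the source tree.

```python
# Assumed mapping: Windows-1252 typographic punctuation to plain ASCII.
#   0x91, 0x92  single quotes      -> '
#   0x93, 0x94  double quotes      -> "
#   0x96, 0x97  en and em dash     -> -
#   0xA0        non-breaking space -> regular space
_TABLE = bytes.maketrans(
    b"\x91\x92\x93\x94\x96\x97\xa0",
    b"''\"\"-- ",
)


def sanitize(data: bytes) -> bytes:
    """Replace the mapped bytes with ASCII look-alikes; leave the rest alone."""
    return data.translate(_TABLE)


# Hypothetical sample, not the contents of a real test message.
print(sanitize(b"Don\x92t miss \x93this\x94 offer"))
```

Because each byte is replaced by exactly one byte, the message length and byte offsets stay the same. Bytes outside the table, such as 0x80, pass through untouched and would still need attention.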