"make check" fails on hp-ux
Matthias Andree
matthias.andree at gmx.de
Sun Nov 24 23:05:22 CET 2002
On Sun, 24 Nov 2002, David Relson wrote:
> Notice that at byte positions 15 and 66 the character value is 0x92.
> linux shows this as an apostrophe and lexer.l accepts it as a valid part
> of a token. On hp-ux this character is evidently rejected, so the tokens
> that bogofilter sees are slightly different.
What was the original character set declaration? iso-8859-1, you say,
when it should have been Windows-1252? We'd catch loads of broken
mailers of this particular kind if we checked for 0x80 (€ in
Windows-1252, an unprintable control character in iso-8859-1) or for
typographic quotes at their Windows-1252 positions (0x91 to 0x94) when
iso-8859-1 is declared. Windows users won't notice, because the fault is
symmetric and Outlook is clueless (as are many Windows users, evidently).
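
A minimal sketch of such a check, assuming the message body is available
as a byte buffer and the declared charset is already known to be
iso-8859-1. The name looks_like_cp1252() is hypothetical and not part of
bogofilter; it only flags 0x80 and the quote positions 0x91 to 0x94
mentioned above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not bogofilter code.
 * Returns true if buf contains 0x80 or one of the Windows-1252
 * typographic quotes (0x91..0x94). In iso-8859-1 these byte values are
 * C1 control characters, so finding them in text declared as iso-8859-1
 * suggests the mailer really sent Windows-1252. */
static bool looks_like_cp1252(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];
        if (c == 0x80 || (c >= 0x91 && c <= 0x94))
            return true;
    }
    return false;
}
```

A caller would run this only on parts declared as iso-8859-1 and then
decide whether to treat the part as Windows-1252 instead.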
> My instinct is to modify yyinput() so that it translates 0x92 to
> apostrophe. I've attached a patch. In addition to translating 0x92 to
> apostrophe, it also translates 0xA0 (known as the "no-break space") to
> 0x20 (a space).
>
> Unfortunately, this change changes the reference results, as token
> "it\x92s" changes from an unknown token to matching "it's", with
> corresponding spamicity change from 0.415000 to 0.237638. I'll regenerate
> the reference results.
Which poses an interesting question: we need "similarity" matching
rather than exact matching. The apostrophe, the acute accent (aigu) and,
on some keyboards, the grave accent or backtick (' á à ` in that order, I
don't have the regular aigu here) are used synonymously. As are
double-single-quote and double-quote: '' and ". As are no-break space and
regular space. You name some more ;-) Seriously, we'd need some simple
mechanism to treat these the same in so far as we consider them parts of
our tokens.
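
One simple mechanism would be a per-byte folding step applied before
tokenizing. The sketch below is an illustration only, not the attached
patch and not bogofilter's yyinput(): fold_lookalike() and fold_buffer()
are hypothetical names, and the mapping assumes the input bytes are
Windows-1252 (or iso-8859-1 for 0xA0 and 0xB4). It does not collapse the
two-character '' into ".

```c
#include <stddef.h>

/* Hypothetical helper: map look-alike bytes to one canonical ASCII
 * character. Assumes Windows-1252 / iso-8859-1 input; in other
 * character sets the same byte values mean something else. */
static unsigned char fold_lookalike(unsigned char c)
{
    switch (c) {
    case 0x60:          /* grave accent / backtick */
    case 0x91:          /* Windows-1252 left single quote */
    case 0x92:          /* Windows-1252 right single quote */
    case 0xB4:          /* acute accent */
        return '\'';
    case 0x93:          /* Windows-1252 left double quote */
    case 0x94:          /* Windows-1252 right double quote */
        return '"';
    case 0xA0:          /* no-break space */
        return ' ';
    default:
        return c;       /* everything else passes through unchanged */
    }
}

/* Fold a whole buffer in place. */
static void fold_buffer(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = fold_lookalike(buf[i]);
}
```

With this folding, the bytes of "it\x92s" become "it's", so both
spellings would end up as the same token.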
I used 0x80, and depending on the input character set, I get an
unprintable, €, C-cédille (ç) or the Turkish lookalike (not sure about
its name) or a dash (KOI8-R). It's not as simple as that.
--
Matthias Andree