"make check" fails on hp-ux

Matthias Andree matthias.andree at gmx.de
Sun Nov 24 23:05:22 CET 2002


On Sun, 24 Nov 2002, David Relson wrote:

> Notice that at byte positions 15 and 66 the character value is 0x92.  
> linux shows this as an apostrophe and lexer.l accepts it as a valid part 
> of a token.  On hp-ux this character is evidently rejected, so the tokens 
> that bogofilter sees are slightly different.

What was the original character set declaration? iso-8859-1, you say,
when it should have been Windows-1252? We'd catch loads of broken
mailers of this particular kind if we checked for 0x80 (€ in
Windows-1252, an unprintable control character in iso-8859-1) or for
typographic quotes at their Windows-1252 code points when iso-8859-1 is
declared. Windows users won't
notice, because the fault is symmetric and Outlook is clueless (as are
many Windows users, evidently).
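
A minimal sketch of such a check, in C. It assumes the message body is
already available as a byte buffer and that the declared charset was
iso-8859-1; the function name has_c1_bytes() is hypothetical and not
part of bogofilter. In iso-8859-1 the range 0x80..0x9F holds only
control characters, so any byte there hints that the text is really
Windows-1252.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not bogofilter code): returns 1 if buf contains
 * any byte in 0x80..0x9F, 0 otherwise.  Those bytes are C1 control
 * characters in iso-8859-1 but printable characters (euro sign,
 * typographic quotes, ...) in Windows-1252.
 */
int has_c1_bytes(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] >= 0x80 && buf[i] <= 0x9F)
            return 1;
    }
    return 0;
}
```

This only flags the mismatch; what to do with a flagged message is a
separate decision.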

> My instinct is to modify yyinput() so that it translates 0x92 to 
> apostrophe.  I've attached a patch.  In addition to translating 0x92 to 
> apostrophe, it also translates 0xA0 (known as the "no-break space") to 
> 0x20 (a space).
> 
> Unfortunately, this change changes the reference results, as token 
> "it\x92s" changes from an unknown token to matching "it's", with 
> corresponding spamicity change from 0.415000 to 0.237638.  I'll regenerate 
> the reference results.

Which raises an interesting point: we need "similarity" matching
rather than exact matching. The apostrophe, the aigu and, on some
keyboards, the grave or backtick (' á à ` in that order; I don't have
the regular aigu here) are used synonymously. So are the double single
quote and the double quote: '' and ". So are the no-break space (0xA0)
and the regular space. You name some more ;-) Seriously, we'd need some
simple mechanism to treat these as the same insofar as we consider them
parts of our tokens.
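
One possible shape for such a mechanism, as a rough C sketch. It
assumes single-byte input in Windows-1252 (or iso-8859-1) and folds the
lookalikes named above onto one canonical character before
tokenizing. The names fold_byte() and fold_lookalikes() are made up
for illustration and do not exist in bogofilter or lexer.l.

```c
#include <stddef.h>

/*
 * Hypothetical: map one byte to its canonical lookalike.
 * Assumes Windows-1252 / iso-8859-1 input; other charsets need
 * their own tables.
 */
static unsigned char fold_byte(unsigned char c)
{
    switch (c) {
    case 0x60:              /* backtick / grave accent */
    case 0x91:              /* Windows-1252 left single quote */
    case 0x92:              /* Windows-1252 right single quote */
    case 0xB4:              /* acute accent (aigu) */
        return '\'';
    case 0x93:              /* Windows-1252 left double quote */
    case 0x94:              /* Windows-1252 right double quote */
        return '"';
    case 0xA0:              /* no-break space */
        return ' ';
    default:
        return c;
    }
}

/*
 * Fold buf in place and collapse two consecutive apostrophes into one
 * double quote.  Returns the new length, which is never larger than len.
 */
size_t fold_lookalikes(unsigned char *buf, size_t len)
{
    size_t in, out = 0;

    for (in = 0; in < len; in++) {
        unsigned char c = fold_byte(buf[in]);

        if (c == '\'' && out > 0 && buf[out - 1] == '\'')
            buf[out - 1] = '"';     /* '' becomes " */
        else
            buf[out++] = c;
    }
    return out;
}
```

With this, the bytes of "it\x92s" fold to "it's". Whether the backtick
and the '' pair should really be folded is a judgment call; the table is
only an example.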

I tried 0x80, and depending on the input character set, I get an
unprintable character, €, C-cédille (ç) or the Turkish lookalike (not
sure about its name), or a dash (KOI8-R). So it's not as simple as that.
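
To make the point concrete, here is a small C sketch that looks up
what the single byte 0x80 means under a few charsets. The table covers
only four charsets I am sure of (in code page 850 the byte is the
capital C with cedilla), the charset names must match exactly in lower
case, and decode_0x80() is a hypothetical name, not a bogofilter
function.

```c
#include <stddef.h>
#include <string.h>

struct charset_0x80 {
    const char *charset;        /* lower-case charset name */
    long codepoint;             /* Unicode code point of byte 0x80 */
};

/* Illustrative table only; real code would need full mapping tables. */
static const struct charset_0x80 table_0x80[] = {
    { "iso-8859-1",   0x0080 }, /* C1 control character, unprintable */
    { "windows-1252", 0x20AC }, /* euro sign */
    { "cp850",        0x00C7 }, /* capital C with cedilla */
    { "koi8-r",       0x2500 }, /* box drawing horizontal line, a "dash" */
};

/* Returns the code point for byte 0x80, or -1 for an unknown charset. */
long decode_0x80(const char *charset)
{
    size_t i;

    for (i = 0; i < sizeof table_0x80 / sizeof table_0x80[0]; i++) {
        if (strcmp(charset, table_0x80[i].charset) == 0)
            return table_0x80[i].codepoint;
    }
    return -1;
}
```

Any folding of lookalike characters therefore has to be keyed on the
declared (or guessed) charset, not on the byte value alone.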

-- 
Matthias Andree


