Problem compacting databases (again!)
Matthias Andree
matthias.andree at gmx.de
Sun Jan 23 23:13:08 CET 2005
David Relson <relson at osagesoftware.com> writes:
>> # bogoutil -d wordlist.db | bogoutil -l wordlist.db.new
>> # bogoutil: Unexpected input [d'informÃ] on line 25173. Expecting
>> whitespace before count.
>>
>> It's the same bug last time (the same word also!).
>>
>> I did as David pointed:
>>
>> # bogoutil -d wordlist.db > wordlist.txt
>> # head -25173 wordlist.txt | tail -1
>> d'informà tica 0 1 20050122
>
> It looks like there's an 0xE0 character in that position.
0xe0 in iso-8859-1[5] but 0xc3 0xa0 in UTF-8. And 0xa0 might be mapped
to 0x20 by some of the charset conversion routines that 0.92.X had, or
might trigger
>
> #include <stdio.h>
> #include <ctype.h>
>
> int main(int argc, char **argv)
> {
> char x = 0xE0;
> printf ("0x%02x %d\n", x, isspace(x));
> return (0);
> }
That code is bogus. Whether plain char is signed or unsigned is
implementation-defined, and is*() arguments _must_ be cast to
(unsigned char). This is to separate the actual character data from
special state markers, namely EOF.
The right code is:
    #include <stdio.h>
    #include <ctype.h>

    int main(int argc, char **argv)
    {
        unsigned char x = 0xe0;
        printf("0x%02x %d\n", x, isspace(x));
        return (0);
    }
This "strange" is*() API is there to allow code to distinguish ÿ (0xFF in
ISO-8859-1 and -15) from EOF. Note that the quoted code can also die on
some systems, because isspace(x) may be implemented as something similar to:
    #define isspace(x) (__ctype_array[(x)+1] & __ctype_f_space)
and 0xE0 stored in a signed char is -0x20, which may cause an invalid array
access. Some operating systems actually implement the is*() functions like
this; I have seen heaps of "array subscript may be negative" warnings from
GCC running on Solaris 8.
--
Matthias Andree