Problem compacting databases (again!)

Matthias Andree matthias.andree at gmx.de
Sun Jan 23 23:13:08 CET 2005


David Relson <relson at osagesoftware.com> writes:

>> # bogoutil -d wordlist.db | bogoutil -l wordlist.db.new
>> # bogoutil: Unexpected input [d'informÃ] on line 25173. Expecting 
>> whitespace before count.
>> 
>> It's the same bug as last time (the same word, too!).
>> 
>> I did as David suggested:
>> 
>> # bogoutil -d wordlist.db > wordlist.txt
>> # head -25173 wordlist.txt | tail -1
>> d'informà tica 0 1 20050122
>
> It looks like there's an 0xE0 character in that position.

That character is 0xe0 in ISO-8859-1 and ISO-8859-15, but 0xc3 0xa0 in
UTF-8. And 0xa0 might be mapped to 0x20 by some of the charset
conversion routines that 0.92.X had, or might trigger 

>
> #include <stdio.h>
> #include <ctype.h>
>
> int main(int argc, char **argv)
> {
>     char x = 0xE0;
>     printf ("0x%02x %d\n", x, isspace(x));
>     return (0);
> }

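That code is bogus. Whether plain char is signed is
implementation-defined, so is*() arguments that come from a char _must_
be cast to (unsigned char). The is*() functions take an int that is
either EOF or a value representable as unsigned char; this is to
separate the actual character data from special state markers, namely
EOF.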

The right code is:

#include <stdio.h>
#include <ctype.h>

int main(int argc, char **argv)
{
	unsigned char x = 0xe0;
	printf("0x%02x %d\n", x, isspace(x));
	return (0);
}
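
Real data usually arrives in plain char buffers rather than in an
unsigned char variable, so the cast then goes at the call site. A
minimal sketch of that idiom follows; count_spaces is a hypothetical
helper made up for illustration, not something from bogofilter:

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: count whitespace bytes in a NUL-terminated
 * buffer of plain char. */
size_t count_spaces(const char *s)
{
	size_t n = 0;

	for (; *s != '\0'; s++) {
		/* Cast so that bytes >= 0x80 reach isspace() as 128..255
		 * and never as negative ints. */
		if (isspace((unsigned char)*s))
			n++;
	}
	return n;
}
```

Called on a token containing the byte 0xe0, this stays within the
argument range isspace() is defined for, whether or not char is signed
on the platform.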

This "strange" is*() API is there to allow code to distinguish ÿ (0xff
in ISO-8859-1 and -15) from EOF. Note that the uncast code can also die
on some systems because isspace(x) may be implemented as something
similar to:

#define isspace(x) (__ctype_array[(x)+1] & __ctype_f_space)

With a signed 8-bit char, 0xE0 becomes -0x20, which may cause an invalid
array access. Some operating systems actually implement the is*()
functions like this; I have seen heaps of "array subscript may be
negative" warnings from GCC running on Solaris 8.
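
To make the arithmetic concrete, here is a rough sketch of the index
such a macro would compute. The table layout is an assumption for
illustration only (257 slots, slot 0 for EOF, slots 1..256 for the
values 0..255), as is EOF being -1; table_index and index_in_bounds are
hypothetical names, not part of any libc:

```c
#include <stdio.h>	/* EOF */

/* Assumed layout: slot 0 is EOF (taken to be -1 here),
 * slots 1..256 hold the flags for character values 0..255. */
#define TABLE_SIZE 257

/* Index that a macro of the form table[(x)+1] would use for c. */
int table_index(int c)
{
	return c + 1;
}

/* Nonzero if that index falls inside the assumed table. */
int index_in_bounds(int c)
{
	int i = table_index(c);

	return i >= 0 && i < TABLE_SIZE;
}
```

Under these assumptions, table_index(-0x20) is -31 and lies outside the
table, while table_index((unsigned char)0xe0) is 225. Likewise
(unsigned char)0xff maps to slot 256 and EOF to slot 0, so ÿ and EOF no
longer collide.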

-- 
Matthias Andree


