Problem compacting databases (again!)
Juan J. Martinez
reidrac at blackshell.usebox.net
Sun Jan 23 23:13:47 CET 2005
On 23/01/05 22:59, David Relson wrote:
> On Sun, 23 Jan 2005 22:20:29 +0100
> Juan J. Martinez wrote:
>>It happened again:
>>
>># bogoutil -d wordlist.db | bogoutil -l wordlist.db.new
>># bogoutil: Unexpected input [d'informÃ] on line 25173. Expecting
>>whitespace before count.
>>
[...]
> It looks like there's an 0xE0 character in that position.
>
> #include <stdio.h>
> #include <ctype.h>
>
> int main(int argc, char **argv)
> {
> char x = 0xE0;
> printf ("0x%02x %d\n", x, isspace(x));
> return (0);
> }
>
>
> Can you compile and run this program? The output I get is
>
> 0xFFFFFFE0 0
>
> I bet you get 0xFFFFFFE0 1 (or something similar).
$ ./main
0xffffffe0 0
> If I'm right, then bogoutil needs a more thorough check than isspace()
> because OpenBSD is doing something unusual.
I don't know... looking into words.txt I see:
d'informació 0 1 20050120
d'informà 0 3 20050120
d'informà tica 0 1 20050122
d'informàtica 0 1 20050119
d'instal·ladors 0 1 20050119
I don't think "d'informÃ" is a real word; it looks like a piece of
"d'informà tica". Do you mean this is a bogoutil bug? Maybe it's handling
a Unicode string the wrong way? isspace() seems to be working right...
Juanjo
PS: resending... next time I'll try to remember to hit 'reply all'. Would
the list admin be so kind as to ignore the other mail from
@usebox.net? Sorry :(
--
Development and Systems: http://usebox.net/
Personal page: http://usebox.net/jjm/