OpenBSD 3.4 isspace() b0rked (was: Problem compacting databases (again!))

Matthias Andree matthias.andree at gmx.de
Mon Jan 24 11:30:16 CET 2005


NB: I've set Mail-Followup-To:, please adjust your recipient list if your
mailer doesn't support this header.

On Mon, 24 Jan 2005, Otto Moerbeek wrote:

> The code determines isspace() assuming ISO8859. 0xa0 is the non-breaking
> space char there.

We are not requesting ISO-8859 treatment or any particular locale; we
are using the POSIX locale, a.k.a. the C locale. Linux with glibc 2.3.3,
FreeBSD 4.11-RC, and Solaris 8 all report isspace(0xA0) == 0.
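
A minimal probe for comparing platforms might look like the sketch
below. The function names are made up for illustration; only
setlocale() and isspace() come from the C library.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <locale.h>

/* Nonzero if isspace() accepts `byte` once LC_CTYPE is set to "C".
 * `byte` must be representable as unsigned char, as isspace() requires.
 * (Hypothetical helper, not taken from bogofilter.) */
int is_space_in_c_locale(int byte)
{
    assert(byte >= 0 && byte <= UCHAR_MAX);
    /* Selecting "C" should always succeed; treat failure as "not space". */
    if (setlocale(LC_CTYPE, "C") == NULL)
        return 0;
    return isspace(byte) != 0;
}

/* Counts the byte values in 0x80..0xFF that isspace() accepts
 * in the C locale. */
int count_high_space_bytes(void)
{
    int c, n = 0;

    for (c = 0x80; c <= 0xFF; c++)
        n += is_space_in_c_locale(c);
    return n;
}
```

On the systems listed above, is_space_in_c_locale(0xA0) should return 0;
a nonzero result points at the behaviour discussed in this thread.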

Does OpenBSD plan to review every possible application to check that it
properly treats characters in the range 0x80 to 0xff (or -0x01 to -0x80
where char is signed) once isspace() has returned true? Array subscripts
may cause a lot of fun here...
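
To make the subscript hazard concrete, here is a sketch of the defensive
pattern, assuming 8-bit bytes and a hypothetical 256-entry table. Where
char is signed, (char)0xA0 is -96, so using it directly as an index
reads before the start of the array.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>

/* Hypothetical lookup into a 256-entry table indexed by byte value. */
int lookup_byte(const unsigned char table[256], char c)
{
    assert(UCHAR_MAX == 255);   /* this sketch assumes 8-bit bytes */
    /* table[c] would be table[-96] for c == (char)0xA0 on platforms
     * where plain char is signed; the cast maps it back to 160. */
    return table[(unsigned char)c];
}

/* Skips leading white space. The cast keeps the argument of isspace()
 * inside the range the C standard allows (unsigned char values or EOF). */
const char *skip_space(const char *s)
{
    while (*s != '\0' && isspace((unsigned char)*s))
        s++;
    return s;
}
```

Code that omits the cast only stays safe as long as isspace() never
accepts a byte with the high bit set.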

> Looking at Posix
> (http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap07.html):
> ===========================================================================
> space
>
> Define characters to be classified as white-space characters.
>
> In the POSIX locale, at a minimum, the <space>, <form-feed>, <newline>,
> <carriage-return>, <tab>, and <vertical-tab> shall be included.
> ===========================================================================
>
> So extension is allowed in the Posix locale. Seems the man page is not
> right, and the 'only' word has to be scrapped.

Please read the whole document; further down you'll find:

| LC_CTYPE Category in the POSIX Locale
| 
| The character classifications for the POSIX locale follow; the code listing
| depicts the localedef input, and the table represents the same information,
| sorted by character.
| 
| LC_CTYPE
| # The following is the POSIX locale LC_CTYPE.
| # "alpha" is by default "upper" and "lower"
| # "alnum" is by definition "alpha" and "digit"
| # "print" is by default "alnum", "punct", and the <space>
| # "graph" is by default "alnum" and "punct"
| #
| ...
| space    <tab>;<newline>;<vertical-tab>;<form-feed>;\
|          <carriage-return>;<space>
| ...

Nothing there says the class is extensible.

The table is exhaustive; in particular, it mentions neither 0xA0 nor
ISO-8859.
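
An application that cannot rely on the platform's isspace() could
hard-code the six characters from the localedef listing. The following
is only a sketch; posix_isspace() is a made-up name, not an existing
library or bogofilter function. Because it uses character constants
rather than numeric codes, it does not depend on ASCII either.

```c
#include <assert.h>

/* Locale-independent test for the six white-space characters that the
 * POSIX locale's LC_CTYPE "space" class lists. (Hypothetical helper.) */
int posix_isspace(int c)
{
    return c == ' '  || c == '\t' || c == '\n' ||
           c == '\v' || c == '\f' || c == '\r';
}

/* Small self-check showing the intended behaviour. */
void posix_isspace_selfcheck(void)
{
    assert(posix_isspace(' '));
    assert(posix_isspace('\r'));
    assert(!posix_isspace(0xA0));   /* NBSP in ISO-8859-1: not white space */
}
```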

Note in particular that POSIX doesn't even depend on ASCII; see
<http://www.opengroup.org/onlinepubs/000095399/xrat/xbd_chap06.html>

OpenBSD had better constrain itself to ASCII unless an ISO-8859-* locale
is explicitly specified, for portability and security reasons.

--
Matthias Andree
