[cvs] bogofilter/src lexer.c,1.145,1.146

Matthias Andree matthias.andree at gmx.de
Sat Dec 31 16:44:54 CET 2005


On Sat, 31 Dec 2005, David Relson wrote:

> > @@ -242,6 +242,7 @@
> >  #endif
> >  
> >      /* CRLF -> NL */
> > +    buf = buff->t.text;
> >      if (count >= 2 && memcmp(buf + count - 2, CRLF, 2) == 0) {
> >  	count --;
> >  	*(buf + count - 1) = (byte) '\n';
> 
> What is this???  buf is set at the beginning of the function via
> 
>    buf = buff->t.text + used
> 
> to account for characters already processed.

> Have you a test case that prompted this change?

It's a bugfix for an "invalid read" reported by valgrind. The test case is
gzipped and attached for your convenience (the attachment will be missing
on the list); it is one of the "crashes 0.96.2" test cases. Save the
attachment, then run:

gunzip msg.1023.6479.txt.gz
bogolexer -vpCI msg.1023.6479.txt

Run that under valgrind and you'll get:

==18939== Memcheck, a memory error detector for x86-linux.
==18939== Copyright (C) 2002-2005, and GNU GPL'd, by Julian Seward et al.
==18939== Using valgrind-2.4.1, a program supervision framework for x86-linux.
==18939== Copyright (C) 2000-2005, and GNU GPL'd, by Julian Seward et al.
==18939== For more details, rerun with: -v
==18939==
head:From
head:userid
head:example.com
head:Content-Type
head:text
head:html
==18939== Invalid read of size 1
==18939==    at 0x804B87F: get_decoded_line (lexer.c:246)
==18939==    by 0x804BA0B: yyinput (lexer.c:317)
==18939==    by 0x804CFEA: yy_get_next_buffer (lexer_v3.c:2391)
==18939==    by 0x804CDB6: yylex (lexer_v3.c:2226)
==18939==    by 0x804ED34: get_token (token.c:130)
==18939==    by 0x80497B6: main (bogolexer.c:303)
==18939==  Address 0x1BA87553 is not stack'd, malloc'd or (recently) free'd

This address is so far from any buffers that valgrind doesn't recognize it.

That prompted the fix. I'm not sure the fix is 100% correct, but at least
valgrind no longer complains.
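
For what it's worth, my working theory (and it is only a theory) is that
something between the initial "buf = buff->t.text + used" assignment and
the CRLF check can resize, and therefore move, buff->t.text, so the
pointer taken at function entry dangles. Here is a minimal stand-alone
sketch of that pattern; the names and sizes are made up, it is not
bogofilter's actual code path:

#include <stdlib.h>

/* Made-up names, not bogofilter's types. If grow() reallocates the text
 * buffer, a pointer computed from the old base address dangles, and
 * dereferencing it is exactly the kind of "invalid read" valgrind shows. */
typedef struct { unsigned char *text; size_t size; } textbuf;

static void grow(textbuf *b, size_t need)
{
    if (need > b->size) {
        b->size = need;
        b->text = realloc(b->text, b->size);  /* base address may move */
    }
}

int main(void)
{
    textbuf b = { malloc(16), 16 };
    unsigned char *buf = b.text;   /* pointer taken "at the beginning"  */

    grow(&b, 65536);               /* buffer moves; buf now dangles     */

    /* buf[0] = '\n';                 would touch freed memory           */
    buf = b.text;                  /* re-deriving from the current base */
    buf[0] = '\n';                 /* avoids the stale pointer          */

    free(b.text);
    return 0;
}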



Another thing I've found out about in the same context, and that I've
become more and more uncomfortable with the longer I've been tracing this
(more than four hours now!):

The patch you added to long_token() (same file) between 0.96.2 and 0.96.3,
which fixed more bugs with the same input, assumes a "long token" when it
encounters a NUL byte. Why is that, and why does it work? We have
base-address/count strings, not NUL-terminated strings.

It appears that the buffer handling is still broken (and has been for ages) and
this code is merely a workaround, not a fix, for yet another buffer overrun:

| static bool long_token(byte *buf, uint count)
| {
|     uint i;
|     for (i=0; i < count; i += 1) {
|         byte c = buf[i];
>         /* 10/23/05 - fix SIGSEGV with msg.1023.6479.txt
>         ** evidently caused by 09/07/05 patch for 0.96.2
>         */
>         if (c == '\0')
>             break;
|         if ((iscntrl(c) || isspace(c) || ispunct(c)) && (c != '_'))
|             return false;
|     }
|     return true;
| }
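
To make concrete what that break does with our base-address/count strings:
any NUL inside the counted region hides the remaining bytes from the
checks, and a token that starts with a NUL is classified as "long"
outright. A tiny stand-alone illustration (the same loop reproduced for
demonstration, not the shipped file):

#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>

typedef unsigned char byte;

static bool long_token(byte *buf, unsigned int count)
{
    unsigned int i;
    for (i = 0; i < count; i += 1) {
        byte c = buf[i];
        if (c == '\0')
            break;              /* bytes after the NUL are never inspected */
        if ((iscntrl(c) || isspace(c) || ispunct(c)) && (c != '_'))
            return false;
    }
    return true;
}

int main(void)
{
    byte a[] = { 'a', 'b', '\0', ' ', '!' };  /* space/punct hidden behind NUL */
    byte b[] = { '\0', 'x', 'y' };            /* leading NUL                   */

    printf("%d\n", long_token(a, 5));  /* prints 1: treated as a long token */
    printf("%d\n", long_token(b, 3));  /* prints 1: also "long"             */
    return 0;
}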

The actual reason for the crash with the long tokens in the test file is
that we pass bogus data through yyinput -> ... -> iconvert -> convert ->
iconv(), and the crash happens inside iconv() with a corrupt stack: the
lexer appears to have a fixed 16 kB limit, and iconv appears to write past
the buffer's end because we feed it bogus buffer limits.
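
For reference, iconv() only respects the limits we hand it: it advances
*inbuf/*outbuf and decrements *inbytesleft/*outbytesleft, and if
*outbytesleft claims more room than the output buffer really has, it will
happily scribble past the end. A minimal sketch of the calling convention,
with made-up buffer sizes that have nothing to do with our 16 kB lexer
buffer:

#include <iconv.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

    char in[] = "caf\xe9";        /* ISO-8859-1 input                    */
    char out[16];                 /* real capacity of the output buffer  */

    char  *ip = in, *op = out;
    size_t ileft = strlen(in);
    size_t oleft = sizeof out;    /* must match the true size; overstating
                                     it lets iconv() write past `out`     */

    if (iconv(cd, &ip, &ileft, &op, &oleft) == (size_t)-1)
        perror("iconv");

    fwrite(out, 1, (size_t)(op - out), stdout);
    putchar('\n');
    iconv_close(cd);
    return 0;
}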

I fear that the \0 byte the "fix" above detects is just a sentinel byte
that disappears if we disable your D() and Z() debugging cheats.

I feel *EXTREMELY* uncomfortable with the current buffer handling, long
token handling, shifting and everything else: I do not understand the
code, it is evidently accumulating data without bounds, and the only
reason that data sneaks out the back door is that long_token gets scared
by a \0 byte and triggers disposal of the buffer contents.

If you could explain and/or comment the buffer handling code in this
area a bit more, that would be much appreciated.

-- 
Matthias Andree

