[cvs] bogofilter/src lexer.c,1.145,1.146

David Relson relson at osagesoftware.com
Sat Dec 31 17:26:50 CET 2005


On Sat, 31 Dec 2005 16:44:54 +0100
Matthias Andree wrote:

> On Sat, 31 Dec 2005, David Relson wrote:
> 
> > > @@ -242,6 +242,7 @@
> > >  #endif
> > >  
> > >      /* CRLF -> NL */
> > > +    buf = buff->t.text;
> > >      if (count >= 2 && memcmp(buf + count - 2, CRLF, 2) == 0) {
> > >  	count --;
> > >  	*(buf + count - 1) = (byte) '\n';
> > 
> > What is this???  buf is set at the beginning of the function via
> > 
> >    buf = buff->t.text + used
> > 
> > to account for characters already processed.
> 
> > Have you a test case that prompted this change?
> 
> It's a bugfix for an "invalid read" reported by valgrind.  The
> triggering message is gzipped and attached for your convenience (it
> will be missing on the list); it is one of the "crashes 0.96.2" test
> cases.  Save the attachment, then run:
> 
> gunzip msg.1023.6479.txt.gz
> bogolexer -vpCI msg.1023.6479.txt
> 
> And you'll get:
> 
> ==18939== Memcheck, a memory error detector for x86-linux.
> ==18939== Copyright (C) 2002-2005, and GNU GPL'd, by Julian Seward et al.
> ==18939== Using valgrind-2.4.1, a program supervision framework for x86-linux.
> ==18939== Copyright (C) 2000-2005, and GNU GPL'd, by Julian Seward et al.
> ==18939== For more details, rerun with: -v
> ==18939==
> head:From
> head:userid
> head:example.com
> head:Content-Type
> head:text
> head:html
> ==18939== Invalid read of size 1
> ==18939==    at 0x804B87F: get_decoded_line (lexer.c:246)
> ==18939==    by 0x804BA0B: yyinput (lexer.c:317)
> ==18939==    by 0x804CFEA: yy_get_next_buffer (lexer_v3.c:2391)
> ==18939==    by 0x804CDB6: yylex (lexer_v3.c:2226)
> ==18939==    by 0x804ED34: get_token (token.c:130)
> ==18939==    by 0x80497B6: main (bogolexer.c:303)
> ==18939==  Address 0x1BA87553 is not stack'd, malloc'd or (recently) free'd
> 
> This address is so far from any buffers that valgrind doesn't recognize it.
> 
> That prompted the fix. I'm not sure if it's 100% correct now, at least
> valgrind doesn't complain any more.
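
One guess as to why the reassignment matters (a guess only; I haven't
traced this path): if anything between the top of get_decoded_line()
and the CRLF check can grow buff->t.text, a buf computed at function
entry is left dangling by the realloc.  That would fit valgrind's "not
stack'd, malloc'd or free'd" address.  Schematically, with a struct and
names of my own invention rather than our actual code:

    #include <stdlib.h>

    typedef struct {
        unsigned char *text;    /* base address of the buffer */
        size_t         leng;    /* bytes currently in use     */
        size_t         size;    /* bytes allocated            */
    } str_t;                    /* stand-in for buff->t       */

    /* hypothetical decode step: if it must grow the buffer, realloc
    ** may move the block, and every pointer previously derived from
    ** t->text silently goes stale */
    static int grow(str_t *t, size_t need)
    {
        if (t->leng + need > t->size) {
            size_t newsize = 2 * (t->leng + need);
            unsigned char *p = realloc(t->text, newsize);
            if (p == NULL)
                return -1;      /* out of memory */
            t->text = p;        /* old "text + offset" pointers now dangle */
            t->size = newsize;
        }
        return 0;
    }

If that's the mechanism, recomputing buf from buff->t.text after the
decode step, which is exactly what the one-line patch does, is the
right instinct.  One detail worth checking either way: the patch
reloads buff->t.text without the "+ used" offset that the original
assignment had.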
> 
> 
> 
> Another thing in the same context that I've found out about, and that
> I've become more and more uneasy about the longer I've been tracing
> this (for more than four hours now!):
> 
> The patch you added to long_token() (same file) between 0.96.2 and
> 0.96.3, which fixed more bugs with the same input, assumes a "long
> token" when it encounters a NUL byte.  Why is this, and why does it
> work?  We have "base address/count" strings, not NUL-terminated
> strings.
> 
> It appears that the buffer handling is still broken (and has been for ages) and
> this code is merely a workaround, not a fix, for yet another buffer overrun:
> 
> | static bool long_token(byte *buf, uint count)
> | {
> |     uint i;
> |     for (i=0; i < count; i += 1) {
> |         byte c = buf[i];
> >         /* 10/23/05 - fix SIGSEGV with msg.1023.6479.txt
> >         ** evidently caused by 09/07/05 patch for 0.96.2
> >         */
> >         if (c == '\0')
> >             break;
> |         if ((iscntrl(c) || isspace(c) || ispunct(c)) && (c != '_'))
> |             return false;
> |     }
> |     return true;
> | }
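
On the base address/count point: agreed.  An embedded NUL in a counted
buffer is ordinary data, not a terminator, and any code that treats it
as a sentinel is mixing the two string conventions.  A trivial,
self-contained illustration (nothing bogofilter-specific):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* a counted buffer that legally contains a NUL byte */
        const char buf[] = { 'a', 'b', '\0', 'c', 'd' };
        size_t count = sizeof buf;              /* 5: the real length    */

        printf("count  = %zu\n", count);        /* prints 5              */
        printf("strlen = %zu\n", strlen(buf));  /* prints 2: stops at \0 */
        return 0;
    }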
> 
> The actual reason for the crash with the long tokens in the test file
> is that we're passing bogus data through yyinput -> ... -> iconvert ->
> convert -> iconv(), and the crash happens inside iconv() with a corrupt
> stack.  The lexer appears to have a fixed 16 kB limit, and iconv
> appears to write past the buffer's end because we feed it bogus buffer
> limits.
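
For reference, here is the generic calling shape of iconv(3).  This is
not our convert() code, just the contract it has to honor: iconv()
writes at most *outbytesleft bytes past *outbuf, so it can only overrun
the buffer if we hand it a count larger than the space that is really
left.

    #include <iconv.h>

    /* generic shape of a bounds-correct iconv() call; cd is an open
    ** conversion descriptor.  iconv() writes at most the outleft we
    ** pass, so outleft must never exceed the space really remaining. */
    static size_t to_charset(iconv_t cd,
                             char *in,  size_t inlen,
                             char *out, size_t outsize)
    {
        char  *inp = in,      *outp = out;
        size_t inleft = inlen, outleft = outsize;

        while (inleft > 0) {
            if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
                break;          /* E2BIG/EILSEQ/EINVAL: caller decides */
        }
        return (size_t)(outp - out);    /* bytes actually produced */
    }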
> 
> I fear that the \0 byte the "fix" above detects is just a sentinel byte
> that disappears if we disable your D() and Z() debugging cheats.
> 
> I feel *EXTREMELY* uncomfortable with the current buffer handling, long
> token handling, shifting, and everything else, because I do not
> understand the code.  It is evidently accumulating stuff without
> bounds, and the contents only sneak out the back door when long_token,
> scared by a \0 byte, triggers their disposal.
> 
> If you could explain and/or comment the buffer handling code in this
> area a bit more, that would be much appreciated.
> 
> -- 
> Matthias Andree

I'll check on your "buf = ..." change by searching my message archive
to find messages that give different results for the new and old
versions of bogolexer.

Confession time:->  I don't fully understand the buffering code
myself.  When developing it, I started with a vision of what was
needed.  However, as the implementation progressed, various conditions
cropped up that needed special handling.  The end product is the
result of tweaking and "debugging into submission".

A problem we've seen is that some messages have sequences of thousands
of alphanumeric characters with no delimiters.  This can overflow
flex's 16 kbyte (fixed length) buffer.  The purpose of the
long_token() function is to detect a character sequence that's too
long to be an acceptable token, i.e. far longer than MAX_TOKEN_LEN, so
that the characters can be deleted, i.e. exempted from further
processing.

The "if (c == '\0')" check that you mention is an example of "doing
what works".  Without this "early out" from the loop, there are
problems (the msg.1023.6479.txt SIGSEGV noted in the comment).
Possibly this loop can be rewritten as

  for (i = 0; i < count; i += 1)
      if (!isalnum(buf[i]))
          return false;

or similar.
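
Fleshed out, and keeping the '_' exemption that the current test has,
the idea might look like the following.  It's untested, and note that
it would make the explicit '\0' early-out unnecessary, since
isalnum('\0') is false:

    #include <ctype.h>
    #include <stdbool.h>

    typedef unsigned char byte;

    /* candidate rewrite: a run qualifies as a "long token" only if
    ** every byte is a token constituent (alphanumeric or '_'); any
    ** other byte, including an embedded NUL, disqualifies it */
    static bool long_token(const byte *buf, unsigned int count)
    {
        unsigned int i;
        for (i = 0; i < count; i += 1) {
            byte c = buf[i];
            if (!isalnum(c) && c != '_')
                return false;
        }
        return true;
    }

One caveat: this is stricter than the current iscntrl/isspace/ispunct
test for bytes above 0x7f, which the present code lets through as
token characters; whether that matters for 8-bit charsets needs a look.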


