fatal flex scanner internal error--end of buffer missed

Thu Sep 4 18:27:01 CEST 2003

On Thu, 04 Sep 2003 11:05:23 -0500
"Karl O. Pinc" <kop at meme.com> wrote:

> 
> On 2003.09.04 07:25 David Relson wrote:
> > Karl,
> > 
> > I've looked at all 5 of the messages.  Each begins with a normal
> > "From" line, followed by normal message headers, followed by a
> > normal body, followed by additional "Status: RO", "Content-Length:",
> > and "Lines:" header lines.  These messages are unusual.  I'm not sure
> > whether they comply with the standards or not.  What's their origin?
> > 
> > For example, in #19041 lines 37 to 82 are base64 encoded text.
> > Lines 83 to 85 are:
> > 
> > Status: RO
> > Content-Length: 6224
> > Lines: 157
> 
> Please, no need to apologize.  Y'all are doing _me_ a favor with all
> the work you've done.  (And I've worked around the problem.)
> 
> I rebuilt from the srpm just on principle, no worries about library
> compatibility etc.  (If there's a build requirement autoconf doesn't
> grok you can always use the "BuildRequires:" specfile tag to avoid
> problems.)
> (I find "rpm --rebuild" the best idiom for installing software not
> specific to my distro release.)
> 
> (FYI:  rpm -q flex --> flex-2.5.4a-1)
> 
> The messages are from my saved spam mbox.  I found them while
> training.  Very likely these are not standards conformant messages.
> I've been collecting spam for years and have used various mail
> clients; of late I find I can't wean myself from the GUI and am using
> balsa (at the moment balsa-1.2.4-7.7.2, but I have used older
> versions).  I suspect the client has sometimes corrupted the mailbox,
> maybe when it gets really large like my latest spam box (~160MB).  I
> noticed quite a few (20?) corrupted messages while carefully cleaning
> my spam corpus (~30,000 messages).  I wouldn't think they _all_ were
> bad on arrival.  I tried to simply delete non-conformant messages when
> I came across them.
> 
> I _have_ seen some non-conformant spams arrive tho', I suspect
> straight from a spammer with faulty software.  I'd think it'd be nice
> to be able to handle them.  No way to trap an exception -- a-la strace
> if nothing else? :(  (Gosh, haven't thought of language hacking in a
> while.)
> 
> Anyhow, not a big deal.  Although come to think of it I'm using the
> procmail recipe from man 1 bogofilter which rejects the delivery on
> error, so that might get me some sort of a loop should there be a
> failure. (labeled:
>         # filter mail through bogofilter, tagging it as spam and
>         # updating the wordlists
> )

Karl,

Bogofilter's goal is to handle standards-compliant messages.  As we
discover the ways that spammers deviate from the standard (and that are
accepted by popular MUAs), we "loosen" bogofilter's interpretations.  A
small number of non-compliant constructs are already understood.

Whether a message is compliant or not, bogofilter should never abort.
With procmail, an abort causes a retry, which causes another abort,
another retry, ...

Since writing to you earlier today, I dug into flex and found
YY_FATAL_ERROR, which calls yy_fatal_error(), which prints the message
and aborts.  I now have a new definition of YY_FATAL_ERROR which uses
setjmp/longjmp.  This at least allows bogofilter to score the message up
to the problem area instead of aborting, which should also avoid the
procmail retry loop.
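
For illustration only, here is a minimal sketch of that kind of
override.  It is not the actual bogofilter change: lexer_recover,
lexer_last_error, lexer_fatal, fake_yylex and count_tokens are all
made-up names, and fake_yylex() is just a stand-in for the generated
yylex() so the example compiles without flex.  The only flex-specific
assumption is that a YY_FATAL_ERROR defined in the scanner's prologue
takes the place of the default one.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical names; none of these are taken from bogofilter's source. */
static jmp_buf lexer_recover;         /* context to return to on a fatal scanner error */
static const char *lexer_last_error;  /* message handed over by the scanner */

/* Replacement handler: remember the message and jump back instead of exiting. */
static void lexer_fatal(const char *msg)
{
    assert(msg != NULL);
    lexer_last_error = msg;
    longjmp(lexer_recover, 1);
}

/* In a real .l file this would sit in the %{ ... %} prologue. */
#define YY_FATAL_ERROR(msg) lexer_fatal(msg)

/* Stand-in for yylex(): returns 1 per token and 0 at end of input,
 * and raises a fatal error when the position reaches bad_at. */
static int fake_yylex(int *pos, int total, int bad_at)
{
    if (*pos == bad_at)
        YY_FATAL_ERROR("fatal flex scanner internal error--end of buffer missed");
    if (*pos >= total)
        return 0;
    (*pos)++;
    return 1;
}

/* Count tokens until end of input or until the scanner gives up. */
static int count_tokens(int total, int bad_at)
{
    volatile int seen = 0;  /* volatile: modified after setjmp, read after longjmp */
    int pos = 0;

    lexer_last_error = NULL;
    if (setjmp(lexer_recover) != 0)
        return seen;        /* fatal error: keep what was scanned so far */

    while (fake_yylex(&pos, total, bad_at))
        seen++;
    return seen;
}
```

With this, count_tokens(5, 3) stops after three tokens and leaves the
scanner's message in lexer_last_error, while count_tokens(5, -1) runs
to the end.  A real scanner is left in an undefined state after the
longjmp, so it would presumably need something like yyrestart() before
being used on the next message.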

David