wordlists and lexer [was: t.bulkmode problem]

Tue Nov 23 01:45:45 CET 2004

On Tue, 23 Nov 2004 01:31:54 +0100
Matthias Andree wrote:

> David Relson <relson at osagesoftware.com> writes:
> 
> >> It passes make check now, but we're still not ready to release, the
> >> multiple-environment is still not backed by a multiple-lock scheme.
> >
> > Multiple wordlists are not commonly used, AFAICT.  Why not document
> > the limitation and go ahead?
> 
> Because we're halfway in the middle of the changes. Documenting the
> limitation would be "you cannot use multiple wordlists", disable the
> --wordlist option, revert the two large commits and move on - not an
> option, as it seems.

No.  It's "you cannot use multiple wordlists with transactions" due to
limitations in BerkeleyDB's environment.  That you can work around the
limitations is nice, but it's added complexity.

> > One possibility is to (for the time being)
> > recommend disabling transactions (use olddb) with multiple
> > wordlists.
> 
> I'd thought about abstracting the datastore a bit more, and adding a
> datastore type, so bogofilter might be able to, for instance, access
> another database.  That might allow the user to access databases in
> different formats or on different machines in the end, but I'm not
> sure if we want this before 1.0.

Being able to support a variety of database types in one run is way, way
out there.  Bogofilter is a spam filter, _not_ a database showcase.  We
have no need for supporting different formats or different machines --
not now, not for 1.0, not after 1.0.

> >> There's also a problem in the message-count parser. Somehow, flex
> >> can propagate junk that starts with a leading space through to
> >> collect.c, which causes a segfault in wordhash_insert because we're
> >> stuffing ULONG_MAX in the marked lines. For some reason, we cannot
> >> assume that we don't have a leading space.
> >> 
> >> .       if (cls == BOGO_LEX_LINE)
> >> .       {
> >> .           char *s = (char *)(yylval->text+1);
> >> >           char *f = strchr(s, ' ') - 1;
> >> .           token->text = (unsigned char *) s;
> >> >           token->leng = f - s;
> >> .       }
> >
> > As message-count files are used to speed scoring when running
> > scoring tests, weaknesses in their handling don't affect production
> > use.
> 
> > Did you encounter the problem with a real file, or with something
> > you created another way?  Having spaces in tokens is bogus -- to the
> > parser spaces are always delimiters.
> 
> With interspersed debug output, it was something that started like
> this
> 
> "          \r    \r1" or similar.
> 
> and was reported as BOGO_LEX_LINE. I'm not sure how lex could consider
> this a BOGO_LEX_LINE, it had only one number.

I suspect bogofilter is also likely to do bad things if you give it a
random file.  It's meant to deal with certain kinds of input,
specifically email, and has some wiggle room for improperly formatted
messages.  It's a tool with a particular purpose.  Asking it to do
other things is beyond its scope.

> I believe it only takes a tiny bug in lexer_v3.l to break nearly every
> assumption that the consumers make.

No different from anywhere else in this program or any other program.  A
small error will always be able to cause a large negative effect.
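
That said, a consumer can guard against this particular case cheaply.
In the marked lines above, a space found right at s makes f - s equal
to -1, which is presumably where the ULONG_MAX comes from once it lands
in an unsigned length.  Below is a minimal sketch, not the actual
collect.c code: split_count_line() and its return convention are made
up for illustration, and it assumes the line is a quoted token followed
by a space and the counts.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of bogofilter): extract the token from
 * a message-count line, assumed to look like  "token" 12 3
 * Returns 0 and sets *tok / *len on success, -1 if the line is
 * malformed.  *tok points into text and is NOT NUL-terminated. */
int split_count_line(const char *text, const char **tok, size_t *len)
{
    const char *s, *sp;

    if (text == NULL || text[0] == '\0')
        return -1;

    s = text + 1;            /* skip the first character, as the quoted code does */
    sp = strchr(s, ' ');     /* delimiter between the token and the counts */

    /* No space at all: strchr() returned NULL, so any pointer
     * arithmetic on it would be undefined.
     * Space right at s: the length would come out as -1 and wrap
     * around when stored in an unsigned field. */
    if (sp == NULL || sp == s)
        return -1;

    *tok = s;
    *len = (size_t)((sp - 1) - s);   /* drop the character just before the space */
    return 0;
}
```

With a check like this, input such as the "   \r  1" case is rejected
before it ever reaches wordhash_insert.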

David