t.bulkmode problem

Tue Nov 23 01:31:54 CET 2004

David Relson <relson at osagesoftware.com> writes:

>> It passes make check now, but we're still not ready to release, the
>> multiple-environment is still not backed by a multiple-lock scheme.
>
> Multiple wordlists are not commonly used, AFAICT.  Why not document the
> limitation and go ahead?

Because we're in the middle of the changes. Documenting the
limitation would mean saying "you cannot use multiple wordlists",
disabling the --wordlist option, reverting the two large commits and
moving on, which doesn't seem to be an option.

> One possibility is to (for the time being)
> recommend disabling transactions (use olddb) with multiple wordlists.

I'd thought about abstracting the datastore a bit more, and adding a
datastore type, so bogofilter might be able to, for instance, access
another database.  That might allow the user to access databases in
different formats or on different machines in the end, but I'm not sure
if we want this before 1.0.

>> There's also a problem in the message-count parser. Somehow, flex can
>> propagate junk that starts with a leading space through to collect.c,
>> which causes a segfault in wordhash_insert because we're stuffing
>> ULONG_MAX in the marked lines. For some reason, we cannot assume that
>> we don't have a leading space.
>> 
>> .       if (cls == BOGO_LEX_LINE)
>> .       {
>> .           char *s = (char *)(yylval->text+1);
>> >           char *f = strchr(s, ' ') - 1;
>> .           token->text = (unsigned char *) s;
>> >           token->leng = f - s;
>> .       }
>
> As message-count files are used to speed scoring when running scoring
> tests, weaknesses in their handling don't affect production use.

> Did you encounter the problem with a real file? or with something you
> created another way?  Having spaces in tokens is bogus -- to the
> parser spaces are always delimiters.

With interspersed debug output, it was something that started like this

"          \r    \r1" or similar.

and was reported as BOGO_LEX_LINE. I'm not sure how lex could consider this
a BOGO_LEX_LINE, since it had only one number.

I believe it only takes a tiny bug in lexer_v3.l to break nearly every
assumption that the consumers make.
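
To illustrate the kind of check a consumer could make, here is a minimal
sketch, not the actual collect.c code. It assumes, as the quoted snippet
suggests, that the first byte of the text is a marker to skip and that the
token ends one byte before the first space. split_marked_line() is a
hypothetical name. If the text after the marker starts with a space, the
original arithmetic yields a length of -1, which turns into a huge value
once stored in an unsigned field; if there is no space at all, strchr()
returns NULL and the subtraction is undefined. The sketch rejects both
cases instead of passing them on.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of bogofilter): split a marked line into
 * its token, using the same layout the quoted snippet assumes.
 * Returns 0 and fills *tok / *len on success, -1 if the line is malformed.
 */
static int split_marked_line(const char *text, const char **tok, size_t *len)
{
    const char *s, *sp;

    if (text == NULL || text[0] == '\0')
        return -1;              /* nothing to skip, nothing to parse */

    s = text + 1;               /* skip the leading marker byte */
    sp = strchr(s, ' ');        /* first delimiter after the token */
    if (sp == NULL)
        return -1;              /* no delimiter: do not do NULL - 1 */

    if (sp - s < 2)
        return -1;              /* leading space or empty token: the
                                   original code would compute -1 or 0 */

    *tok = s;
    *len = (size_t)(sp - 1 - s);    /* same value as f - s, now >= 1 */
    return 0;
}
```

A caller would skip (or log) the line when this returns -1 rather than
hand the token to wordhash_insert. That does not fix whatever makes the
lexer report such input as BOGO_LEX_LINE, it only keeps the consumer from
crashing on it.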

-- 
Matthias Andree