wordlists and lexer

Matthias Andree matthias.andree at gmx.de
Tue Nov 23 02:00:51 CET 2004


David Relson <relson at osagesoftware.com> writes:

>> Because we're halfway in the middle of the changes. Documenting the
>> limitation would be "you cannot use multiple wordlists", disable the
>> --wordlist option, revert the two large commits and move on - not an
>> option, as it seems.
>
> No.  It's "you cannot use multiple wordlists with transactions" due to
> limitations in BerkeleyDB's environment.  That you can work around the
> limitations is nice, but it's added complexity.

I think I'll review and back out the large database environment commits
and replace the --wordlist option with a message "You cannot use multiple
wordlists with Berkeley DB Transactional Data Store. See section X.Y in
file README.abc for details." This is a larger task and it's bedtime
now, and we'll want to keep some of the fixes that came with the larger
updates.

>> > As message-count files are used to speed scoring when running
>> > scoring tests, weaknesses in their handling don't affect production
>> > use.
>> 
>> > Did you encounter the problem with a real file? or with something
>> > you created another way?  Having spaces in tokens is bogus -- to the
>> > parser spaces are always delimiters.
>> 
>> With interspersed debug output, it was something that started like
>> this
>> 
>> "          \r    \r1" or similar.
>> 
>> and was reported as BOGO_LEX_LINE. I'm not sure how lex could consider
>> this a BOGO_LEX_LINE, it had only one number.
>
> I suspect bogofilter is also likely to do bad things if you give it a
> random file.  It's meant to deal with certain kinds of input, spec.
> email, and has some wiggle room for improperly formatted messages.  It's
> a tool with a particular purpose.  Asking it to do other things is
> beyond its scope.

Yes, but crashing on bogus input is also beyond its scope. It runs in a
mail environment, and any bug that is triggered by bogus input can also
be triggered by a remote user. We absolutely must not allow SIGSEGV
here.

I'll go fix this bug by just ignoring the line and reading the next
token - someone else can then fix the lexer so we don't pass data down
that beats the crap out of collect.c.
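
To illustrate the "ignore the line and read the next one" idea, here is a
minimal sketch in C. It is not the actual collect.c or lexer code: the
record layout (one token followed by one unsigned count per line), the
struct count_record type and the parse_count_line() function are all
hypothetical names made up for this example. The point is only that a
malformed line is rejected with a return value instead of being passed
down to code that assumes well-formed data.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record: one token followed by one count (assumed format). */
struct count_record {
    char token[64];
    unsigned long count;
};

/* Parse "<token> <count>" from line into rec.
 * Returns 1 on success, 0 if the line is malformed and should be skipped.
 * Never reads past the terminating NUL and never overflows rec->token. */
int parse_count_line(const char *line, struct count_record *rec)
{
    const char *p = line;
    size_t len = 0;
    char *end;

    /* Skip leading whitespace, including stray carriage returns. */
    while (*p != '\0' && isspace((unsigned char)*p))
        p++;

    /* Measure the token and make sure it fits the buffer. */
    while (p[len] != '\0' && !isspace((unsigned char)p[len]))
        len++;
    if (len == 0 || len >= sizeof rec->token)
        return 0;
    memcpy(rec->token, p, len);
    rec->token[len] = '\0';
    p += len;

    /* A count must follow; reject a missing or signed number. */
    while (*p != '\0' && isspace((unsigned char)*p))
        p++;
    if (!isdigit((unsigned char)*p))
        return 0;
    errno = 0;
    rec->count = strtoul(p, &end, 10);
    if (errno == ERANGE)
        return 0;

    /* Only trailing whitespace may remain on the line. */
    while (*end != '\0' && isspace((unsigned char)*end))
        end++;
    return *end == '\0';
}
```

A caller would loop over the input lines and simply continue with the
next line whenever parse_count_line() returns 0. With this assumed
format, a line like "          \r    \r1" is skipped because it carries
only one field.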

>> I believe it only takes a tiny bug in lexer_v3.l to break nearly every
>> assumption that the consumers make.
>
> No different from anywhere else in this program or any other program.  A
> small error will always be able to cause a large negative effect.

We cannot allow this in a mail scoring application. I don't mind if
bogotune or bogoutil barfs on a non-crucial function once in a while. We
can fix that after the bug report, since it poses no immediate danger to
the end user's system.

Bogofilter itself must not crash, particularly not when running in some
non-registering mode, else the mail system might be unable to make any
progress with its duties - and parsers are crash-prone.

I fear 0.92.9 has already set out as a bugfix update. We'll see if
something knocks on our doors.

-- 
Matthias Andree



More information about the bogofilter-dev mailing list