front end
David Relson
relson at osagesoftware.com
Tue Aug 12 02:12:36 CEST 2003
At 09:48 AM 8/11/03, Matthias Andree wrote:
>[moving the discussion to bogofilter-dev where it belongs, Reply-To and
>MFT set]
>
>Matthias Andree <matthias.andree at gmx.de> writes:
>
> > 1. the "READER" is also a driver and calls into bogofilter. This would
> > lend itself nicely to a library model that lets applications query a
> > bogofilter library "look at 3874 bytes from 0x1234567c and return
> > spamicity".
>
>Of course, we should also be able to read from a stream, because we
>cannot guarantee we can mmap() the input file descriptor, and we may
>want to avoid copying, to $TMPDIR in particular.
Currently, in passthrough mode, bogofilter saves the lines it reads from
stdin in a linked list of textblocks. I was thinking that this mechanism
could be used by the readers. More specifically, as the input stream is
broken into messages the lines would be saved in memory in a known
location. Once a message is fully in memory, the lexer would be invoked to
process it. The lexer uses YY_INPUT to get its input. For our purposes,
this would be a routine that returns the textblocks.
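
A minimal sketch of such a routine, to make the idea concrete. All names
here (textblock_t, tb_reader_t, tb_read, msg_reader) are hypothetical and
are not taken from bogofilter's source; the struct layout is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical types; bogofilter's real textblock code may differ. */
typedef struct textblock {
    const char *data;          /* bytes of one saved line */
    size_t len;                /* number of bytes in data */
    struct textblock *next;    /* next saved line, or NULL */
} textblock_t;

typedef struct {
    const textblock_t *cur;    /* block currently being drained */
    size_t off;                /* bytes of cur already handed out */
} tb_reader_t;

/* Copy up to max_size bytes from the saved blocks into buf.
 * Returns the number of bytes copied; 0 means end of message. */
static size_t tb_read(tb_reader_t *r, char *buf, size_t max_size)
{
    size_t copied = 0;

    assert(r != NULL && buf != NULL);
    while (r->cur != NULL && copied < max_size) {
        size_t avail = r->cur->len - r->off;
        size_t n = max_size - copied;

        if (n > avail)
            n = avail;
        memcpy(buf + copied, r->cur->data + r->off, n);
        copied += n;
        r->off += n;
        if (r->off == r->cur->len) {   /* block drained, move on */
            r->cur = r->cur->next;
            r->off = 0;
        }
    }
    return copied;
}

/* In a flex scanner this could be wired up roughly as follows, where
 * msg_reader is a hypothetical reader for the current message:
 *
 *   #define YY_INPUT(buf, result, max_size) \
 *       (result) = (int) tb_read(&msg_reader, (buf), (max_size))
 */
```

Returning 0 at the end of the list is what tells the lexer that the
message is finished, so each message gets its own reader.
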
Bogofilter also determines whether its input is seekable (so that it can
avoid buffering in memory when appropriate). That mechanism is applicable
here as well.
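
One common way to test for seekability on POSIX systems is to try a
no-op lseek(). The sketch below is an illustration only, not necessarily
how bogofilter does it; stream_is_seekable is a hypothetical name.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns true if the stream's descriptor supports seeking (for
 * example a regular file), false for pipes, sockets and FIFOs,
 * where lseek() fails with ESPIPE. */
static bool stream_is_seekable(FILE *fp)
{
    assert(fp != NULL);
    return lseek(fileno(fp), 0, SEEK_CUR) != (off_t) -1;
}
```

A reader could call this once on its input and only fall back to saving
lines in memory when the answer is false.
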
As a first try at implementing the new design, I'd likely read a character
at a time while building a line. If this turns out to be a bottleneck,
reading a block at a time would be tried, with lines described via
extents: a list of (block address, offset, length) entries.
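
A rough sketch of the extent idea, assuming a block has already been read
into memory. The names extent_t and find_lines are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical extent: one line inside a block that was read in bulk. */
typedef struct {
    const char *block;   /* address of the block holding the line */
    size_t offset;       /* start of the line within the block */
    size_t length;       /* line length, including '\n' if present */
} extent_t;

/* Scan block[0..size) and record one extent per line.  A final line
 * without a newline is recorded too.  Returns the number of extents
 * written, which is at most max_ext. */
static size_t find_lines(const char *block, size_t size,
                         extent_t *ext, size_t max_ext)
{
    size_t count = 0, start = 0, i;

    assert(block != NULL && ext != NULL);
    for (i = 0; i < size && count < max_ext; i++) {
        if (block[i] == '\n') {
            ext[count].block = block;
            ext[count].offset = start;
            ext[count].length = i + 1 - start;
            count++;
            start = i + 1;
        }
    }
    if (start < size && count < max_ext) {   /* trailing partial line */
        ext[count].block = block;
        ext[count].offset = start;
        ext[count].length = size - start;
        count++;
    }
    return count;
}
```

No line data is copied; only the extents are built. A real reader would
also have to deal with a line that straddles two blocks, which this
sketch leaves out.
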
Registration currently doesn't keep a message in memory, though
classification's passthrough mode does. Allowing registration to cache the
message in memory (as passthrough requires) is a simplification of the
code that I think is probably worthwhile.
Mailbox registration currently works by tokenizing each message (in a
wordhash structure), then merging the individual message wordhashes into a
master wordhash. At the end of the mailbox, all the tokens in the master
wordhash are added to the database.
The same plan would continue to work with the new implementation. The
change would be that the reader creates the master wordhash, then loops for
each message to read it, parse it, and add it to the master wordhash. At
EOF, the reader sees to it that the wordhash is added to the database.
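
The reader loop described above might look roughly like this. It is a toy
sketch: wordhash_t here is a small fixed-size table standing in for
bogofilter's real wordhash, the whitespace tokenizer stands in for the
lexer, and every name and the per-token counting are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_TOKENS 64        /* tiny fixed capacity, illustration only */
#define MAX_TOKEN_LEN 32

/* Hypothetical stand-in for a wordhash: token -> count. */
typedef struct {
    char token[MAX_TOKENS][MAX_TOKEN_LEN];
    unsigned count[MAX_TOKENS];
    size_t used;
} wordhash_t;

/* Add n occurrences of token; returns -1 if the table is full. */
static int wh_add(wordhash_t *wh, const char *token, unsigned n)
{
    size_t i;

    assert(wh != NULL && token != NULL);
    for (i = 0; i < wh->used; i++) {
        if (strcmp(wh->token[i], token) == 0) {
            wh->count[i] += n;
            return 0;
        }
    }
    if (wh->used == MAX_TOKENS || strlen(token) >= MAX_TOKEN_LEN)
        return -1;
    strcpy(wh->token[wh->used], token);
    wh->count[wh->used++] = n;
    return 0;
}

/* Look up a token's count; 0 if absent. */
static unsigned wh_count(const wordhash_t *wh, const char *token)
{
    size_t i;

    for (i = 0; i < wh->used; i++)
        if (strcmp(wh->token[i], token) == 0)
            return wh->count[i];
    return 0;
}

/* Toy parser: split on whitespace; overlong tokens are truncated. */
static void wh_parse(wordhash_t *wh, const char *msg)
{
    char word[MAX_TOKEN_LEN];
    size_t len = 0;

    for (;; msg++) {
        int sep = (*msg == '\0' || *msg == ' ' ||
                   *msg == '\n' || *msg == '\t');
        if (!sep && len < MAX_TOKEN_LEN - 1) {
            word[len++] = *msg;
        } else if (sep && len > 0) {
            word[len] = '\0';
            wh_add(wh, word, 1);
            len = 0;
        }
        if (*msg == '\0')
            break;
    }
}

/* Merge one message's wordhash into the master wordhash. */
static void wh_merge(wordhash_t *master, const wordhash_t *one)
{
    size_t i;

    for (i = 0; i < one->used; i++)
        wh_add(master, one->token[i], one->count[i]);
}

/* Reader loop: read/parse/add per message, one master wordhash. */
static void register_messages(const char *const *msgs, size_t n,
                              wordhash_t *master)
{
    size_t i;

    master->used = 0;
    for (i = 0; i < n; i++) {
        wordhash_t one;

        one.used = 0;
        wh_parse(&one, msgs[i]);
        wh_merge(master, &one);
    }
    /* At this point the real code would write master to the
     * database, once, instead of once per message. */
}
```

The point of the shape is that the database is touched a single time
after the loop, no matter how many messages were read.
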
At present, for maildirs and bulk mode, files are treated individually. If
there are lots of files, then the database is updated many times. Under
the new design, for maildirs (and bulk mode), the reader would create a
master wordhash, then loop over the file list doing read/parse/add. When
done reading files, the wordhash is added to the database.
AFAICT the same algorithm applies to registration from mailboxes and
maildirs. The difference is that maildir handling is much improved over
the current implementation.
Classification continues to operate on one message at a time. The reading
mechanisms above should work as well for classification as they do for
registration.