front end

Tue Aug 12 02:12:36 CEST 2003

At 09:48 AM 8/11/03, Matthias Andree wrote:
>[moving the discussion to bogofilter-dev where it belongs, Reply-To and
>MFT set]
>
>Matthias Andree <matthias.andree at gmx.de> writes:
>
> > 1. the "READER" is also a driver and calls into bogofilter. This would
> >    lend itself nicely to a library model that lets applications query a
> >    bogofilter library "look at 3874 bytes from 0x1234567c and return
> >    spamicity".
>
>Of course, we should also be able to read from a stream, because we
>cannot guarantee we can mmap() the input file descriptor, and we may
>want to avoid copying, to $TMPDIR in particular.

Currently, in passthrough mode, bogofilter saves the lines it reads from 
stdin in a linked list of textblocks.  I was thinking that the readers 
could use the same mechanism.  More specifically, as the input stream is 
broken into messages, the lines would be saved in memory in a known 
location.  Once a message is fully in memory, the lexer would be invoked to 
process it.  The lexer uses YY_INPUT to get its input.  For our purposes, 
YY_INPUT would be a routine that returns the textblocks.
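
To make the YY_INPUT idea concrete, here is a rough sketch of a routine that 
drains a linked list of textblocks into the lexer's buffer.  The struct 
layout and the names (textblock, tb_cursor, tb_read, msg_cursor) are 
assumptions made up for illustration, not bogofilter's actual definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical textblock: one saved line of a message. */
struct textblock {
    const char *data;        /* line bytes, not necessarily NUL-terminated */
    size_t len;              /* number of bytes in data */
    struct textblock *next;  /* next saved line, or NULL at end of message */
};

/* Hypothetical read cursor over the list. */
struct tb_cursor {
    const struct textblock *cur;  /* block currently being drained */
    size_t off;                   /* bytes already consumed from cur */
};

/* Copy up to max_size bytes into buf.  Returns the number of bytes copied,
 * or 0 once the whole message has been handed out. */
size_t tb_read(struct tb_cursor *c, char *buf, size_t max_size)
{
    size_t n = 0;

    assert(c != NULL && buf != NULL);
    while (n < max_size && c->cur != NULL) {
        size_t avail = c->cur->len - c->off;
        size_t want = max_size - n;
        size_t take = avail < want ? avail : want;

        memcpy(buf + n, c->cur->data + c->off, take);
        n += take;
        c->off += take;
        if (c->off == c->cur->len) {  /* block drained: move to the next */
            c->cur = c->cur->next;
            c->off = 0;
        }
    }
    return n;
}

/* In the flex input file this could be wired up roughly as follows
 * (msg_cursor is a hypothetical cursor set up by the reader):
 *
 *   #define YY_INPUT(buf, result, max_size) \
 *       (result) = (int) tb_read(&msg_cursor, (buf), (size_t) (max_size))
 */
```

A return of 0 is what flex treats as end of input, so the lexer would stop 
at the end of each message without seeing the rest of the stream.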

Bogofilter also checks whether its input is seekable, so that it can avoid 
holding the message in memory where that is possible.  That check carries 
over to the new design.

As a first try at implementing the new design, I'd likely read a character 
at a time while building a line.  If this turns out to be a bottleneck, 
I'd try reading a block at a time, with lines described via extents: a list 
of (block address, offset, length) triples.
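
A minimal sketch of the extent idea, assuming a block that has already been 
read into memory.  The struct and function names are invented for this 
example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical extent: one line located inside a larger read block. */
struct extent {
    const char *block;  /* address of the block holding the line */
    size_t offset;      /* where the line starts within the block */
    size_t length;      /* line length, including the '\n' if present */
};

/* Record up to max extents for the lines in block[0..len) and return how
 * many were recorded.  A final line without a newline is recorded too. */
size_t split_lines(const char *block, size_t len,
                   struct extent *out, size_t max)
{
    size_t count = 0, start = 0;

    assert(block != NULL && out != NULL);
    while (start < len && count < max) {
        const char *nl = memchr(block + start, '\n', len - start);
        size_t end = nl ? (size_t) (nl - block) + 1 : len;

        out[count].block = block;
        out[count].offset = start;
        out[count].length = end - start;
        count++;
        start = end;
    }
    return count;
}
```

No line data is copied here; the extents only point into the block.  A real 
reader would still have to deal with a line that straddles two blocks, which 
this sketch leaves out.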

Registration currently doesn't keep a message in memory, though 
classification's passthrough mode does.  Allowing registration to cache the 
message in memory (as passthrough requires) is a simplification of the 
code that I think is probably worthwhile.

Mailbox registration currently works by tokenizing each message (in a 
wordhash structure), then merging the individual message wordhashes into a 
master wordhash.  At the end of the mailbox, all the tokens in the master 
wordhash are added to the database.

The same plan would continue to work with the new implementation.  The 
change would be that the reader creates the master wordhash, then loops for 
each message to read it, parse it, and add it to the master wordhash.  At 
EOF, the reader sees to it that the master wordhash is added to the database.

At present, for maildirs and bulk mode, files are treated individually.  If 
there are lots of files, then the database is updated many times.  Under 
the new design, for maildirs (and bulk mode), the reader would create a 
master wordhash, then loop over the file list doing read/parse/add.  When 
done reading files, the wordhash is added to the database.
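
The read/parse/add loop with a single database update at the end could look 
roughly like this.  Everything here is a stand-in: the fixed-size wordhash, 
the whitespace tokenizer (in place of the real lexer), and db_update (in 
place of the real database write) are all assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WH_SLOTS  256  /* fixed capacity keeps the sketch short */
#define WH_MAXTOK 32   /* longest token stored, including the NUL */

/* Hypothetical stand-in for bogofilter's wordhash: token -> count. */
struct wordhash {
    char token[WH_SLOTS][WH_MAXTOK];
    unsigned count[WH_SLOTS];  /* 0 marks an empty slot */
};

/* Slot holding tok, or the empty slot where it would go;
 * WH_SLOTS if the table is full. */
static size_t wh_probe(const struct wordhash *wh, const char *tok)
{
    size_t h = 5381, i;
    const char *p;

    for (p = tok; *p; p++)
        h = h * 33 + (unsigned char) *p;
    for (i = 0; i < WH_SLOTS; i++) {
        size_t s = (h + i) % WH_SLOTS;
        if (wh->count[s] == 0 || strcmp(wh->token[s], tok) == 0)
            return s;
    }
    return WH_SLOTS;
}

/* Add n (>= 1) occurrences of tok.  Returns 0 on success, -1 on failure. */
int wh_add(struct wordhash *wh, const char *tok, unsigned n)
{
    size_t s;

    if (n == 0 || strlen(tok) >= WH_MAXTOK)
        return -1;
    s = wh_probe(wh, tok);
    if (s == WH_SLOTS)
        return -1;
    if (wh->count[s] == 0)
        strcpy(wh->token[s], tok);
    wh->count[s] += n;
    return 0;
}

unsigned wh_get(const struct wordhash *wh, const char *tok)
{
    size_t s = wh_probe(wh, tok);
    return s == WH_SLOTS ? 0 : wh->count[s];
}

/* Whitespace tokenizer standing in for the lexer; overlong tokens are
 * skipped. */
void tokenize(const char *msg, struct wordhash *wh)
{
    char tok[WH_MAXTOK];

    while (*msg) {
        size_t n;
        msg += strspn(msg, " \t\r\n");
        n = strcspn(msg, " \t\r\n");
        if (n > 0 && n < WH_MAXTOK) {
            memcpy(tok, msg, n);
            tok[n] = '\0';
            wh_add(wh, tok, 1);
        }
        msg += n;
    }
}

unsigned db_updates = 0;  /* counts database writes, illustration only */
void db_update(const struct wordhash *wh) { (void) wh; db_updates++; }

/* Reader loop: read/parse/add per message, one database update at the end. */
void register_all(const char *const *msgs, size_t nmsgs,
                  struct wordhash *master)
{
    size_t i, s;

    memset(master, 0, sizeof *master);
    for (i = 0; i < nmsgs; i++) {
        struct wordhash one;  /* per-message wordhash */
        memset(&one, 0, sizeof one);
        tokenize(msgs[i], &one);
        for (s = 0; s < WH_SLOTS; s++)  /* merge into the master */
            if (one.count[s] != 0)
                wh_add(master, one.token[s], one.count[s]);
    }
    db_update(master);
}
```

Whether the messages come from one mailbox or from a list of maildir files 
makes no difference to the loop, and the database is touched once no matter 
how many messages are read.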

AFAICT the same algorithm applies to registration from mailboxes and 
maildirs.  The difference is that maildir handling improves considerably 
over the current implementation, because the database is updated once 
rather than once per file.

Classification continues to operate on one message at a time.  The reading 
mechanisms above should work as well for classification as they do for 
registration.