thought

Fri May 9 22:43:22 CEST 2003

David Relson <relson at osagesoftware.com> writes:

> My other thought is a bogofilter daemon, to be used via a client-server
> interaction.  The daemon (server) would always be running and
> classification (or registration) would consist of starting the front-end
> (client), passing the message (or tokens or something) to the server,
> and receiving the result.  Is this what you mean?

Close. The idea of "frontend" talking to a "backend" (which you call
client and server) communicates the idea of distinct modules for
distinct types of work. I wouldn't necessarily bind the "backend" to be
a "server" (as in "network server") yet.

A long time ago, I asked about a library that could be used from other
application software. A milter application would come to mind, to name
one possible use. This would be another kind of backend that we could
use.

We currently have the "all-in-one" bogofilter with some dozens of
options, which has been criticized more than once.

One possible protocol between front- and backend would be like
this. It's not thought-out, but the first draft that comes to mind is a
framing protocol, with variable length frames.

1. the first character (1 octet) determines the type of the frame:
   m - mail
   e - end
   x - extension

2. for m-type frames, we'd have the payload length in bytes as a
   hexadecimal number, then a LF, then the raw payload, then a '.'

(This borrows a bit from DJB's netstrings,
http://cr.yp.to/proto/netstrings.txt, but I do think hexadecimal is more
efficient than netstrings' decimal numbers: a hexadecimal digit encodes
four bits, which usually divides the machine's word width evenly.)

3. e-type frames consist only of the type field, the 'e', and a '.'.

4. x-type frames are reserved for future extension and are currently
   unspecified. They end in a '.'
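
The m- and e-type frames described above could be encoded and parsed
roughly as follows. This is only a sketch in Python, not bogofilter code:
the names encode_mail, encode_end and read_frame are made up for
illustration, and x-type frames are rejected because their layout is
still unspecified.

```python
import io


def encode_mail(payload: bytes) -> bytes:
    """Build an m-type frame: 'm', hex length, LF, raw payload, '.'."""
    length = format(len(payload), "x").encode("ascii")
    return b"m" + length + b"\n" + payload + b"."


def encode_end() -> bytes:
    """Build an e-type frame: just the type character and the '.'."""
    return b"e."


def read_frame(stream):
    """Read one frame from a binary stream and return (type, payload).

    Hypothetical helper: raises ValueError on anything that does not
    match the draft framing rules.
    """
    kind = stream.read(1)
    if kind == b"m":
        length_line = stream.readline()  # hex digits terminated by LF
        if not length_line.endswith(b"\n"):
            raise ValueError("truncated length field")
        length = int(length_line[:-1].decode("ascii"), 16)
        payload = stream.read(length)
        if len(payload) != length:
            raise ValueError("truncated payload")
    elif kind == b"e":
        payload = b""
    else:
        # Covers x-type frames (unspecified so far) and garbage input.
        raise ValueError("unsupported frame type %r" % kind)
    if stream.read(1) != b".":
        raise ValueError("missing '.' terminator")
    return kind.decode("ascii"), payload


if __name__ == "__main__":
    wire = encode_mail(b"Subject: test\n\nhello\n") + encode_end()
    stream = io.BytesIO(wire)
    print(read_frame(stream))
    print(read_frame(stream))
```

Because the length comes first, the reader knows how many bytes to expect
and never has to scan the payload for a terminator; the trailing '.' only
serves as a consistency check.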

I wonder if we need request fields in the m-type frames (or c-type
frames, configuration type) to tell the backend what information it is
supposed to retrieve and send back, and if it is supposed to register
data or operate in classify-only mode.

For each frame sent to the backend, the backend would send a reply. I
haven't yet thought about what a good reply protocol would be, because
of the verbose stuff and all that.

Anyway, there may well be better protocols, and we should collect some
ideas and think about them before making a decision.

Of course, if we are within the same binary program (i.e. the modules
are linked together at compile time), the framing protocol causes
additional overhead that can be avoided -- and we don't want huge
buffers either because we can't afford a big memory footprint.

The basic idea is making a more distinct separation of "preprocessing"
(chop mbox into messages and iterate over messages, iterate over
maildir, ...)  and "main processing" (classify, register) stages. The
first step towards this concept would be to draw the border lines (*),
specify the interfaces; the code comes only after that. It seems that
you have done some cleanup work already (in main.c) recently; but I'm
lagging behind when it comes to knowing what the code's current shape
is.

(*) I believe we might then have to face (and solve) the issue "how to
    feed a stream through the various stages". Some random thoughts are
    "threads", "fork" with pipes or unix-domain sockets, some driver and
    buffers.
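
To make the "pipes" variant a bit more concrete, here is a rough Python
sketch of a frontend stage feeding a backend stage through a pipe. It
assumes the draft m/e framing from this mail and uses a thread plus
os.pipe for brevity; fork or a unix-domain socket would look much the
same. The names frontend, backend and run_pipeline are invented for the
example, and the "processing" is a placeholder that merely records each
payload's size.

```python
import os
import threading


def frontend(messages, out):
    """Preprocessing stage: write each message as an m-type frame, then 'e.'."""
    with out:
        for msg in messages:
            out.write(b"m%x\n" % len(msg) + msg + b".")
        out.write(b"e.")


def backend(inp):
    """Main-processing stage: read frames until the e-type frame arrives.

    A real backend would classify or register here; collecting the
    payload sizes is only a stand-in.
    """
    sizes = []
    with inp:
        while True:
            kind = inp.read(1)
            if kind == b"m":
                length = int(inp.readline().strip().decode("ascii"), 16)
                payload = inp.read(length)
                sizes.append(len(payload))
            elif kind != b"e":
                raise ValueError("unexpected frame type %r" % kind)
            if inp.read(1) != b".":
                raise ValueError("missing '.' terminator")
            if kind == b"e":
                return sizes


def run_pipeline(messages):
    """Connect the two stages with a pipe; the frontend runs in a thread."""
    read_fd, write_fd = os.pipe()
    writer = threading.Thread(
        target=frontend, args=(messages, os.fdopen(write_fd, "wb")))
    writer.start()
    try:
        return backend(os.fdopen(read_fd, "rb"))
    finally:
        writer.join()


if __name__ == "__main__":
    print(run_pipeline([b"first message\n", b"second message\n"]))
```

The pipe gives flow control for free: when the backend falls behind, the
frontend blocks once the kernel's pipe buffer is full, so neither side
needs to hold the whole input in memory.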

-- 
Matthias Andree