lexer, tokens, and content-types

David Relson relson at osagesoftware.com
Mon Dec 9 03:52:06 CET 2002


At 08:45 PM 12/8/02, Matthias Andree wrote:

>David Relson <relson at osagesoftware.com> writes:
>
> > Matthias & Gyepi,
> >
> > Looking at eps, I notice several things.
> >
> > In content.h are #defined lots of content-types,
> > content-transfer-encoding types, and content-disposition types.  In
> > content.c are tables to convert types (as strings) to the defined names.
> > I also see files for converting mime and base64.  What I don't see are
> > connections between the #defines and the code.  For example eps doesn't
> > give us the connection from ENC_BASE64 to base64.c.  It seems we must
> > write our own "glue" code.  I don't see a #define for UUENCODE or code
> > for processing other types they recognize, e.g. 7BIT, 8BIT, QP, or RAW.
> >
> > There is a lot there, but eps isn't complete.  If we want completeness,
> > we need to extend their package (perhaps a whole lot) or use another
> > package.
>
>So we'd go on searching and kill EPS. Can we use C++?

Matthias,

Let's wait to hear Gyepi's opinion.  He has used eps in other projects and 
knows more about it than we do.

I don't expect to find a perfect solution, i.e. one that can just be
dropped in and will do everything we want.  If we can find a good (partial)
solution, that'd be better than doing it all ourselves.  I figure we have
other things to do besides reinventing the wheel, don't we?

Regarding C++, I will say that I am strongly opposed to using it.  Using C
helps keep the project portable, and I value that a whole lot.  I also value
objects and object-oriented design.  I'm even willing to use
"pseudo-objects" like the ones we have in the interface to
method/graham/robinson/fisher.  In the case of bogofilter, I prefer being
portable to using objects.

>I know Postfix has a good parser, but its license (IBM public license)
>is incompatible with the GNU GPL,
>http://www.gnu.org/licenses/license-list.html
>
>Sucks, but that's what it is.
>
>UUDeview ships with a library and is GPL software and looks more
>complete, it claims to support uuencode, xxencode, MIME encodings (qp,
>base64), yenc, ... http://www.fpx.de/fp/Software/UUDeview/
>
>However it was originally aimed at NewsReaders, we'd need to figure if
>it's good enough for bogofilter.

bogofilter's design calls for input on stdin and output on stdout, with the
two streams being identical except for the X-Bogosity header line(s) added
by the passthrough option.  Handling encoded content means passing a decoded
stream to the lexer.  The processing thus becomes (roughly sketched after
the list):

         1. read stdin
         2. if passthrough, save for output
         3. expand for lexer/tokenizer, normalize character set, etc
         4. collect words and compute spam score
         5. if passthrough, output with included X-Bogosity info
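
A rough sketch of that flow, just to make the shape concrete.  None of the
decode/score steps are filled in, and the X-Bogosity placement is simplified
(the real code would insert it among the message's other headers):

    #include <stdio.h>

    /* sketch only: steps 3 and 4 are left as comments */
    int main(void)
    {
        int    passthrough = 1;           /* pretend -p was given */
        char   line[4096];
        double spamicity = 0.0;
        FILE  *saved = passthrough ? tmpfile() : NULL;

        /* 1. read stdin, 2. save a copy for later output */
        while (fgets(line, sizeof line, stdin) != NULL) {
            if (saved != NULL)
                fputs(line, saved);
            /* 3. decode attachments / normalize charsets,
             * 4. hand tokens to the lexer and update spamicity */
        }

        /* 5. if passthrough, echo the message with X-Bogosity added */
        if (saved != NULL) {
            printf("X-Bogosity: %s, spamicity=%0.6f\n",
                   spamicity > 0.5 ? "Yes" : "No", spamicity);
            rewind(saved);
            while (fgets(line, sizeof line, saved) != NULL)
                fputs(line, stdout);
            fclose(saved);
        }
        return 0;
    }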

The first two and last two steps are presently fine.  No doubt there will 
be refinements and enhancements as time goes by.

The task at hand is the third step, which can be further broken down (a
sketch of the decoding glue for step a follows the list):

         a. decode attachments
         b. normalize character sets (unicode?)
         c. parse via lexer.l
         d. tokenize - header prefixes, subnet tokens, etc
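
To make step (a) and the missing "glue" concrete, something like the
following dispatch is what I picture.  The ENC_* values are stand-ins for
the #defines in eps's content.h, the quoted-printable routine is a
bare-bones illustration, and the base64 stub would really be a wrapper
around eps's base64.c:

    #include <stddef.h>
    #include <string.h>

    /* stand-ins for the #defines in eps's content.h */
    enum { ENC_7BIT, ENC_8BIT, ENC_QP, ENC_BASE64 };

    typedef size_t (*decoder_t)(const char *in, size_t len, char *out);

    /* 7BIT/8BIT/RAW: nothing to decode, pass the bytes through */
    static size_t decode_copy(const char *in, size_t len, char *out)
    {
        memcpy(out, in, len);
        return len;
    }

    static int hexval(int c)
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        return -1;
    }

    /* bare-bones quoted-printable: "=XX" -> byte, "=\n" -> soft break */
    static size_t decode_qp(const char *in, size_t len, char *out)
    {
        size_t i = 0, o = 0;
        while (i < len) {
            if (in[i] == '=' && i + 1 < len && in[i + 1] == '\n') {
                i += 2;
            } else if (in[i] == '=' && i + 2 < len
                       && hexval(in[i + 1]) >= 0 && hexval(in[i + 2]) >= 0) {
                out[o++] = (char)(hexval(in[i + 1]) * 16 + hexval(in[i + 2]));
                i += 3;
            } else {
                out[o++] = in[i++];
            }
        }
        return o;
    }

    /* placeholder: the real thing would call into eps's base64.c */
    static size_t decode_base64(const char *in, size_t len, char *out)
    {
        (void)in; (void)len; (void)out;
        return 0;
    }

    /* the "glue": map a content-transfer-encoding to its decoder */
    static decoder_t pick_decoder(int encoding)
    {
        switch (encoding) {
        case ENC_QP:     return decode_qp;
        case ENC_BASE64: return decode_base64;
        default:         return decode_copy;  /* 7BIT, 8BIT, RAW, ... */
        }
    }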

Given good info from steps a & b, I'm thinking the current parsing (step c)
is ok.  eps is a possible (partial) solution for (a), and iconv is a
possible solution for (b); a quick sketch of iconv's use follows below.
Don't forget my charset code as a beginning for (b), though as I said above,
if somebody else's solution solves our problem, we should consider using it.
Step d is mostly new territory, with needs I don't yet know, though I've
heard the spambayes project has interesting ideas there.
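
For (b), here's the kind of iconv use I have in mind.  Picking UTF-8 as the
normalization target and the function name are just my assumptions for the
sketch, and the error handling is deliberately crude:

    #include <errno.h>
    #include <iconv.h>
    #include <stdlib.h>

    /* Convert `in' (inlen bytes in the message's declared charset) to a
     * malloc'd, NUL-terminated UTF-8 string; returns NULL for an unknown
     * charset.  Undecodable bytes are simply skipped. */
    char *to_utf8(const char *charset, const char *in, size_t inlen)
    {
        iconv_t cd;
        size_t  outlen, inleft, outleft;
        char   *out, *inp, *outp;

        cd = iconv_open("UTF-8", charset);
        if (cd == (iconv_t)(-1))
            return NULL;

        outlen  = inlen * 4 + 1;          /* generous upper bound */
        out     = malloc(outlen);
        inp     = (char *)in;
        outp    = out;
        inleft  = inlen;
        outleft = outlen - 1;

        while (inleft > 0) {
            if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)(-1)) {
                if (errno == EILSEQ || errno == EINVAL) {
                    inp++;                /* skip the offending byte */
                    inleft--;
                    continue;
                }
                break;                    /* out of space or other error */
            }
        }
        *outp = '\0';
        iconv_close(cd);
        return out;
    }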

That's enough pontificating for now.

David




