lexer, tokens, and content-types
David Relson
relson at osagesoftware.com
Mon Dec 9 03:52:06 CET 2002
At 08:45 PM 12/8/02, Matthias Andree wrote:
>David Relson <relson at osagesoftware.com> writes:
>
> > Matthias & Gyepi,
> >
> > Looking at eps, I notice several things.
> >
> > In content.h are #defined lots of content-types,
> > content-transfer-encoding types, and content-disposition types. In
> > content.c are tables to convert types (as strings) to the defined names.
> > I also see files for converting MIME and base64. What I don't see are
> > connections between the #defines and the code. For example eps doesn't
> > give us the connection from ENC_BASE64 to base64.c. It seems we must
> > write our own "glue" code. I don't see a #define for UUENCODE or code
> > for processing other types they recognize, e.g. 7BIT, 8BIT, QP, or RAW.
> >
> > There is a lot there, but eps isn't complete. If we want completeness,
> > we need to extend their package (perhaps a whole lot) or use another
> > package.
>
>So we'd go on searching and kill EPS. Can we use C++?
Matthias,
Let's wait to hear Gyepi's opinion. He has used eps in other projects and
knows more about it than we do.
I don't expect to find a perfect solution, i.e. one that can just be
dropped in and will do everything we want. If we can find a good (partial)
solution, that'd be better than doing it all by ourselves. I figure we
have other things to do besides reinvent the wheel, don't we?
Regarding C++, I will say that I am strongly opposed to using it. Using C
helps keep the project portable and I value that a whole lot. I also value
objects and object oriented design. I'm even willing to use
"pseudo-objects" like we have in the interface to
method/graham/robinson/fisher. In the case of bogofilter, I prefer being
portable to using objects.
>I know Postfix has a good parser, but its license (IBM public license)
>is incompatible with the GNU GPL:
>http://www.gnu.org/licenses/license-list.html
>
>Sucks, but that's what it is.
>
>UUDeview ships with a library and is GPL software and looks more
>complete, it claims to support uuencode, xxencode, MIME encodings (qp,
>base64), yenc, ... http://www.fpx.de/fp/Software/UUDeview/
>
>However, it was originally aimed at newsreaders, so we'd need to figure
>out whether it's good enough for bogofilter.
bogofilter's design calls for input to stdin and output to stdout, with the
two streams being the same (except for the X-Bogosity header line(s)
specified by the passthrough option). Handling encoded input means
passing a decoded stream to the lexer. The processing thus becomes:
1. read stdin
2. if passthrough, save for output
3. expand for lexer/tokenizer, normalize character set, etc
4. collect words and compute spam score
5. if passthrough, output with included X-Bogosity info
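The five steps above could be glued together roughly as follows in C. This is only a sketch: decode_message and spam_score are hypothetical stubs standing in for steps 3 and 4, not bogofilter code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Step 3 stub: decode/normalize the raw message for the lexer.
 * Here it is just an identity copy. */
static char *decode_message(const char *raw)
{
    size_t n = strlen(raw) + 1;
    char *copy = malloc(n);
    if (copy)
        memcpy(copy, raw, n);
    return copy;
}

/* Step 4 stub: collect words and compute a spam score.
 * A toy heuristic stands in for the real classifier. */
static double spam_score(const char *decoded)
{
    return strstr(decoded, "viagra") ? 0.99 : 0.01;
}

/* Steps 1-5 for one message already read into memory (step 1). */
static void process(const char *raw, int passthrough, FILE *out)
{
    char  *decoded = decode_message(raw);   /* step 3 */
    double score   = decoded ? spam_score(decoded) : 0.0;  /* step 4 */

    if (passthrough) {  /* step 5: prepend X-Bogosity, echo the saved copy (step 2) */
        fprintf(out, "X-Bogosity: %s, spamicity=%0.6f\n",
                score > 0.5 ? "Yes" : "No", score);
        fputs(raw, out);
    }
    free(decoded);
}
```

The point is that only step 3 touches the decoded stream; the passthrough output is always the verbatim copy saved in step 2.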
The first two and last two steps are presently fine. No doubt there will
be refinements and enhancements as time goes by.
The task at hand is the third step, which can be further broken down:
a. decode attachments
b. normalize character sets (Unicode?)
c. parse via lexer.l
d. tokenize - header prefixes, subnet tokens, etc
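The "glue" code that eps leaves to us (step a) might amount to a small dispatch table from encoding types to decoder functions. A sketch: the ENC_* tags are modeled on eps's content.h defines, only a toy quoted-printable decoder is filled in, and the other entries are identity stubs rather than eps's base64.c or a real uudecode.

```c
#include <stddef.h>
#include <string.h>

/* Encoding tags, modeled on eps's content.h #defines. */
enum { ENC_7BIT, ENC_8BIT, ENC_QP, ENC_BASE64, ENC_UUENCODE, ENC_MAX };

/* Decoder signature: decode src (len bytes) into dst, return bytes written. */
typedef size_t (*decoder_fn)(char *dst, const char *src, size_t len);

/* 7bit/8bit bodies pass through unchanged. */
static size_t decode_identity(char *dst, const char *src, size_t len)
{
    memcpy(dst, src, len);
    return len;
}

static int hexval(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Minimal quoted-printable decoder (soft line breaks ignored for brevity). */
static size_t decode_qp(char *dst, const char *src, size_t len)
{
    size_t i = 0, o = 0;
    while (i < len) {
        if (src[i] == '=' && i + 2 < len &&
            hexval(src[i + 1]) >= 0 && hexval(src[i + 2]) >= 0) {
            dst[o++] = (char)(hexval(src[i + 1]) * 16 + hexval(src[i + 2]));
            i += 3;
        } else {
            dst[o++] = src[i++];
        }
    }
    return o;
}

/* The missing "glue": encoding type -> decoder. Real entries would point
 * at base64.c, a uudecode routine, etc.; stubs here. */
static const decoder_fn decoders[ENC_MAX] = {
    [ENC_7BIT]     = decode_identity,
    [ENC_8BIT]     = decode_identity,
    [ENC_QP]       = decode_qp,
    [ENC_BASE64]   = decode_identity,   /* stub: would be the base64.c decoder */
    [ENC_UUENCODE] = decode_identity,   /* stub: would be a uudecode routine  */
};

static size_t decode_body(int enc, char *dst, const char *src, size_t len)
{
    return decoders[enc](dst, src, len);
}
```

With a table like this, adding UUENCODE or RAW support is one new function plus one table entry, which is the completeness eps currently lacks.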
Given good info from steps a & b, I'm thinking the current parsing (step c)
is ok. eps is a possible (partial) solution for (a) and iconv a possible
solution for (b). Don't forget my charset code as a beginning for (b),
though as I said (above), if somebody else's solution will solve our
problem, we should consider using their solution. Step d is mostly new
territory, with needs I don't yet know, though I've heard the spambayes
project has interesting ideas for that.
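For step (b), iconv could be wrapped along these lines. A minimal sketch that converts one body to UTF-8; error handling is abbreviated and the wrapper name is my own, not an iconv API.

```c
#include <iconv.h>
#include <string.h>

/* Convert src (srclen bytes, charset from_cs) to UTF-8 in dst (dstlen bytes).
 * Returns bytes written, or (size_t)-1 on failure. */
static size_t to_utf8(const char *from_cs, const char *src, size_t srclen,
                      char *dst, size_t dstlen)
{
    iconv_t cd = iconv_open("UTF-8", from_cs);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char  *in      = (char *)src;   /* iconv takes a non-const pointer */
    char  *out     = dst;
    size_t inleft  = srclen;
    size_t outleft = dstlen;

    size_t rc = iconv(cd, &in, &inleft, &out, &outleft);
    iconv_close(cd);
    return rc == (size_t)-1 ? (size_t)-1 : (size_t)(out - dst);
}
```

For example, the Latin-1 byte 0xE9 ("é") comes out as the two UTF-8 bytes 0xC3 0xA9, after which the lexer sees one consistent character set.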
That's enough pontificating for now.
David