lexer, tokens, and content-types

Matthias Andree matthias.andree at gmx.de
Mon Dec 9 01:47:34 CET 2002


Gyepi SAM <gyepi at praxis-sw.com> writes:

> 1. Change the lexer to get its input from a buffer instead of a FILE *

> 2. Change bogofilter so it uses an external library to decode incoming
> mail into a buffer, then passes the buffer to the lexer.  The lexer
> would never have to learn about Content-Types or any encoding method
> since it only sees filtered, decoded data.  The filter could even do
> stuff like creating pseudo-headers, so a header like "Subject: Make
> money fast" would turn into Subject:make, Subject:money, Subject:fast,
> which may provide more context for a word and may be a better indicator
> than the word alone.  Keep in mind that the data we end up tokenising
> does not have to be in a valid format; it merely needs to be decoded.
>
> I plan to start writing some code for this soon, and unless there are
> any objections, I intend to use EPS. It is simple, small, and easy to
> deal with. From what I have seen of GMime, I cannot say the same of
> it.

Objection. EPS has some obvious bugs at first glance. We'd need to FULLY
scrutinize EPS for RFC compliance first.
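
Whichever library we settle on, point 1 above is mostly mechanical with a
flex-generated lexer. A minimal sketch (yy_scan_bytes() and
yy_delete_buffer() are standard flex API; lex_buffer() is a hypothetical
wrapper, not existing bogofilter code):

    /* Feed the lexer from a memory buffer instead of a FILE *.
     * These declarations are normally provided by the flex-generated
     * scanner itself. */
    typedef struct yy_buffer_state *YY_BUFFER_STATE;
    extern YY_BUFFER_STATE yy_scan_bytes(const char *bytes, int len);
    extern void yy_delete_buffer(YY_BUFFER_STATE b);
    extern int yylex(void);

    /* Hypothetical entry point: lex an already-decoded message in memory. */
    void lex_buffer(const char *buf, int len)
    {
        YY_BUFFER_STATE state = yy_scan_bytes(buf, len); /* copies buf */
        while (yylex() != 0)
            ;                /* token handling lives in the lexer actions */
        yy_delete_buffer(state);
    }

Note that yy_scan_bytes() copies its input; flex's yy_scan_buffer() scans
in place instead, but requires a buffer ending in two NUL bytes.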

I'd also tend to say "let's avoid copying data around", because that is
what makes things slow. Let's try to get by with as few passes as
possible.
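
To make that concrete: quoted-printable, for one, can be decoded in
place, because the decoded text is never longer than the encoded text.
An illustrative decoder, not bogofilter code, with only minimal error
handling:

    #include <stddef.h>

    static int hexval(int c)
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        return -1;
    }

    /* Decode buf[0..len) in place; return the decoded length. */
    size_t qp_decode_inplace(char *buf, size_t len)
    {
        size_t in = 0, out = 0;
        while (in < len) {
            if (buf[in] == '=' && in + 2 < len) {
                int hi = hexval(buf[in + 1]), lo = hexval(buf[in + 2]);
                if (hi >= 0 && lo >= 0) {          /* =XX escape */
                    buf[out++] = (char)(hi * 16 + lo);
                    in += 3;
                    continue;
                }
                if (buf[in + 1] == '\r' && buf[in + 2] == '\n') {
                    in += 3;                       /* soft line break */
                    continue;
                }
            }
            buf[out++] = buf[in++];
        }
        return out;
    }

Base64 shrinks as it decodes, so it can be handled the same way.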

> So, in conclusion, I would suggest that we think about redesigning the
> internal structure of bogofilter to allow for the use of an external
> decoding library.

That's what it boils down to one way or the other.
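
As for the pseudo-header idea, the transform itself is tiny. A sketch,
assuming some emit_token() sink further down the pipeline (the name is
hypothetical, and the word scan is deliberately simplistic):

    #include <ctype.h>
    #include <stdio.h>

    void emit_token(const char *tok) { puts(tok); } /* stand-in sink */

    /* Prefix each word of a header body with the header's name, so
     * "Subject: Make money fast" yields Subject:make, Subject:money,
     * Subject:fast. Header names are assumed to be short. */
    void pseudo_header_tokens(const char *name, const char *body)
    {
        char tok[256];
        size_t n;

        while (*body) {
            while (*body && !isalnum((unsigned char)*body))
                body++;                          /* skip separators */
            if (!*body)
                break;
            n = snprintf(tok, sizeof tok, "%s:", name);
            while (*body && isalnum((unsigned char)*body) && n < sizeof tok - 1)
                tok[n++] = (char)tolower((unsigned char)*body++);
            tok[n] = '\0';
            emit_token(tok);
        }
    }

pseudo_header_tokens("Subject", "Make money fast") then emits exactly the
three tokens above.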

-- 
Matthias Andree
