lexer, tokens, and content-types

Scott Lenser slenser at cs.cmu.edu
Mon Dec 9 22:10:12 CET 2002


> Gyepi SAM <gyepi at praxis-sw.com> writes:
> 
> > 1. Change the lexer to get its input from a buffer instead of a FILE *
> 
> > 2. Change bogofilter so it uses an external library to decode incoming
> > mail into a buffer, then passes the buffer to the lexer.  The lexer
> > would never have to learn about Content-Types or any encoding method
> > since it only sees filtered, decoded, data.  The filter could even do
> > stuff like creating pseudo headers so a header like "Subject: Make
> > money fast" would turn into Subject:make, Subject:money, Subject:fast
> > which may provide more context for a word and may be a better indicator
> > than the word alone.  Keep in mind that the data we end up tokenising
> > does not have to be in a valid format, it merely needs to be decoded.
> >
> > I plan to start writing some code for this soon, and unless there are
> > any objections, I plan to use eps. It is simple, small, and easy to
> > deal with. From what I have seen of gmime, I cannot say the same of
> > it.
> 
> Objection. EPS has some obvious bugs at first glance. We'd need to FULLY
> scrutinize EPS for RFC compliance first.
> 
> I'd also tend to say "let's avoid copying data around", because that is
> what makes things slow. Let's try to get along with as few passes as
> possible.
> 
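For reference, the pseudo-header idea quoted above could look roughly like
the sketch below.  This is only an illustration: pseudo_header_tokens() is a
made-up name, not anything in bogofilter, and splitting on whitespace is an
assumption about how the words would be separated.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not bogofilter code: split a header value into
 * whitespace-separated words and write "Name:word" tokens (word
 * lowercased), one per line, into out.  Returns the number of tokens
 * written; stops early if out is full. */
static int pseudo_header_tokens(const char *name, const char *value,
                                char *out, size_t outsz)
{
    int count = 0;
    size_t used = 0;
    const char *p = value;

    assert(name != NULL && value != NULL && out != NULL && outsz > 0);
    out[0] = '\0';

    while (*p != '\0') {
        /* skip whitespace between words */
        while (*p != '\0' && isspace((unsigned char)*p))
            p++;
        if (*p == '\0')
            break;

        /* find the end of the word */
        const char *start = p;
        while (*p != '\0' && !isspace((unsigned char)*p))
            p++;
        size_t wlen = (size_t)(p - start);
        size_t need = strlen(name) + 1 + wlen + 1;  /* name ':' word '\n' */

        if (used + need + 1 > outsz)                /* +1 for the NUL */
            break;

        used += (size_t)sprintf(out + used, "%s:", name);
        for (size_t i = 0; i < wlen; i++)
            out[used++] = (char)tolower((unsigned char)start[i]);
        out[used++] = '\n';
        out[used] = '\0';
        count++;
    }
    return count;
}
```

Called with "Subject" and "Make money fast", this fills the buffer with the
three tokens from the example, one per line.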

I've been using libgmime, and here are my opinions on it.  The interface is
slightly complicated but very flexible, and on the whole it is well designed
and fairly intuitive.  The main problem is that while all the methods are
well documented as to what types they take, the documentation sometimes
lacks information on the intended way of doing things.  Some examples are
included, though mostly in the form of unit tests.  I have a version derived
from bogofilter 0.7 using libgmime 1.90.6 which grabs everything except the
content-encoding fields.  I haven't bothered to track down how to get access
to the content-encoding field, since the library properly decodes the content
in all cases.  The library seems to be under pretty active development and
has good support for pretty much everything I can think of wanting.  It
supports a wide variety of RFCs and its developers seem very interested in
following the standards.  It has a _lot_ of features and is GPL'd.

The following data sources are supported:

memory buffer
FILE *
FILE * with offsets to begin/end of range to use
file descriptor
file descriptor with offset to begin/end of range to use
memory mapped file
memory mapped file with offset to begin/end of range to use
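
To make the "FILE * with offsets" source concrete, here is a rough sketch in
plain stdio of pulling a byte range out of a FILE * into a memory buffer.
It does not use the GMime API at all; read_range() is a hypothetical helper
and the half-open [begin, end) convention is my assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not a GMime function): read bytes [begin, end)
 * of fp into a freshly malloc'd, NUL-terminated buffer.  Stores the
 * number of bytes actually read in *len.  Returns NULL on error; the
 * caller frees the result. */
static char *read_range(FILE *fp, long begin, long end, size_t *len)
{
    assert(fp != NULL && len != NULL);
    if (begin < 0 || end < begin)
        return NULL;
    if (fseek(fp, begin, SEEK_SET) != 0)
        return NULL;

    size_t want = (size_t)(end - begin);
    char *buf = malloc(want + 1);
    if (buf == NULL)
        return NULL;

    *len = fread(buf, 1, want, fp);   /* may be short at end of file */
    buf[*len] = '\0';
    return buf;
}
```

A lexer that reads from a buffer could then be handed just the body of a
message, for example, without ever seeing the headers.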

The following types of encoding are handled:

7bit
8bit
binary
base64
quoted printable
uuencode
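
As a reminder of what "decoding" means for the lexer, below is a small
quoted-printable decoder.  It is an illustration only, not GMime's code;
qp_decode() and hexval() are names I made up, and passing malformed escapes
through unchanged is an assumption rather than anything the library
promises.

```c
#include <assert.h>
#include <stddef.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hexval(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Illustrative quoted-printable decoder (not GMime code).  Decodes
 * src[0..n) into dst, which must hold at least n bytes, and returns the
 * decoded length.  "=XX" becomes one byte; "=" followed by a line break
 * is a soft break and is dropped; malformed escapes are copied through
 * unchanged. */
static size_t qp_decode(const char *src, size_t n, char *dst)
{
    size_t i = 0, o = 0;

    assert(src != NULL && dst != NULL);
    while (i < n) {
        if (src[i] != '=') {
            dst[o++] = src[i++];
        } else if (i + 1 < n && src[i + 1] == '\n') {
            i += 2;                                 /* soft break, LF */
        } else if (i + 2 < n && src[i + 1] == '\r' && src[i + 2] == '\n') {
            i += 3;                                 /* soft break, CRLF */
        } else if (i + 2 < n && hexval((unsigned char)src[i + 1]) >= 0 &&
                   hexval((unsigned char)src[i + 2]) >= 0) {
            dst[o++] = (char)(hexval((unsigned char)src[i + 1]) * 16 +
                              hexval((unsigned char)src[i + 2]));
            i += 3;
        } else {
            dst[o++] = src[i++];                    /* malformed: keep '=' */
        }
    }
    return o;
}
```

The point is that the tokenizer only ever sees the decoded bytes, so it
never needs to know which encoding the part arrived in.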

It supports a large number of RFCs.  From the libgmime README:

WHAT IS GMIME
-------------

GMime is a set of utilities for parsing and creating messages using
the Multipurpose Internet Mail Extension (MIME) as defined by the
following RFCs:

 * 0822: Standard for the Format of Arpa Internet Text Messages
 * 1521: MIME (Multipurpose Internet Mail Extensions) Part One:
         Mechanisms for Specifying and Describing the Format of 
         Internet Message Bodies
 * 1864: The Content-MD5 Header Field (Obsoletes rfc1544)
 * 2045: Multipurpose Internet Mail Extensions (MIME) Part One:
         Format of Internet Message Bodies
 * 2046: Multipurpose Internet Mail Extensions (MIME) Part Two:
         Media Types
 * 2047: Multipurpose Internet Mail Extensions (MIME) Part Three:
         Message Header Extensions for Non-ASCII Text
 * 2048: Multipurpose Internet Mail Extensions (MIME) Part Four:
         Registration Procedures
 * 2049: Multipurpose Internet Mail Extensions (MIME) Part Five:
         Conformance Criteria and Examples
 * 2183: Communicating Presentation Information in Internet Messages:
         The Content-Disposition Header Field
 * 2184: MIME Parameter Value and Encoded Word Extensions: Character
         Sets, Languages, and Continuations
 * 2231: MIME Parameter Value and Encoded Word Extensions: Character
         Sets, Languages, and Continuations (Obsoletes rfc2184)

Other RFCs of interest:

 * 1847: Security Multiparts for MIME: Multipart/Signed and 
         Multipart/Encrypted
 * 1872: The MIME Multipart/Related Content-type
 * 1927: Suggested Additional MIME Types for Associating Documents
 * 2015: MIME Security with Pretty Good Privacy (PGP)
 * 2311: S/MIME Version 2 Message Specification
 * 2312: S/MIME Version 2 Certificate Handling
 * 2387: The Multipart/Related Content-Type.
 * 2424: Content Duration MIME Header Definition
 * 2630: Cryptographic Message Syntax
 * 2632: S/MIME Version 3 Certificate Handling
 * 2633: S/MIME Version 3 Message Specification
 * 2634: Enhanced Security Services for S/MIME
 * 3156: MIME Security with OpenPGP (Updates rfc2015)

Cryptography related RFCs:

 * 2268: A Description of the RC2(r) Encryption Algorithm
 * 2313: PKCS #1: RSA Encryption
 * 2314: PKCS #10: Certification Request Syntax
 * 2315: PKCS #7: Cryptographic Message Syntax
 * 2631: Diffie-Hellman Key Agreement Method

Additional features:

error output redirection (uses glib for this; I don't know exactly how it
  works)
parses From, To, etc. lines into lists of email addresses, and splits each
  address into a name part and an address part
iconv integration: supports using iconv to convert character sets.  I
  haven't looked at this much
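
To show what the name/address split amounts to, here is a deliberately
naive sketch.  It is not GMime's parser and handles nowhere near the full
RFC 822 address grammar (no quoting, comments or groups); split_address()
is a hypothetical name.

```c
#include <assert.h>
#include <string.h>

/* Deliberately naive sketch (not GMime's parser, and nowhere near full
 * RFC 822): split "Display Name <user@host>" into name and address.
 * A bare "user@host" yields an empty name.  Returns 0 on success, -1
 * if the input does not fit this simple shape or a buffer is too small. */
static int split_address(const char *in, char *name, size_t namesz,
                         char *addr, size_t addrsz)
{
    assert(in != NULL && name != NULL && addr != NULL);

    const char *lt = strchr(in, '<');
    const char *gt = lt ? strchr(lt, '>') : NULL;

    if (lt == NULL) {                     /* bare address, no name part */
        if (namesz < 1 || strlen(in) >= addrsz)
            return -1;
        name[0] = '\0';
        strcpy(addr, in);
        return 0;
    }
    if (gt == NULL)
        return -1;                        /* '<' without closing '>' */

    size_t nlen = (size_t)(lt - in);
    while (nlen > 0 && in[nlen - 1] == ' ')
        nlen--;                           /* trim spaces before '<' */
    size_t alen = (size_t)(gt - lt - 1);

    if (nlen >= namesz || alen >= addrsz)
        return -1;
    memcpy(name, in, nlen);
    name[nlen] = '\0';
    memcpy(addr, lt + 1, alen);
    addr[alen] = '\0';
    return 0;
}
```

With the two parts separated, the name and the address could be tokenized
separately if that turns out to be useful.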

- Scott Lenser






