decoding implementation

Sun Nov 24 22:48:04 CET 2002

On Sat, 23 Nov 2002, Gyepi SAM wrote:

> I am looking for suggestions on adding 
> base64 and quoted-printable decoding to bogofilter.
> 
> There are two issues I'd like to discuss:
> 
> 1. Data representation. Should we modify the
> [Content-Transfer-Encoding] headers of a message after it has been
> decoded (for consistency and truthfulness) or should we leave the
> headers alone (preserve information)

I have thought about that as well. Basically, we can offer three -p
modes: original, decoded and canonical. original would be whatever the
original mail was (yes I know this is a problem currently on servers
with low RAM that route big emails); decoded would be 8bit, original
character set; and canonical would be 8bit/utf-8 or something.

I'm not in favour of emitting data that has been re-encoded to another
character set by default in passthrough modes, although we need to
canonicalize the character set to make the token list match.

One other thing I can imagine though: How the heck can we treat Greek
omikron, Latin o (oh) and Cyrillic o the same? Three different
characters with the same shape -- we must not get fooled by spammers
abusing these to escape filtering (reference: international DNS, this
problem is under discussion there as well.)
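
One way to keep such look-alikes from splitting a token is to fold them onto one representative code point before the token is looked up. The following is only a sketch: the table, `fold_confusable()` and `fold_token()` are hypothetical names, not existing bogofilter code. It assumes the text has already been decoded to Unicode code points, and the table lists just a handful of pairs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table: a few look-alike code points and the Latin
 * letter they are folded to. A real table would be much larger. */
struct confusable {
    uint32_t from;
    uint32_t to;
};

static const struct confusable confusables[] = {
    { 0x03BF, 'o' }, /* GREEK SMALL LETTER OMICRON */
    { 0x043E, 'o' }, /* CYRILLIC SMALL LETTER O */
    { 0x039F, 'O' }, /* GREEK CAPITAL LETTER OMICRON */
    { 0x041E, 'O' }, /* CYRILLIC CAPITAL LETTER O */
    { 0x0430, 'a' }, /* CYRILLIC SMALL LETTER A */
    { 0x0435, 'e' }, /* CYRILLIC SMALL LETTER IE */
};

/* Return the folded form of cp, or cp itself if it is not in the table. */
static uint32_t fold_confusable(uint32_t cp)
{
    size_t i;

    for (i = 0; i < sizeof confusables / sizeof confusables[0]; i++) {
        if (confusables[i].from == cp)
            return confusables[i].to;
    }
    return cp;
}

/* Fold every code point of a token in place. */
static void fold_token(uint32_t *cps, size_t n)
{
    size_t i;

    assert(cps != NULL || n == 0);
    for (i = 0; i < n; i++)
        cps[i] = fold_confusable(cps[i]);
}
```

Folding like this would apply to the token list only, never to passthrough output, since it destroys legitimate Greek and Cyrillic text.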

> 2. Data flow. We need to decode the email without necessarily reading
> the entire email into memory. (I know -p does). The options include:
> 
>  a. decode data into a tmp file,rewind, and pass the filehandle to
>  lexer.c
>  
>  b. fork and use pipe() to connect std(in|out) of components.
>  conceivably, we'd have a pipeline equivalent to (cat
>  mail.txt|base64decode|qpdecode|bogofilter) 
>  
>  c. write small programs to implement the actual pipeline 
>
>  d. use coroutines
> 
> others?
> 
> case a: slow but reliable (have to be careful about file perms and
> race conditions)

I have a working and safe mkstemp replacement for leafnode, including an
above-average quality random number generator, arc4random, from BSD.
This would rid us of all race and symlink attacks even in /tmp. We could
use our own limited tmpdir for additional safety.
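
For illustration, here is roughly what the temp file approach looks like with the stock POSIX mkstemp(). This is not the leafnode replacement mentioned above, only a sketch; `open_scratch()` and the "bogo.XXXXXX" template are made-up names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: create an unnamed scratch file inside dir and
 * return it opened for reading and writing, or NULL on failure. */
static FILE *open_scratch(const char *dir)
{
    char path[4096];
    int n, fd;
    FILE *fp;

    assert(dir != NULL);
    n = snprintf(path, sizeof path, "%s/bogo.XXXXXX", dir);
    if (n < 0 || (size_t)n >= sizeof path)
        return NULL;            /* directory name too long */

    /* mkstemp() opens with O_EXCL, so an existing file or symlink
     * with the same name is never reused. */
    fd = mkstemp(path);
    if (fd < 0)
        return NULL;

    /* Remove the name right away; the data stays reachable through
     * the descriptor until it is closed. */
    unlink(path);

    fp = fdopen(fd, "w+");
    if (fp == NULL)
        close(fd);
    return fp;
}
```

The caller writes the decoded message to the returned stream, calls rewind() and hands the stream to the lexer. Passing a private directory instead of /tmp gives the "own limited tmpdir" variant.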

> case b: OK

Forking and pipelining safely is going to add some hundred-and-odd lines
of code. Look at leafnode's mailto.c, it's disgusting and ugly and awful
to maintain.
http://m2a2.yi.org/cgi-bin/viewcvs.cgi/leafnode-2/mailto.c?rev=1.6&sortby=date&content-type=text/vnd.viewcvs-markup

The problem with pipes is that you can't easily handle the "writer side
breaks" case, because it's indistinguishable from "writer has closed the
connection properly". You'd have to define a structured format to send
over the pipe, which impairs debugging: you'd need a program to write
and analyze this structured format.
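
To make concrete what such a structured format would involve, here is a sketch of one possible framing: every chunk is preceded by a two-byte big-endian length, and a zero-length frame marks a proper end. The format, `write_frame()` and `read_frame()` are all invented for this example; nothing like it exists in bogofilter.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

enum frame_status { FRAME_DATA, FRAME_END, FRAME_BROKEN };

/* Write one frame: 2-byte big-endian length, then the payload.
 * A frame with len == 0 is the explicit "writer finished" marker. */
static int write_frame(FILE *out, const void *data, uint16_t len)
{
    unsigned char hdr[2];

    assert(out != NULL);
    hdr[0] = (unsigned char)(len >> 8);
    hdr[1] = (unsigned char)(len & 0xff);
    if (fwrite(hdr, 1, 2, out) != 2)
        return -1;
    if (len > 0 && fwrite(data, 1, len, out) != len)
        return -1;
    return 0;
}

/* Read one frame into buf. EOF before the end marker means the
 * writer died, which a bare byte stream could not tell us. */
static enum frame_status read_frame(FILE *in, unsigned char *buf,
                                    size_t cap, size_t *len)
{
    unsigned char hdr[2];
    size_t n;

    if (fread(hdr, 1, 2, in) != 2)
        return FRAME_BROKEN;
    n = ((size_t)hdr[0] << 8) | hdr[1];
    if (n == 0)
        return FRAME_END;
    if (n > cap || fread(buf, 1, n, in) != n)
        return FRAME_BROKEN;
    *len = n;
    return FRAME_DATA;
}
```

Even this tiny format is no longer readable with cat or less, which is the debugging cost mentioned above.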

So: objection, because it's far too complex, even though it may perform
well. (It also has the nasty habit of hogging process table slots if we
put, say, 4 "stations" (programs) into the pipe.)

> case c: more unix-like and simpler extension of case b.

Indeed, but the basic problems of a pipe remain.

> case d: more elegant but harder.

I'm not sure I understood that suggestion. "coroutines"? Could you
define this suggestion?

As to the flow suggestion, I'd do the following: check if the stdin is
seekable or check if it's a regular file (*).

If and only if we know by our options we might need to rewind AND stdin
is not a regular file, then copy it to the temp file as we're reading
(like tee(1) would do), and on the second go, read from the temp file.

This approach avoids unnecessary copying.
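
A minimal sketch of this flow, with hypothetical helper names (`needs_tempfile()` and `read_and_tee()` are not existing bogofilter functions). Unlike the choke() variant in the footnote, this sketch treats a failed fstat() as "not a regular file", which is merely the conservative choice made here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

/* True if fp is not a regular file, i.e. we cannot count on being
 * able to rewind it for a second pass. */
static bool needs_tempfile(FILE *fp)
{
    struct stat s;

    assert(fp != NULL);
    if (fstat(fileno(fp), &s) != 0)
        return true;            /* unknown: assume the worst */
    return !S_ISREG(s.st_mode);
}

/* Read up to len bytes from in. If copy is not NULL, mirror the
 * bytes into it, as tee(1) would. Returns the number of bytes read,
 * or (size_t)-1 if writing the copy failed. */
static size_t read_and_tee(FILE *in, FILE *copy, char *buf, size_t len)
{
    size_t n = fread(buf, 1, len, in);

    if (copy != NULL && n > 0 && fwrite(buf, 1, n, copy) != n)
        return (size_t)-1;
    return n;
}
```

The first pass would call read_and_tee() with copy set to the temp file only when needs_tempfile(stdin) is true and the options require a second pass; the second pass then reads from the rewound copy, or from the rewound stdin if it was a regular file.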

-----
(*) along the lines of:
struct stat s;
if (fstat(fileno(stdin), &s)) choke();
if (!S_ISREG(s.st_mode)) want_tempfile = true;

-- 
Matthias Andree