Bogofilter and zmailer

Thu May 22 00:37:06 CEST 2003

At 05:31 PM 5/21/03, Marek Kowal wrote:
>Hello,
>
>I've been trying to integrate bogofilter with zmailer. Zmailer stores each
>mail in a separate file, so running bogofilter in bulk mode lets you scan
>the contents of a specific directory, storing the resulting spamicity
>information in separate files, which can later be merged into the original
>mail at the delivery stage.
>
>Such a setup is very effective, as it does not require the message file to
>be rewritten immediately to store the additional X-Bogosity header; instead,
>the header is merged when writing into the mailbox, where the message is
>rewritten anyway.
>
>To explain my problem in more detail, let me begin with the zmailer itself:
>during processing, zmailer puts the envelope in the same file as the
>message. Therefore the "canonical" form of the file is:
>
>Envelope_information_line1
>Envelope_information_line2
>env-end
>Header_line1
>Header_line2
>Header_line3
>
>Body_line1
>Body_line2
>Body_line3
>
>As you can see, the '^env-end$' line signals the end of the envelope and the
>beginning of the message. New, rewritten headers (e.g. new "Received: ....")
>are written at a later stage (this is where I merge the X-Bogosity
>information back).
>
>Now this additional envelope information gives me a lot of headaches. If I
>run bogofilter on the whole file (with the envelope information included),
>it quite often lets the message "pass" to the recipient marked as ham, even
>though it was evidently spam. If I remove the envelope part of the message,
>bogofilter usually recognizes the message as spam (correctly).
>
>Since I am really keen on the speed of message processing, I do not want to
>prepare an envelope-less message for bogofilter to parse. This would eat up
>additional memory and consume additional I/O and CPU cycles. So the best
>idea would be to let bogofilter decide what to parse and what to skip, since
>it reads the whole file anyway. The ideal solution would be an additional
>option meaning: skip part of the message until you see the following "line",
>then start processing the message as if it started right there. Obviously,
>this would only make sense in bulk mode.
>
>Would you agree to such an extension? I was trying to locate the appropriate
>code in the sources, but failed to understand the exact processing "scheme"
>of bogofilter - is the input file read during the YY_INPUT macro execution?
>Would this be the appropriate place to put this additional conditional
>code? Tricky...

Hello Marek,

I'm not too familiar with zmailer, so I won't commit to adding support to 
bogofilter.  Matthias is intimately familiar with all the mailers (as best 
I can tell) and he'll advise whether supporting it is a good idea or not.

Bogofilter defines YY_INPUT to call function yyinput() in lexer.c.  The 
flex scanner calls yyinput() to get a line of text.  The input process is 
several layers deep, for example to decode quoted-printable text.  Anyhow, 
yyinput() is the focal point for what you want.

FWIW, I wrote a few lines of code that do approx what you want.  They 
compile and may actually work.  As I haven't actually tested the code, I 
don't know for certain.  Anyhow, I have attached the patch as it will help 
you on your way.
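
For reference, here is a standalone sketch of the skip-until-marker idea,
independent of bogofilter's internals. It is only an illustration under
stated assumptions: it reads with plain fgets() rather than bogofilter's
decoding layers, and the names env_reader, env_reader_init() and
env_reader_next() are hypothetical and do not exist in bogofilter. Two points
it tries to make explicit: the "still skipping" state has to persist across
calls, and the reader has to keep looping while it is discarding envelope
lines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical reader state; none of these names exist in bogofilter. */
typedef struct {
    FILE *fp;
    const char *marker;  /* e.g. "env-end"; NULL or "" disables skipping */
    bool skipping;       /* true until the marker line has been consumed */
} env_reader;

void env_reader_init(env_reader *r, FILE *fp, const char *marker)
{
    assert(r != NULL && fp != NULL);
    r->fp = fp;
    r->marker = marker;
    r->skipping = (marker != NULL && marker[0] != '\0');
}

/* True if line equals marker exactly, ignoring a trailing "\n" or "\r\n". */
static bool is_marker_line(const char *line, const char *marker)
{
    size_t len = strcspn(line, "\r\n");
    return len == strlen(marker) && memcmp(line, marker, len) == 0;
}

/*
 * Return the next line that follows the envelope, or NULL at end of input.
 * If the marker never appears, the whole input is discarded and NULL is
 * returned. Lines longer than size - 1 bytes are split by fgets(), so buf
 * should be comfortably larger than the marker.
 */
char *env_reader_next(env_reader *r, char *buf, int size)
{
    while (fgets(buf, size, r->fp) != NULL) {
        if (!r->skipping)
            return buf;
        if (is_marker_line(buf, r->marker))
            r->skipping = false;  /* drop the marker line itself too */
    }
    return NULL;
}
```

A caller would initialise the reader once per message file and then call
env_reader_next() until it returns NULL; the first line handed back is the
first header line after 'env-end'.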

Let me know how it goes!

David
-------------- next part --------------
Index: lexer.c
===================================================================
RCS file: /cvsroot/bogofilter/bogofilter/src/lexer.c,v
retrieving revision 1.29
diff -u -r1.29 lexer.c
--- lexer.c	19 May 2003 21:24:47 -0000	1.29
+++ lexer.c	21 May 2003 22:29:27 -0000
@@ -233,11 +233,14 @@
     msg_header = true;
 }
 
+char *env_end = "";
+
 int yyinput(byte *buf, size_t max_size)
 /* input getter for the scanner */
 {
     int i, count = 0;
     buff_t buff;
+    size_t env_end_len;
 
     bool done = false;
 
@@ -250,9 +253,17 @@
      * the flex lexer.
      */
 
+    env_end_len = (env_end == NULL) ? 0 : strlen(env_end);
+	
     while (!done) {
 	done = true;
 	count += get_decoded_line(&buff);
+
+	if (env_end_len != 0) {
+	    if (memcmp(buff.t.text, env_end, env_end_len) == 0)
+		env_end_len = 0;
+	    continue;
+	}
 
 	while (count > (MAXTOKENLEN * 1.5)  && check_alphanum(buff.t.text, count)) {
 	    done = false;