Bogofilter and zmailer

Marek Kowal marek.kowal at portal.onet.pl
Wed May 21 23:31:27 CEST 2003


Hello,

I've been trying to integrate bogofilter with zmailer. Zmailer stores each
mail in a separate file, so running bogofilter in bulk mode lets you scan
the contents of a specific directory and store the resulting spamicity
information in separate files, which can later be merged into the original
mail at the delivery stage.

Such a setup is very efficient, as it does not require the message file to
be rewritten immediately to store an additional X-Bogosity header; instead,
the header is merged in when writing to the mailbox, where the message is
rewritten anyway.
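
To make the merge step concrete, here is a minimal sketch in Python. It
assumes the spamicity result sits in a sidecar file holding one ready-made
header line; that layout, the function name write_with_verdict and the
header value "Yes" are illustrative assumptions, not something zmailer or
bogofilter defines.

```python
import io


def write_with_verdict(message, verdict, mailbox):
    """Copy `message` to `mailbox`, writing the verdict header first.

    All three arguments are text file objects. `verdict` is assumed to
    hold a single header line such as "X-Bogosity: ..." (a hypothetical
    sidecar layout). An empty verdict file adds nothing.
    """
    header = verdict.readline().rstrip("\n")
    if header:
        mailbox.write(header + "\n")
    # The message is being rewritten into the mailbox anyway, so adding
    # the header costs one extra write and no extra pass over the file.
    for line in message:
        mailbox.write(line)


if __name__ == "__main__":
    msg = io.StringIO("Subject: hello\n\nBody_line1\n")
    verdict = io.StringIO("X-Bogosity: Yes\n")  # placeholder value
    out = io.StringIO()
    write_with_verdict(msg, verdict, out)
    print(out.getvalue(), end="")
```

The point of the sketch is only that the header can be added during the one
rewrite that delivery performs in any case.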

To explain my problem in more detail, let me begin with zmailer itself:
during processing, zmailer puts the envelope in the same file as the
message. Therefore the "canonical" form of the file is:

Envelope_information_line1
Envelope_information_line2
env-end
Header_line1
Header_line2
Header_line3

Body_line1
Body_line2
Body_line3

As you can see, the '^env-end$' line signals the end of the envelope and the
beginning of the message. New, rewritten headers (e.g. new "Received: ...."
headers) are written at a later stage (this is where I merge the X-Bogosity
information back).
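
The layout above can be sketched as a small Python splitter. It relies only
on the two separators described here (the 'env-end' line and the first
blank line after the headers); the function name split_queue_file is made
up for illustration and is not part of zmailer.

```python
def split_queue_file(lines, marker="env-end"):
    """Split the lines of a queue file into (envelope, headers, body).

    `lines` is any iterable of text lines. The first line equal to
    `marker` ends the envelope, and the first empty line after it ends
    the headers. If the marker never appears, every line is returned as
    envelope and the other two lists stay empty.
    """
    envelope, headers, body = [], [], []
    section = envelope
    for raw in lines:
        line = raw.rstrip("\n")
        if section is envelope and line == marker:
            section = headers  # separator line itself is dropped
            continue
        if section is headers and line == "":
            section = body  # blank line between headers and body
            continue
        section.append(line)
    return envelope, headers, body


if __name__ == "__main__":
    sample = (
        "Envelope_information_line1\n"
        "Envelope_information_line2\n"
        "env-end\n"
        "Header_line1\n"
        "\n"
        "Body_line1\n"
    )
    print(split_queue_file(sample.splitlines(keepends=True)))
```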

Now this additional envelope information gives me a lot of headaches. If I
run bogofilter on the whole file (with the envelope information in), it
happens quite often that bogofilter lets the message "pass" to the
recipient marked as ham, even though it was evidently spammy. If I remove
the envelope part of the message, bogofilter usually recognizes the message
as spam (correctly).

Since I am really keen on the speed of message processing, I do not want
to prepare an envelope-less copy of the message for bogofilter to parse.
This would eat up additional memory and consume additional I/O and CPU
cycles. So the best idea would be to let bogofilter decide what to parse
and what to skip, since it reads the whole file anyway. The ideal solution
would be an additional option that means: skip the start of the file until
you see the following "line", then start processing the message as if it
began right there. Obviously, this would only make sense in bulk mode.
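
The behaviour I have in mind can be sketched outside bogofilter in a few
lines of Python. This is not bogofilter code and no such option exists
there as far as I know; the function name message_offset and the choice to
return 0 when the marker is missing are my own assumptions.

```python
import io


def message_offset(stream, marker=b"env-end"):
    """Return the byte offset just past the first line equal to `marker`.

    `stream` is a binary file object positioned at the start of the
    file. If no line matches, 0 is returned so that the whole input is
    treated as the message (an assumed fallback).
    """
    offset = 0
    for line in stream:
        offset += len(line)
        if line.rstrip(b"\r\n") == marker:
            return offset
    return 0


if __name__ == "__main__":
    data = b"Envelope_information_line1\nenv-end\nHeader_line1\n\nBody_line1\n"
    start = message_offset(io.BytesIO(data))
    # Everything from `start` onwards is the message proper.
    print(data[start:].decode())
```

The envelope lines are still read once, but no second copy of the message is
written: the consumer simply starts at the returned offset.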

Would you agree to such an extension? I tried to locate the appropriate
code in the sources, but failed to understand the exact processing "scheme"
of bogofilter - is the input file read during the YY_INPUT macro execution?
Would this be the appropriate place to put this additional conditional
code? Tricky...

Hoping to hear from you.

Cheers,
Marek



