Learning Backscatter

Pavel Kankovsky peak at argo.troja.mff.cuni.cz
Sat Jan 10 21:42:16 CET 2009


On Sat, 10 Jan 2009, David Relson wrote:

> Bogofilter processes various types of attachments. [...]
> So, the attached original email will be scanned and scored.

There are many forms of bounces and there are many problematic cases:

1. The MIME part containing the original message has the wrong type (e.g.
text/plain rather than message/rfc822), or the original message is not
included as a MIME part at all but embedded in the bounce text. This is
tricky: some messages might be tokenized more or less correctly, while
others (e.g. Base64-encoded ones) might produce too few usable tokens. As
far as I can tell, one third to one half of all bounces fall into this
category.

I decided to preprocess bounces with a script that extracts the
original message (or as much of it as possible) before letting Bogofilter
look at it. The script is rather complex because I had to teach it
dozens of different formats.

2. The bounce does not contain any usable part of the original message
(nothing but the original envelope sender in the most extreme form; yes,
you can see such bounces in the wild!). This case is hopeless as far
as Bogofilter is concerned, for obvious reasons. It accounts for several
percent of all bounces.
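
To illustrate the preprocessing idea from case 1, here is a minimal Python
sketch. It is not the actual script mentioned above, which handles many more
formats. The names extract_original and _EMBEDDED_HEADER are hypothetical,
and the fallback for mistyped or inline originals is only a guess at one
common layout (an embedded header block starting with Return-Path: or
Received:).

```python
import email
import re
from email import policy
from typing import Optional

# Assumption: an inline original starts at a Return-Path: or Received: header.
# Real bounces use many other layouts that this pattern will miss.
_EMBEDDED_HEADER = re.compile(rb"^(?:Return-Path|Received):[ \t]", re.I | re.M)


def extract_original(raw_bounce: bytes) -> Optional[bytes]:
    """Return the original message found inside a bounce, or None."""
    msg = email.message_from_bytes(raw_bounce, policy=policy.default)
    parts = list(msg.walk())

    # First pass: correctly typed MIME parts.
    for part in parts:
        ctype = part.get_content_type()
        if ctype == "message/rfc822":
            payload = part.get_payload()
            # A parsed message/rfc822 part holds a one-element list.
            if isinstance(payload, list) and payload:
                return payload[0].as_bytes()
        elif ctype == "text/rfc822-headers":
            # Headers only, but still better than nothing.
            return part.get_payload(decode=True)

    # Second pass: original pasted into a text/plain part (heuristic).
    for part in parts:
        if part.get_content_type() != "text/plain":
            continue
        body = part.get_payload(decode=True) or b""
        match = _EMBEDDED_HEADER.search(body)
        if match:
            return body[match.start():]

    # Case 2: nothing usable in the bounce.
    return None
```

The result, when there is one, can then be fed to Bogofilter instead of the
bounce itself. A None result corresponds to case 2, where there is nothing
left to score.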

-- 
Pavel Kankovsky aka Peak                          / Jeremiah 9:21        \
"For death is come up into our MS Windows(tm)..." \ 21th century edition /



