A Suggestion [was: multipart spam]

Bill McClain wmcclain at salamander.com
Thu Dec 9 21:09:32 CET 2004


On Sun, 14 Nov 2004 08:33:11 -0500
David Relson <relson at osagesoftware.com> wrote:

> A reasonable approach for testing such ideas would be a perl script
> (or python program) to separate a message into its parts, score them
> separately, and see what the result gives.  I'd suggest having the
> header be one part (scored using your usual bogofilter flags) and
> having each mime part be scored (using usual flags plus '-H').

The quoted suggestion is from last month. The topic was whether there is any value in
separate scoring for the parts of multipart messages. I wrote a Python
program to gather data on this back then and am finally reporting the
results. The attachment contains both the program and example output.

I consider only text/plain and text/html parts; should I look at
others? There is also a known flaw that reduces accuracy: unlike
bogofilter, I am not decoding base64 or quoted-printable parts. The
Python email module is supposed to do that, but I am having trouble
with the pertinent method.
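
For reference, the email module call that normally handles this is
get_payload(decode=True), which undoes the Content-Transfer-Encoding
and returns bytes. The following is a minimal sketch in current
Python 3, not the code from the attached program: decoded_text is a
hypothetical helper name, and the latin-1 fallback for a missing or
unknown charset is an assumption.

```python
import email


def decoded_text(part):
    """Return a leaf MIME part's body as text, with base64 or
    quoted-printable transfer encoding undone."""
    # decode=True honours Content-Transfer-Encoding and returns bytes;
    # it returns None for multipart containers.
    raw = part.get_payload(decode=True)
    if raw is None:
        return ""
    # Assumption: fall back to latin-1 when no charset is declared.
    charset = part.get_content_charset() or "latin-1"
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # The declared charset name is not known to Python.
        return raw.decode("latin-1", errors="replace")
```

The helper is meant to be called on each text/plain or text/html part
before the text is handed to the scorer.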

I look at messages that have two or more text parts and print a
line for each message showing spamicity for the whole message, for just
the header, and for each part.
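
The shape of that procedure can be sketched as below. This is an
illustration in current Python 3, not the attached program:
split_message and score_line are hypothetical names, the output
format is an assumption, and the scorer is passed in as a function
so that the bogofilter invocation (usual flags for the header, usual
flags plus '-H' for the parts, as David suggested) stays outside the
sketch. Like the program described above, it does not decode base64
or quoted-printable bodies.

```python
import email

TEXT_TYPES = ("text/plain", "text/html")


def split_message(source):
    """Split a raw message into (header_text, text_parts), where
    text_parts is a list of (content_type, body) tuples."""
    msg = email.message_from_string(source)
    header_text = "".join(f"{name}: {value}\n" for name, value in msg.items())
    parts = []
    for part in msg.walk():
        # Containers (multipart/*) and non-text parts are skipped.
        if part.get_content_type() in TEXT_TYPES:
            # Bodies are taken as-is, without transfer decoding.
            parts.append((part.get_content_type(), part.get_payload()))
    return header_text, parts


def score_line(source, score):
    """Return one report line: whole-message score, header score, then
    one score per text part. Returns None for messages with fewer than
    two text parts. `score` is any callable mapping text to a float."""
    header_text, parts = split_message(source)
    if len(parts) < 2:
        return None
    scores = [score(source), score(header_text)]
    scores += [score(body) for _, body in parts]
    return " ".join(f"{s:.6f}" for s in scores)
```

Keeping the scorer as a parameter makes it possible to try the
splitting logic with a stand-in function before wiring it to
bogofilter.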

I don't know what statistics would be valuable for analysis. Perhaps a
simple correlation of each type with the known classification?
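
One way to compute such a correlation is the Pearson coefficient
between a column of scores and a 0/1 label (1 for known spam, 0 for
known ham). The sketch below is only one possible choice of
statistic; the function name and the 0/1 coding of the
classification are assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences, e.g.
    header scores and 0/1 spam labels. Returns NaN when either
    sequence has no variance."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)
```

Calling it once per column (whole message, header, each part
position) against the same label list gives one number per score
type to compare.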

Visual inspection of the first group in the example output is revealing.
For known spam the header alone is an excellent surrogate for the whole
message, but the text parts are more variable.

As shown in the last group, the header is not such a good predictor of
total message spamicity when the original classification was "unsure",
even after corrective training. The parts are often more spammy than the
header.

-Bill
-- 
Sattre Press                            Curiosities of the Sky
http://sattre-press.com/                    by Garrett Serviss
info at sattre-press.com        http://sattre-press.com/csky.html
-------------- next part --------------
A non-text attachment was scrubbed...
Name: spamsplit.tar.gz
Type: application/x-gzip
Size: 3288 bytes
Desc: not available
URL: <http://www.bogofilter.org/pipermail/bogofilter/attachments/20041209/a06377aa/attachment.bin>

