Filtering and ignorelists

Jozef Hitzinger hitzinger at phobos.fphil.uniba.sk
Fri Mar 5 09:04:49 CET 2004


With all the ignore/filter discussion, let me sum it up a bit.

The e-mail, as it's received, has two main parts, headers and body. Both
are the result of a quite complicated process involving humans, MUAs,
MTAs, mailing lists, etc. In the end we basically have two categories:

1. stuff that carries the message
2. stuff that carries info about where it came from

If I want to do a statistical analysis of the message, and decide whether
it is spam or not, I'd choose to drop all info in the second category, and
only rule on the informational part.

I think that the complaints about rcvd:gnu.org and rcvd:Feb and similar
tokens support this assumption.

Of course, there is also an approach to spam that uses the info about
where a message comes from to rule on its spamminess, but that's closer to
what CRM114 does. I don't see that kind of info helping bogofilter's
statistics.


Let's go back to the e-mail message. The Subject: header belongs mostly to
cat.1. The exceptions are the [MAILLIST], [**SPAM**], etc. tags sometimes
attached to it.

Then there are pure cat.2 headers, like Received:, X-Mailer:, etc. That
covers almost all of them except Subject:.

The To:, From: and Return-Path: headers don't fit into either category,
but there are good reasons to drop them.

The body is almost perfectly cat.1; the only cat.2 parts are the small
adverts appended at the bottom by some servers and mailing lists. So:


step 1. we could filter only on the Subject: header and the body (this
would solve the rcvd:gnu.org and rcvd:Feb problems). The critical part is
that both training and filtering must be done on stripped messages, so the
db must be rebuilt from stored messages.

step 2. whoever wants to play with it further can add a filter that is not
token-based but line-based, which would discard well-known _lines_ from
the messages.

step 3. possibly a token-based ignore list for the Subject: prefixes.
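
To make step 1 concrete, here is a rough sketch in Python using the
standard email module. It is only an illustration of the idea, not
bogofilter code: the function name strip_to_subject_and_body is made up,
and it assumes the stripped text would be fed to both training and
filtering in place of the raw message.

```python
import email
from email import policy


def strip_to_subject_and_body(raw_message: str) -> str:
    """Keep only the Subject: header and the text body (hypothetical helper)."""
    msg = email.message_from_string(raw_message, policy=policy.default)

    # Everything except Subject: (Received:, X-Mailer:, To:, From:, ...) is dropped.
    subject = msg.get("Subject", "")

    # Prefer a plain-text body; fall back to HTML if that is all there is.
    part = msg.get_body(preferencelist=("plain", "html"))
    body = part.get_content() if part is not None else ""

    return f"Subject: {subject}\n\n{body}"
```

The same function would have to be applied to every stored message when
the db is rebuilt, otherwise the old rcvd: tokens would stay in it.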

Of course, all of this should be optional; whoever wants to use bogofilter
as it is now should be able to. One more thing to ponder: the proposed
step 1 (drop all headers except Subject:) is easy to implement. I don't
believe steps 2 and 3 are worth the effort.
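
For illustration only, steps 2 and 3 as switchable options could look
roughly like the Python sketch below. All names here are hypothetical, the
entry in KNOWN_LINES is an invented example, and the two Subject tags are
just the ones mentioned above. Nothing in it reflects actual bogofilter
options.

```python
# Hypothetical lists; a real setup would load these from user config.
KNOWN_LINES = {"Bogofilter mailing list"}
IGNORED_SUBJECT_TAGS = {"[MAILLIST]", "[**SPAM**]"}


def drop_known_lines(body, known_lines=KNOWN_LINES):
    """Step 2: discard whole lines that exactly match a well-known line."""
    kept = [line for line in body.splitlines() if line.strip() not in known_lines]
    return "\n".join(kept)


def strip_subject_tags(subject, ignored=IGNORED_SUBJECT_TAGS):
    """Step 3: drop ignored whitespace-separated tokens from the Subject text."""
    return " ".join(tok for tok in subject.split() if tok not in ignored)


def preprocess(subject, body, use_line_filter=False, use_subject_ignore=False):
    """Apply steps 2 and 3 only when asked; by default nothing changes."""
    if use_subject_ignore:
        subject = strip_subject_tags(subject)
    if use_line_filter:
        body = drop_known_lines(body)
    return subject, body
```

With both switches left off, the input passes through untouched, which is
the "use it as it is now" case.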


Just my two cents.
-- 
jozef  :-)  http://hico.fphil.uniba.sk



