Filtering and ignorelists

Fri Mar 5 11:04:38 CET 2004

Jozef Hitzinger <hitzinger at phobos.fphil.uniba.sk> writes:
> With all the ignore/filter discussion, let me sum it up a bit.
> 
> The e-mail, as it's received, has two main parts, headers & body, both
> of which are created as a result of a quite complicated process, involving
> humans, MUAs, MTAs, mailing lists, etc. In the end we basically have two
> categories:
> 
> 1. stuff that carries the message
> 2. stuff that carries info about where it came from
> 
> If I want to do a statistical analysis of the message, and decide whether
> it is spam or not, I'd choose to drop all info in the second category, and
> only rule on the informational part.

That would be crazy. Sure, you could still pick something up if you
chopped off your left arm, but it would be easier if you didn't!

The information about how it got to you is very, very important
as a hint about whether you actually wanted to receive it or not.

If _none_ of your normal email comes from yahoo.com, then an
email with 'From: xxx at yahoo.com', but Received: headers
that indicate an AOL dialup would be a rather large hint.
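
To make that concrete, here is a rough sketch of how header words can be
kept apart from body words by tagging them with where they came from.
This is not bogofilter's actual lexer. Only the rcvd: prefix comes from
this thread; the other prefixes, the sample message and the function name
header_tokens are made up for illustration.

```python
import re
from email import message_from_string

# Hypothetical sample message: a yahoo.com sender relayed via an AOL dialup.
RAW = """\
Received: from dialup-1234.aol.com (dialup-1234.aol.com [192.0.2.7])
\tby mx.example.org; Fri, 5 Mar 2004 10:58:02 +0100
From: xxx@yahoo.com
Subject: Cheap meds
X-Mailer: megaMailBlaster

Buy now
"""

# Assumed prefix table; any header not listed falls back to "head".
PREFIXES = {"received": "rcvd", "from": "from", "subject": "subj"}

def header_tokens(raw):
    """Return the header words of a message, each tagged with its origin."""
    msg = message_from_string(raw)
    tokens = []
    for name, value in msg.items():
        prefix = PREFIXES.get(name.lower(), "head")
        # Words start with a letter or digit and may contain dots and dashes.
        for word in re.findall(r"[A-Za-z0-9][A-Za-z0-9.\-]*", value):
            tokens.append(prefix + ":" + word.strip(".-").lower())
    return tokens

if __name__ == "__main__":
    for token in header_tokens(RAW):
        print(token)
```

With tokens like these, from:yahoo.com and rcvd:dialup-1234.aol.com end up
as separate entries in the wordlist, so the statistics can learn that the
combination is unusual for your mail. Strip the headers and that evidence
is simply gone.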

> I think that the complaints about rcvd:gnu.org and rcvd:Feb and similar
> tokens support this assumption.

I rather think they say that if you do a bad job of training bogofilter,
then you'll get silly results. Not too surprising really. :)

> Of course, there is also an approach to spam that uses the info of where it
> comes from to rule on its spamminess, but that's closer to what CRM114
> does. I don't see that kind of info helping the bogofilter's statistics.

It's what any decent spam filter does. Why would you throw away all
that information??

> Then there are pure cat.2 headers, like Received:, X-Mailer, etc - almost
> all of them except the Subject.

X-Mailer: megaMailBlaster

Do you think that might be a hint about whether the email is spam or not? :)

> step 1. we could filter only on Subject: header and body (would solve the
> rcvd:gnu.org and rcvd:Feb problems) .. Critical part is that both training
> and filtering must be done on stripped messages = db must be rebuilt from
> stored messages.
> 
> step 2. who wants to play with it further, can add not token-based but
> line-based filter, which would discard the well-known _lines_ from the
> messages

I have to ask: Why are you using a spam filter?

If the purpose is to filter spam, you really want to give the program
the best possible environment to operate in. And that means giving it
all the information you have about the email, and let the algorithm
decide what's important and what's not.
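
Here is roughly what "let the algorithm decide" looks like, as a small
sketch of a Robinson-style smoothed token score. It is an illustration
only, not bogofilter's code: the function name token_score, the parameter
names and the default values s=1.0 and x=0.5 are my assumptions.

```python
def token_score(spam_count, ham_count, n_spam, n_ham, s=1.0, x=0.5):
    """Smoothed spamminess of one token, between 0 (ham) and 1 (spam).

    spam_count, ham_count: how often the token was seen in each corpus
    n_spam, n_ham: number of trained spam and ham messages (both > 0)
    s: weight given to the prior; x: score assumed for an unseen token
    """
    n = spam_count + ham_count
    if n == 0:
        return x  # never seen: fall back to the neutral prior
    spam_rate = spam_count / n_spam
    ham_rate = ham_count / n_ham
    p = spam_rate / (spam_rate + ham_rate)
    # Pull p towards x when there is little data, towards p when there is a lot.
    return (s * x + n * p) / (s + n)

if __name__ == "__main__":
    # A date word seen equally often in both corpora stays neutral.
    print(token_score(50, 50, 1000, 1000))
    # A mailer name seen only in spam scores close to 1.
    print(token_score(40, 0, 1000, 1000))
```

A token like rcvd:Feb that turns up just as often in ham as in spam lands
at 0.5 and carries no weight, so there is nothing to strip by hand. It only
becomes "silly" when the training set is lopsided, which is a training
problem and not a reason to throw the headers away.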

> step 3. possible token-based ignore list for the Subject prefixes ..
> 
> Of course, all this as options; whoever wants to use bogofilter as it is
> now should be able to. One more thing to ponder: the proposed step 1 - drop
> all headers except Subject - is easy to implement. I don't believe steps 2
> and 3 are worth the effort.

If someone is crazy enough to implement this, I forecast an endless
stream of newbies asking "Why is bogofilter performing so bad?"

Why would you deliberately add an option that makes it perform
worse??

Michael.