Filtering and ignorelists

Jozef Hitzinger hitzinger at phobos.fphil.uniba.sk
Fri Mar 5 11:56:06 CET 2004


Hi Michael,

On the one hand you're right, but on the other... Let me explain.

> > If I want to do a statistical analysis of the message, and decide whether
> > it is spam or not, I'd choose to drop all info in the second category, and
> > only rule on the informational part.
>
> That would be crazy. Sure, could pick something up if you chopped
> off your left arm, but it would be easier if you didn't!
>
> The information about how it got to you is very very important
> in hinting if you actually wanted to receive it or not.

What you say _is_ true if you want to filter spam. But if you happen to
work on statistical spam filtering, it's no longer true. By considering
info on where a message came from, you effectively sort the messages
into "buckets", and then you need to collect enough spam/ham for _each_
bucket for bogofilter to be effective. These "buckets" are, for example,
"rcvd:gnu.org" and "rcvd:Feb"; they are mixed in various ways, but the
bottom line is that they break the statistical randomness.
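
To make the idea concrete, here is a minimal sketch of the kind of
tokenizing I have in mind: keep only the Subject and the body text, and
drop every header that describes how the message travelled. This is an
illustration only, not bogofilter's actual lexer; the function name
content_tokens, the KEPT_HEADERS tuple and the simple word pattern are
all assumptions made for the example.

```python
import email
import re

# Assumption for this sketch: Subject is the only header treated as content.
KEPT_HEADERS = ("subject",)

# Deliberately simple word pattern; a real lexer is far more careful.
WORD_RE = re.compile(r"[a-z0-9]+")


def content_tokens(raw_message: str) -> list[str]:
    """Return tokens from the Subject and text parts, ignoring routing headers."""
    msg = email.message_from_string(raw_message)

    # Header lookup is case-insensitive; missing headers yield "".
    pieces = [msg.get(name, "") for name in KEPT_HEADERS]

    # walk() yields the message itself and, for multipart mail, every subpart.
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            payload = part.get_payload()
            if isinstance(payload, str):
                pieces.append(payload)

    return WORD_RE.findall("\n".join(pieces).lower())


if __name__ == "__main__":
    raw = (
        "Received: from mail.gnu.org by example.org; Thu, 5 Feb 2004\n"
        "From: someone@yahoo.com\n"
        "Subject: Cheap pills\n"
        "\n"
        "Buy now\n"
    )
    print(content_tokens(raw))
```

With this, tokens such as "rcvd:gnu.org" or "rcvd:Feb" can never be
produced, because the Received: and From: lines are never looked at.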

> If _none_ of your normal email comes from yahoo.com, then an
> email with 'From: xxx at yahoo.com', but Received: headers
> that indicate an AOL dialup would be a rather large hint.

... and then _one_ of your normal emails comes from yahoo.com, and you're
at great risk of a false positive. It works in the other direction, too:
if you get only a few spams from some source, the occasional spam is
marked unsure/ham because of the hammish nature of that source. That's
what I'm trying to explain.
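
A small worked example of the sparse-bucket problem, using a plain
spam/(spam+ham) ratio per token. This is a simplification chosen for the
illustration, not the formula bogofilter actually uses, and all the
counts are made up.

```python
def spam_ratio(spam_count: int, ham_count: int) -> float:
    """Naive per-token spam estimate: fraction of sightings that were spam."""
    total = spam_count + ham_count
    if total == 0:
        return 0.5  # no evidence either way
    return spam_count / total


if __name__ == "__main__":
    # Hypothetical training counts as (spam, ham) pairs.
    well_sampled = spam_ratio(40, 360)   # a common body word
    sparse_before = spam_ratio(12, 0)    # "from:yahoo.com", never seen in ham
    sparse_after = spam_ratio(12, 1)     # after registering one legitimate mail

    print(f"well sampled token: {well_sampled:.3f}")
    print(f"sparse token:       {sparse_before:.3f}")
    print(f"sparse token later: {sparse_after:.3f}")
```

The sparse token sits at the extreme end of the scale on very little
evidence, and a single legitimate message barely moves it, which is the
false-positive risk described above.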

> > I think that the complaints about rcvd:gnu.org and rcvd:Feb and similar
> > tokens support this assumption.
>
> I rather think they say that if you do a bad job of training bogofilter,
> then you'll get silly results. Not too surprising really. :)

Actually, quite the opposite. Despite the poor training (messages with
info about where they came from are no longer statistically random),
bogofilter is so robust that it still performs quite well (you're
satisfied), but some people hit the limits (those complaining, as
mentioned above).

> Why would you throw away all that information??

I'd throw away info that breaks the statistics.

> X-Mailer: megaMailBlaster
> Do you think that might be a hint about if the email is spam or not? :)

Do you think it'd be such a loss to discard it? I bet that if the spammer
is not able to disguise his mailer, he won't be able to write the message
cleverly enough to escape the subject+body check.

Michael, I've skipped some of your questions, as I want this discussion
to focus on facts. There _are_ people who have problems with the current
bogofilter. The current training _is_not_ theoretically OK, so please
don't throw a modified version off the table so fast, will you? If it
proves bad, I'll be the first to rip it out.

BTW, earlier I forgot to mention one side effect of this header tossing:
you no longer have problems with registering mails forwarded by your
users, since all the non-original To: and From: headers are discarded
anyway. Just a small bonus provided by this approach.

Thanks for your time,
-- 
jozef  :-)  http://hico.fphil.uniba.sk




