Andersons wrapper...
Tom Allison
tom at tacocat.net
Mon Jun 11 12:02:54 CEST 2007
You're the guy!!!
I forgot your name, but not the project.
But I got an idea that is a spin-off from something you were doing with
IP addresses and domains.
I can't remember what it was called, but you created a wrapper job
that would add some information to the email about who owned a domain
or subnet.
What/How was that? It was over a year ago and I don't keep mailing
list data that long.
At the time I did some testing and found it had marginal improvement
for me.
And as a result of that I might have pissed you off because I may
have come back with a trivialization of your work.
Sorry if I did. I realize now you may have hit on something pretty
significant and I want to thank you for it.
I recently spun off and started my own spam filter based on
bogofilter but something that can run on a per-user basis as a
Postfix content_filter like amavisd.
The value for me is that I can bypass procmail if I want to and even
use this at a proxy mail server located somewhere else. I actually
do the scoring over the internet to a remote machine; this lets me
run the same criteria on primary and secondary MX machines.
It's proving itself useful and sufficiently fast, typically processing
an email in < 1 second.
But that's not where you come in. Not just yet.
I wrote a similar thing for SpamAssassin which does per-user
bayesian statistics and found SA was painfully slow, relatively
inaccurate, and prone to a systemic problem in design.
SpamAssassin still relies on a combination of static rules, point
assignments for those rules, and a continuous stream of static rule
updates. Between the three, there is a lot of maintenance, tuning,
and guesswork. Way too much work for me. I am pretty sure I
discarded this work as I've no interest in supporting something that
takes too many resources to keep running.
But there is an idea that came out of SA that I thought made more
sense. In some cases there are, or may be, static rules that are
important considerations in the classification of an email. After
all, SA still has some effect even without Bayes tokens involved (or
much considered). But what if the static rules were added as Meta-
Data tokens to the bayesian statistical "engine" in the same manner
that they might be an X-Header or some other text? You started doing
this as X-Headers so that bogofilter would see it as a token in the
header. I think you were on to something.
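A minimal sketch of that kind of wrapper, in Python rather than whatever the original was written in. The rule names, the X-Meta-Rule header name, and the NAME:YES/NO value format are all made up for illustration; how bogofilter actually splits such a header into tokens is not checked here.

```python
from email import message_from_string
from email.message import Message
from typing import Callable, Dict

# A rule takes the parsed message and answers True (hit) or False (no hit).
Rule = Callable[[Message], bool]

# Hypothetical example rules; add as many as you can think of.
RULES: Dict[str, Rule] = {
    "MISSING_DATE": lambda msg: msg["Date"] is None,
    "SUBJECT_ALL_CAPS": lambda msg: (msg["Subject"] or "").isupper(),
}

def tag_message(raw: str, rules: Dict[str, Rule] = RULES) -> str:
    """Return the message with one X-Meta-Rule header added per rule."""
    msg = message_from_string(raw)
    for name, rule in rules.items():
        verdict = "YES" if rule(msg) else "NO"
        # Assigning to the same header name appends another header line.
        msg["X-Meta-Rule"] = name + ":" + verdict
    return msg.as_string()
```

The tagged text would then be piped into the filter in place of the original message, so that the rule results get trained and scored like any other header text.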
The advantage of adding meta data to the email this way is that you
essentially keep using the static rules of SA but without any
of the guesswork on what's valuable (how many points to assign for a
Hit) or even what to use (dumb rules end up within the +/- deviation
from 0.5). Then you can add all the rules you can think of and let
the statistics decide if the rules have any value for you (as a user)
or your server (common word list). It removes all the guesswork and
maintenance that SA suffers from but might permit better response to
new approaches to writing spam. It also provides a means of
spaghetti testing (throw everything at the wall and see what sticks)
which might be useful in a community development environment.
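To make the "deviation from 0.5" idea concrete, here is a rough sketch that sorts rule tokens by how far they sit from 0.5. It uses a plain spam/ham frequency ratio, which is an assumption for illustration and not bogofilter's actual calculation; the counts dictionary, the function names, and the 0.1 threshold are likewise invented.

```python
from typing import Dict, List, Tuple

def token_spamicity(spam_hits: int, ham_hits: int,
                    spam_msgs: int, ham_msgs: int) -> float:
    """Share of the token's relative frequency that comes from spam."""
    spam_freq = spam_hits / spam_msgs if spam_msgs else 0.0
    ham_freq = ham_hits / ham_msgs if ham_msgs else 0.0
    total = spam_freq + ham_freq
    # A token never seen at all is treated as neutral.
    return 0.5 if total == 0 else spam_freq / total

def useful_rules(counts: Dict[str, Tuple[int, int]],
                 spam_msgs: int, ham_msgs: int,
                 min_dev: float = 0.1) -> List[Tuple[str, float]]:
    """Keep rule tokens whose score is at least min_dev away from 0.5."""
    kept = []
    for token, (spam_hits, ham_hits) in counts.items():
        p = token_spamicity(spam_hits, ham_hits, spam_msgs, ham_msgs)
        if abs(p - 0.5) >= min_dev:
            kept.append((token, p))
    # Most decisive rules first.
    return sorted(kept, key=lambda item: abs(item[1] - 0.5), reverse=True)
```

A rule that fires equally often in spam and ham scores 0.5 and drops out of the list, which is the "dumb rule" case: it costs nothing to leave in, and nothing to remove.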
For example, image spam seems to match the pattern
/<body[^\>]*\>\s+<IMG/smi. If you wrote a Meta-Data rule that simply stated:
IMAGE_SPAM_PATTERN:YES or IMAGE_SPAM_PATTERN:NO then this string is
provided as a token and subsequent corrections would quickly
establish the value of this rule. Filtering your wordlist for the
meta data tags would tell you if it's making a valuable contribution.
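The image spam rule could look roughly like this in Python. The regex is the one quoted above, translated from Perl's /smi flags; the function name is an invention, and getting the HTML body out of the MIME structure is left out.

```python
import re

# Perl /smi maps to DOTALL, MULTILINE and IGNORECASE.
IMAGE_SPAM_RE = re.compile(
    r"<body[^>]*>\s+<IMG",
    re.IGNORECASE | re.DOTALL | re.MULTILINE,
)

def image_spam_token(html_body: str) -> str:
    """Return the meta-data token for the image spam pattern."""
    hit = IMAGE_SPAM_RE.search(html_body) is not None
    return "IMAGE_SPAM_PATTERN:" + ("YES" if hit else "NO")
```

Note that \s+ requires at least one whitespace character between the body tag and the image tag, so "<body><img ..." with nothing in between comes back as NO.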
As another example, I store the path taken in the Received headers as
a single string (eg: hdr:Received:sc8-sf-mx2-b.sourceforge.net
([10.3.1.92]helo=mail.sourceforge.net):sc8-sf-list1-
new.sourceforge.net) and use that for enhancing the scoring. It
establishes a path of delivery points through the mail processing and
determines which ones are good/bad.
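A simplified sketch of the Received path idea follows. It only pulls the "by" host out of each Received header and joins them with colons, so it does not reproduce the exact string format shown above (which also carries the helo details); the regex and function name are assumptions.

```python
import re
from email import message_from_string

# Host name following "by" in a Received header; stops at space, ';' or parens.
BY_HOST_RE = re.compile(r"\bby\s+([^\s;()]+)", re.IGNORECASE)

def received_path_token(raw: str) -> str:
    """Collapse the Received chain into a single token string."""
    msg = message_from_string(raw)
    hops = []
    # Headers come back top to bottom, i.e. most recent hop first.
    for value in msg.get_all("Received", []):
        match = BY_HOST_RE.search(value)
        if match:
            hops.append(match.group(1).lower())
    return "hdr:Received:" + ":".join(hops)
```

Because the whole path is one token, a change in any hop gives a new token; whether that is too specific to be useful is something the wordlist counts would show over time.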
So, I wanted to tell you this as a "Thanks, that's a cool idea you
came up with" and to also ask you, "How was that again?"