Anderson's wrapper...

Tom Allison tom at tacocat.net
Mon Jun 11 12:02:54 CEST 2007


You're the guy!!!

I forgot your name, but know the project.

But I got an idea that is a spin-off of something you were doing with
IP addresses and domains.
I can't remember what it was called, but you created a wrapper job  
that would add some information to the email about who owned a domain  
or subnet.
What/How was that?  It was over a year ago and I don't keep mailing  
list data that long.


At the time I did some testing and found it gave a marginal improvement
for me.
And as a result of that I might have pissed you off because I may  
have come back with a trivialization of your work.
Sorry if I did.  I realize now you may have hit on something pretty  
significant and I want to thank you for it.


I recently spun off and started my own spam filter based on
bogofilter, but something that can run on a per-user basis as a
Postfix content_filter like amavisd.
The value for me is that I can bypass procmail if I want to and even  
use this at a proxy mail server located somewhere else.  I actually
do the scoring over the internet on a remote machine, which lets me
run the same criteria on primary and secondary MX machines.
It's proving itself useful and sufficiently fast, typically processing
an email in < 1 second.
But that's not where you come in.  Not just yet.

I wrote a similar thing for SpamAssassin which does per-user
bayesian statistics and found SA was painfully slow, relatively
inaccurate, and prone to a systemic design problem.
SpamAssassin still relies on a combination of static rules, point  
assignments for those rules, and a continuous stream of static rule  
updates.  Between the three, there is a lot of maintenance, tuning,  
and guesswork.  Way too much work for me.  I am pretty sure I
discarded this work, as I've no interest in supporting something that
takes too many resources to keep running.


But there is an idea that came out of SA that I thought made more
sense.  In some cases there are, or may be, static rules that are
important considerations in classifying an email.  After
all, SA still has some effect even without Bayes tokens involved (or  
much considered).  But what if the static rules were added as
Meta-Data tokens to the bayesian statistical "engine" in the same manner
that they might be an X-Header or some other text?  You started doing  
this as X-Headers so that bogofilter would see it as a token in the  
header.  I think you were on to something.

The advantage of adding meta data to the email this way is that you
essentially keep using the static rules of SA, but without any of the
guesswork about what's valuable (how many points to assign for a
hit) or even what to use (dumb rules end up within the +/- deviation
from 0.5).  Then you can add all the rules you can think of and let
the statistics decide if a rule has any value for you (as a user)
or your server (common word list).  It removes all the guesswork and
maintenance that SA suffers from, and might permit a better response to
new approaches to writing spam.  It also provides a means of
spaghetti testing (throw everything at the wall and see what sticks)  
which might be useful in a community development environment.
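
A rough sketch of that idea in Python, assuming a wrapper that runs
before the Bayesian filter sees the message.  The header name
X-Meta-Rules, the two rules, and the NAME:YES / NAME:NO token format
are made up for illustration; they are not SA rules or anything
bogofilter defines, and whether each string survives as a single token
depends on the filter's tokenizer.

```python
from email import message_from_string

# Hypothetical static rules: each maps a token name to a yes/no check.
RULES = {
    "MENTIONS_VIAGRA": lambda msg: "viagra" in msg.as_string().lower(),
    "ALL_CAPS_SUBJECT": lambda msg: (msg["Subject"] or "").isupper(),
}


def add_meta_tokens(raw_message, rules=RULES, header_name="X-Meta-Rules"):
    """Return the message with one extra header listing every rule result."""
    msg = message_from_string(raw_message)
    tokens = [
        f"{name}:{'YES' if check(msg) else 'NO'}"
        for name, check in rules.items()
    ]
    # No points or weights are assigned here; the statistics are left
    # to decide what each token is worth.
    msg[header_name] = " ".join(tokens)
    return msg.as_string()
```

The output would then be piped to the filter in place of the original
message, both for scoring and for training.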

For example, image spam seems to follow the pattern
/<body[^\>]*\>\s+<IMG/smi.  If you wrote a Meta-Data rule that simply
emitted IMAGE_SPAM_PATTERN:YES or IMAGE_SPAM_PATTERN:NO, then this
string would be provided as a token, and subsequent corrections would
quickly establish the value of this rule.  Filtering your wordlist for the
meta data tags would tell you if it's making a valuable contribution.
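
Something like the following could implement that rule and the
wordlist check.  This is only a sketch: the regex is the one above
translated to Python, the function names are mine, and
token_spamicity() is a simplified per-token estimate from spam/ham
counts, not bogofilter's exact calculation.

```python
import re

# Same pattern as above; the /smi flags map to re.S, re.M and re.I.
IMAGE_SPAM_RE = re.compile(r"<body[^>]*>\s+<img", re.S | re.M | re.I)


def image_spam_token(html):
    """Turn the static rule into a token instead of a score."""
    hit = IMAGE_SPAM_RE.search(html) is not None
    return "IMAGE_SPAM_PATTERN:" + ("YES" if hit else "NO")


def token_spamicity(spam_hits, ham_hits, spam_msgs, ham_msgs):
    """Simplified estimate of how spammy a token is (0.0 to 1.0).

    Assumes spam_msgs and ham_msgs are both greater than zero.
    """
    spam_rate = spam_hits / spam_msgs
    ham_rate = ham_hits / ham_msgs
    if spam_rate + ham_rate == 0:
        return 0.5  # never seen: no information either way
    return spam_rate / (spam_rate + ham_rate)
```

A token whose value stays close to 0.5 after training is a "dumb
rule" and contributes nothing; one that drifts toward 0 or 1 is
earning its keep.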
As another example, I store the path taken in the Received headers as
a single string (e.g.
hdr:Received:sc8-sf-mx2-b.sourceforge.net([10.3.1.92]helo=mail.sourceforge.net):sc8-sf-list1-new.sourceforge.net)
and use that for enhancing the scoring.  It
establishes a path of delivery points through the mail processing and  
determines which ones are good/bad.
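
A simplified sketch of building such a path token in Python.  It only
keeps the "from" and "by" host names and joins them with colons, so it
does not reproduce the exact string format shown above (the bracketed
IP/helo part is dropped); the function name and the regexes are
assumptions for illustration.

```python
import re
from email import message_from_string

# Host name following "from" / "by" in a Received header.
_FROM = re.compile(r"\bfrom\s+([^\s;()]+)", re.I)
_BY = re.compile(r"\bby\s+([^\s;()]+)", re.I)


def received_path_token(raw_message):
    """Collapse the Received headers into one delivery-path token."""
    msg = message_from_string(raw_message)
    hops = []
    # Each relay prepends its header, so reverse to get oldest hop first.
    for header in reversed(msg.get_all("Received", [])):
        text = " ".join(header.split())  # unfold continuation lines
        for pattern in (_FROM, _BY):
            match = pattern.search(text)
            # Skip a host that just repeats the previous hop.
            if match and (not hops or hops[-1] != match.group(1)):
                hops.append(match.group(1))
    return "hdr:Received:" + ":".join(hops)
```

Because the whole path is one token, a route that only ever delivers
spam (or only ham) is learned as a unit.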


So, I wanted to tell you this as a "Thanks, that's a cool idea you  
came up with" and to also ask you, "How was that again?"



