Anderson's wrapper...

Tue Jun 19 23:44:19 CEST 2007

Tom Allison wrote:
> You're the guy!!!
> 
> I forgot your name, but now the project.
> 
> But I got an idea that is a spin-off from something you were doing with IP 
> addresses and domains.
> I can't remember what it was called, but you created a wrapper job that 
> would add some information to the email about who owned a domain or subnet.
> What/How was that?  It was over a year ago and I don't keep mailing list 
> data that long.

I'm the guy!  I've written several Bogofilter helper scripts...

*  Spamitarium -- There were some persistent classes of spam which were 
consistently getting past bogofilter. They were usually of the sort 
where there was virtually no body content, or the body content was 
highly manipulated to seem hammy, and therefore bogofilter relied much 
more heavily on the content of the email header. When the header 
contained certain amounts of "noise", or fields which are generally 
neutral or hammy and lend little relevance, or when the header had 
certain tricks to specifically mask the sender's true identity, 
bogofilter delivered false negatives. And while this wasn't a huge 
amount, it still annoyed me, and so spamitarium was born. The MDA 
runs it on each email before handing the email to bogofilter, so the 
header is cleaned up first. See the documentation to learn how it does that.
http://orderamidchaos.com/bogofilter/spamitarium

* Stripsearch -- Even after a thorough head examination, some spams 
still fooled the lie detector. Therefore, it was necessary to get down 
and dirty with the email's body. There's little hope of determining 
automatically whether a paragraph of hammy-looking text is actually a 
ham message, taken randomly out of a book, or pieced together from a 
thesaurus. And when there's no text at all -- just an image -- a 
text-based statistical filter is simply dumbfounded. Image processing and 
character recognition are way too processor intensive for this task. But 
all spams have something fundamental in common... a link. Spam needs a 
URL for the victim to click through, or else the spammer can't get any 
information or money, and there would be no motive for spamming (except 
of the political or religious variety, which is thankfully sparse and 
easily filtered, and the stock pumping variety, which remains more of a 
problem). So, what can we do with the link? Bogofilter can already match 
domain names as a part of normal filtering, but spammers are notorious 
for moving around from server to server, changing links each time they 
send a new mass email. Therefore, it is highly unlikely that an 
individual or small organization will have seen most of these links 
before. But through the power of the internet, individuals can band 
together into a great force -- in this case, a URIBL (uniform resource 
identifier block list), which is simply a list of spamvertized addresses 
as reported by early victims or maintainers of honey pots. Stripsearch 
parses email bodies, finds the URIs and looks them up in the URIBLs 
using DNS. If an address matches, the token SPAM-ADDRESS is added to the 
email to improve bogofilter's results, and a link is included to look up 
which lists are causing the match. Stripsearch won't decide ham or spam 
all by itself... it simply improves bogofilter's ability to decide by 
providing an extra piece of information.
http://orderamidchaos.com/bogofilter/stripsearch

* Bfproxy -- Training bogofilter generally requires ssh'ing to your 
account and manually typing commands to process emails which were 
classified incorrectly. This is unacceptable for average users. Other 
people have come up with a plethora of ways around this, but none to my 
satisfaction. Since I use mbox style mailboxes with POP3, some of the 
more common methods of dragging emails into IMAP maildir folders 
wouldn't work for my situation. And I didn't like the idea of creating 
special email accounts for each user in order to send corrections. So I 
built bfproxy, which allows users to send corrections to bogofilter as 
simple email attachments to their existing address with some additional 
details included. It couldn't be easier. Moreover, I've included some 
additional functionality like exhaustive training which improves overall 
effectiveness.
http://orderamidchaos.com/bogofilter/bfproxy
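
The Stripsearch lookup described above (find the URIs in the body, query a URIBL over DNS, add a token on a match) can be sketched roughly as follows. This is a minimal illustration, not the actual Stripsearch code: the zone name in URIBL_ZONES, the function names, and the simple URL regex are all assumptions, and a real implementation would normally reduce each hostname to its registered domain before querying. The resolver is passed in as a parameter so the logic can be exercised without touching the network.

```python
import re
import socket

# Hypothetical zone list; the post does not say which URIBLs Stripsearch queries.
URIBL_ZONES = ["multi.uribl.com"]

# Deliberately simple: captures the host part of http(s) URLs only.
URL_HOST_RE = re.compile(r"https?://([A-Za-z0-9.-]+)", re.IGNORECASE)


def extract_hosts(body):
    """Return the distinct lowercased hostnames of http(s) URLs, in order."""
    seen = []
    for host in URL_HOST_RE.findall(body):
        host = host.lower().strip(".")
        if host and host not in seen:
            seen.append(host)
    return seen


def is_listed(host, zone, resolve=socket.gethostbyname):
    """Block lists conventionally answer with a 127.x.x.x address on a hit."""
    try:
        answer = resolve(f"{host}.{zone}")
    except OSError:  # covers socket.gaierror, i.e. no such name
        return False
    return answer.startswith("127.")


def spam_address_token(body, zones=URIBL_ZONES, resolve=socket.gethostbyname):
    """Return "SPAM-ADDRESS" if any URL host in the body is listed, else None."""
    for host in extract_hosts(body):
        if any(is_listed(host, zone, resolve) for zone in zones):
            return "SPAM-ADDRESS"
    return None
```

The returned token would then be written into the message before bogofilter sees it, so the classifier learns how much weight a URIBL hit deserves rather than treating it as a verdict.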

> At the time I did some testing and found it gave marginal improvement for me.
> And as a result of that I might have pissed you off because I may have 
> come back with a trivialization of your work.
> Sorry if I did.  I realize now you may have hit on something pretty 
> significant and I want to thank you for it.

I've always known it was significant and it has been improving my spam 
filtering since I wrote it.  I made it available to the rest of the 
community in the hopes that others would find it useful, but frankly I 
couldn't care in the least whether anyone actually did, as it works for 
me and that's all that really matters.

> I recently spun off and started my own spam filter based on bogofilter 
> but something that can run on a per-user basis as a postgres 
> content_filter like amavisd.
> The value for me is that I can bypass procmail if I want to and even use 
> this at a proxy mail server located somewhere else.  I actually do the 
> scoring over the internet to a remote machine; this lets me run the same 
> criteria on primary and secondary MX machines.
> It's proving itself useful and sufficiently fast to process email at a 
> rate of typically < 1 second per message.
> But that's not where you come in.  Not just yet.
> 
> I wrote a similar thing for SpamAssassin which does per-user Bayesian 
> statistics and found SA was painfully slow, relatively inaccurate, and 
> prone to a systemic problem in design.
> SpamAssassin still relies on a combination of static rules, point 
> assignments for those rules, and a continuous stream of static rule 
> updates.  Between the three, there is a lot of maintenance, tuning, and 
> guesswork.  Way too much work for me.  I am pretty sure I discarded 
> this work as I've no interest in supporting something that takes too many 
> resources to keep running.
>  
> But there is an idea that came out of SA that I thought made more 
> sense.  In some cases there are, or may be, static rules that are 
> important considerations to the determination of an email.  After all, 
> SA still has some effect even without Bayes tokens involved (or much 
> considered).  But what if the static rules were added as Meta-Data 
> tokens to the bayesian statistical "engine" in the same manner that they 
> might be an X-Header or some other text?  You started doing this as 
> X-Headers so that bogofilter would see it as a token in the header.  I 
> think you were on to something.

Yeah, me too.  It works quite well.

> The advantage of this approach to adding meta data to the email is 
> that you essentially keep using the static rules of SA but without any of the 
> guesswork on what's valuable (how many points to assign for a hit) or 
> even what to use (dumb rules end up within the +/- deviation from 0.5).  
> Then you can add all the rules you can think of and let the statistics 
> decide if a rule has any value for you (as a user) or your server 
> (common word list).  It removes all the guesswork and maintenance that 
> SA suffers from but might permit better response to new approaches to 
> writing spam.  It also provides a means of spaghetti testing (throw 
> everything at the wall and see what sticks) which might be useful in a 
> community development environment.

Statistical testing is certainly a way to determine if rules are 
effective or not.
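
As a rough illustration of that kind of check, the sketch below estimates how far a rule token sits from the neutral score of 0.5, given how many spam and ham messages contained it. It is a simplified ratio estimate, not bogofilter's exact calculation, and the function names and the default deviation of 0.1 are assumptions chosen for the example.

```python
def spamicity(spam_hits, ham_hits, spam_total, ham_total):
    """Simplified estimate of how spammy a token is (0.0 = ham, 1.0 = spam).

    spam_hits / ham_hits: messages of each class containing the token.
    spam_total / ham_total: messages of each class trained (must be > 0).
    """
    spam_rate = spam_hits / spam_total
    ham_rate = ham_hits / ham_total
    if spam_rate + ham_rate == 0:
        return 0.5  # never seen: treat the token as neutral
    return spam_rate / (spam_rate + ham_rate)


def is_informative(score, deviation=0.1):
    """A rule whose token stays within +/- deviation of 0.5 adds nothing."""
    return abs(score - 0.5) >= deviation
```

A token seen in 90 of 100 spams and 10 of 100 hams scores about 0.9 and is worth keeping, while one seen equally often in both scores 0.5 and can be dropped without loss.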

> For example, image spam seems to match the pattern /<body[^\>]*\>\s+<IMG/smi.  
> If you wrote a Meta-Data rule that simply stated: IMAGE_SPAM_PATTERN:YES 
> or IMAGE_SPAM_PATTERN:NO then this string is provided as a token and 
> subsequent corrections would quickly establish the value of this rule.  
> Filtering your wordlist for the meta data tags would tell you if it's 
> making a valuable contribution.
> As another example, I store the path taken in the Received headers as a 
> single string (eg:  
> hdr:Received:sc8-sf-mx2-b.sourceforge.net([10.3.1.92]helo=mail.sourceforge.net):sc8-sf-list1-new.sourceforge.net) 
> and use that for enhancing the scoring.  It establishes a path of 
> delivery points through the mail processing and determines which ones 
> are good/bad.
> 
> So, I wanted to tell you this as a "Thanks, that's a cool idea you came 
> up with" and to also ask you, "How was that again?"
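
The two meta-data ideas quoted above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the Received token joins just the "by" host of each hop rather than reproducing the exact format quoted, and neither function is taken from spamitarium or from the filter described in the quote.

```python
import re
from email import message_from_string

# The quoted pattern: a <body> tag followed only by whitespace and an <img>.
IMAGE_SPAM_RE = re.compile(r"<body[^>]*>\s+<img", re.IGNORECASE | re.DOTALL)

# Host named after "by" in a Received header (the server that accepted the hop).
RECEIVED_BY_RE = re.compile(r"\bby\s+([^\s;]+)", re.IGNORECASE)


def image_spam_token(body):
    """Return the rule result as a single token for the statistical filter."""
    hit = IMAGE_SPAM_RE.search(body) is not None
    return "IMAGE_SPAM_PATTERN:" + ("YES" if hit else "NO")


def received_path_token(raw_message):
    """Join the 'by' host of every Received header into one path token."""
    msg = message_from_string(raw_message)
    hops = []
    for header in msg.get_all("Received", []):
        match = RECEIVED_BY_RE.search(header)
        if match:
            hops.append(match.group(1).lower())
    if not hops:
        return None
    return "hdr:Received:" + ":".join(hops)
```

Either token can be added to the message as an extra header line before scoring, after which training decides whether it carries any weight.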

The code is linked above.  I do lots of received line processing in 
spamitarium which has proven quite useful.  A review of the tokens in my 
wordlist bears that out.  Feel free to incorporate some of my techniques 
in your program, but I would appreciate a mention and link.  And let me 
know how it goes.

Tom