Anderson's wrapper...
Tom Anderson
tanderso at oac-design.com
Tue Jun 19 23:44:19 CEST 2007
Tom Allison wrote:
> You're the guy!!!
>
> I forgot your name, but not the project.
>
> But I got an idea that is a spin-off from something you were doing with IP
> addresses and domains.
> I can't remember what it was called, but you created a wrapper job that
> would add some information to the email about who owned a domain or subnet.
> What/How was that? It was over a year ago and I don't keep mailing list
> data that long.
I'm the guy! I've written several Bogofilter helper scripts...
* Spamitarium -- There were some persistent classes of spam which were
consistently getting past bogofilter. They were usually of the sort
where there was virtually no body content, or the body content was
highly manipulated to seem hammy, and therefore bogofilter relied much
more heavily on the content of the email header. When the header
contained certain amounts of "noise", or fields which are generally
neutral or hammy and lend little relevance, or when the header had
certain tricks to specifically mask the sender's true identity,
bogofilter delivered false negatives. And while this wasn't a huge
amount, it still annoyed me, and so spamitarium was born. The MDA runs
it on each message before passing the message to bogofilter, so the
email header is cleaned up first. See the documentation to learn how it
does that.
http://orderamidchaos.com/bogofilter/spamitarium
* Stripsearch -- Even after a thorough head examination, some spams
still fooled the lie detector. Therefore, it was necessary to get down
and dirty with the email's body. There's little hope of determining
automatically whether a paragraph of hammy-looking text is actually a
ham message, taken randomly out of a book, or pieced together from a
thesaurus. And when there's no text at all -- just an image -- a
text-based statistical filter is simply dumbfounded. Image processing and
character recognition are way too processor intensive for this task. But
all spams have something fundamental in common... a link. Spam needs a
URL for the victim to click through, or else the spammer can't get any
information or money, and there would be no motive for spamming (except
of the political or religious variety, which is thankfully sparse and
easily filtered, and the stock pumping variety, which remains more of a
problem). So, what can we do with the link? Bogofilter can already match
domain names as a part of normal filtering, but spammers are notorious
for moving around from server to server, changing links each time they
send a new mass email. Therefore, it is highly unlikely that an
individual or small organization will have seen most of these links
before. But through the power of the internet, individuals can band
together into a great force -- in this case, a URIBL (uniform resource
identifier block list), which is simply a list of spamvertized addresses
as reported by early victims or maintainers of honey pots. Stripsearch
parses email bodies, finds the URIs and looks them up in the URIBLs
using DNS. If an address matches, the token SPAM-ADDRESS is added to the
email to improve bogofilter's results, and a link is included to look up
which lists are causing the match. Stripsearch won't decide ham or spam
all by itself... it simply improves bogofilter's ability to decide by
providing an extra piece of information.
http://orderamidchaos.com/bogofilter/stripsearch
* Bfproxy -- Training bogofilter generally requires ssh'ing to your
account and manually typing commands to process emails which were
classified incorrectly. This is unacceptable for average users. Other
people have come up with a plethora of ways around this, but none to my
satisfaction. Since I use mbox style mailboxes with POP3, some of the
more common methods of dragging emails into IMAP maildir folders
wouldn't work for my situation. And I didn't like the idea of creating
special email accounts for each user in order to send corrections. So I
built bfproxy, which allows users to send corrections to bogofilter as
simple email attachments to their existing address with some additional
details included. It couldn't be easier. Moreover, I've included some
additional functionality like exhaustive training which improves overall
effectiveness.
http://orderamidchaos.com/bogofilter/bfproxy
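
As a rough illustration of the Stripsearch idea described above (this is
not the actual Stripsearch code, which is linked above), the lookup could
be sketched in Python as follows. The zone name multi.uribl.example, the
function names, and the naive "last two labels" domain reduction are all
assumptions made for the example; a real block list has its own zone and
usage policy.

```python
import re
import socket

# Hypothetical URIBL zone; substitute the zone of a list you may query.
URIBL_ZONE = "multi.uribl.example"

# Captures the host part of http/https links found in the body.
URL_HOST_RE = re.compile(r"https?://([A-Za-z0-9.-]+)", re.IGNORECASE)


def extract_domains(body):
    """Return the set of domains found in URLs in the message body."""
    domains = set()
    for host in URL_HOST_RE.findall(body):
        labels = host.lower().strip(".").split(".")
        if len(labels) >= 2:
            # Naive reduction to the last two labels. This is wrong for
            # suffixes such as co.uk and for bare IP addresses.
            domains.add(".".join(labels[-2:]))
    return domains


def query_name(domain, zone=URIBL_ZONE):
    """Build the DNS name that is looked up for a domain in a URIBL zone."""
    return f"{domain}.{zone}"


def is_listed(domain, zone=URIBL_ZONE, resolver=socket.gethostbyname):
    """True if the lookup of domain.zone resolves, i.e. the domain is listed."""
    try:
        resolver(query_name(domain, zone))
        return True
    except OSError:
        # A failed lookup (socket.gaierror) is treated as "not listed".
        return False


def tag_body(body, resolver=socket.gethostbyname):
    """Append a SPAM-ADDRESS token if any linked domain is listed."""
    hits = sorted(d for d in extract_domains(body)
                  if is_listed(d, resolver=resolver))
    if hits:
        body += "\nSPAM-ADDRESS " + " ".join(hits) + "\n"
    return body
```

The resolver is passed in as a parameter so the logic can be tried out
without touching the network. The tagged message would then be handed to
bogofilter as usual, which makes its own decision with the extra token.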
> At the time I did some testing and found it gave only a marginal improvement for me.
> And as a result of that I might have pissed you off because I may have
> come back with a trivialization of your work.
> Sorry if I did. I realize now you may have hit on something pretty
> significant and I want to thank you for it.
I've always known it was significant and it has been improving my spam
filtering since I wrote it. I made it available to the rest of the
community in the hopes that others would find it useful, but frankly I
couldn't care in the least whether anyone actually did, as it works for
me and that's all that really matters.
> I recently spun off and started my own spam filter based on bogofilter
> but something that can run on a per-user basis as a Postfix
> content_filter like amavisd.
> The value for me is that I can bypass procmail if I want to and even use
> this at a proxy mail server located somewhere else. I actually do the
> scoring over the internet to a remote machine, this lets me run the same
> criteria on primary and secondary MX machines.
> It's proving itself useful and sufficiently fast to process email at a
> rate of typically < 1 second.
> But that's not where you come in. Not just yet.
>
> I wrote a similar thing for SpamAssassin which does per-user Bayesian
> statistics and found SA was painfully slow, relatively inaccurate, and
> prone to a systemic problem in design.
> SpamAssassin still relies on a combination of static rules, point
> assignments for those rules, and a continuous stream of static rule
> updates. Between the three, there is a lot of maintenance, tuning, and
> guess work. Way too much work for me. I am pretty sure I discarded
> this work as I've no interest in supporting something that takes too many
> resources to keep running.
>
> But there is an idea that came out of SA that I thought made more
> sense. In some cases there are, or may be, static rules that are
> important considerations to the determination of an email. After all,
> SA still has some effect even without Bayes tokens involved (or much
> considered). But what if the static rules were added as Meta-Data
> tokens to the bayesian statistical "engine" in the same manner that they
> might be an X-Header or some other text? You started doing this as
> X-Headers so that bogofilter would see it as a token in the header. I
> think you were on to something.
Yeah, me too. It works quite well.
> The advantage to doing this approach to adding meta data to the email is
> to essentially keep using the static rules of SA but without any of the
> guess work on what's valuable (how many points to assign for a Hit) or
> even what to use (dumb rules end up within the +/- deviation from 0.5).
> Then you can add all the rules you can think of and let the statistics
> decide if the rules have any value for you (as a user) or your server
> (common word list). It removes all the guesswork and maintenance that
> SA suffers from but might permit better response to new approaches to
> writing spam. It also provides a means of spaghetti testing (throw
> everything at the wall and see what sticks) which might be useful in a
> community development environment.
Statistical testing is certainly a way to determine if rules are
effective or not.
> For example, image spam seems to match the pattern /<body[^\>]*\>\s+<IMG/smi.
> If you wrote a Meta-Data rule that simply stated: IMAGE_SPAM_PATTERN:YES
> or IMAGE_SPAM_PATTERN:NO then this string is provided as a token and
> subsequent corrections would quickly establish the value of this rule.
> Filtering your wordlist for the meta data tags would tell you if it's
> making a valuable contribution.
> As another example, I store the path taken in the Received headers as a
> single string (eg:
> hdr:Received:sc8-sf-mx2-b.sourceforge.net([10.3.1.92]helo=mail.sourceforge.net):sc8-sf-list1-new.sourceforge.net)
> and use that for enhancing the scoring. It establishes a path of
> delivery points through the mail processing and determines which ones
> are good/bad.
>
> So, I wanted to tell you this as a "Thanks, that's a cool idea you came
> up with" and to also ask you, "How was that again?"
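
The IMAGE_SPAM_PATTERN rule quoted above might look something like this
in Python. It is only a sketch of the idea: the header name X-Meta-Rules
and the function names are made up for illustration, and the regex is the
one from the quoted message with its /smi flags translated to Python. How
the downstream filter splits the resulting string into tokens depends on
that filter's tokenizer.

```python
import re
from email import message_from_string

# The quoted pattern: a <body> tag followed only by whitespace and an <IMG>.
IMAGE_SPAM_RE = re.compile(r"<body[^>]*>\s+<img",
                           re.IGNORECASE | re.DOTALL | re.MULTILINE)

# Rule name -> predicate over the decoded body text. Add more rules here
# and let the statistics decide which ones carry any weight.
RULES = {
    "IMAGE_SPAM_PATTERN": lambda text: bool(IMAGE_SPAM_RE.search(text)),
}


def body_text(msg):
    """Concatenate the decoded text/* parts of a message."""
    parts = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        try:
            parts.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset name in the message; fall back to UTF-8.
            parts.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(parts)


def add_rule_tokens(raw):
    """Return the message with one NAME:YES or NAME:NO string per rule."""
    msg = message_from_string(raw)
    text = body_text(msg)
    tokens = [f"{name}:{'YES' if rule(text) else 'NO'}"
              for name, rule in RULES.items()]
    msg["X-Meta-Rules"] = " ".join(tokens)  # hypothetical header name
    return msg.as_string()
```

Because every rule always emits either YES or NO, both outcomes are
available to the statistics and no points need to be assigned by hand.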
The code is linked above. I do lots of received line processing in
spamitarium which has proven quite useful. A review of the tokens in my
wordlist bears that out. Feel free to incorporate some of my techniques
in your program, but I would appreciate a mention and link. And let me
know how it goes.
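
For the single-string Received path quoted earlier, a minimal Python
sketch might look like the following. It is neither spamitarium's
processing nor the quoted filter's exact format: the function name, the
choice to keep only each hop's "from" host, and the ":" joining are
assumptions for illustration.

```python
import re
from email import message_from_string

# First "from <host>" clause of a Received header value.
FROM_HOST_RE = re.compile(r"\bfrom\s+(\S+)", re.IGNORECASE)


def received_path_token(raw):
    """Join the 'from' hosts of all Received headers into one string.

    Headers are taken in the order they appear in the message, which is
    most recent hop first. Returns None if no hop could be extracted.
    """
    msg = message_from_string(raw)
    hops = []
    for value in msg.get_all("Received", []):
        match = FROM_HOST_RE.search(value)
        if match:
            hops.append(match.group(1).lower())
    if not hops:
        return None
    return "hdr:Received:" + ":".join(hops)
```

The resulting string can be added to the message as one more piece of
meta data, so that whole delivery paths, rather than individual host
names, are scored as good or bad.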
Tom