rather unrelated but still...
Tom Allison
tom at tacocat.net
Fri Feb 2 05:26:33 CET 2007
This is all rather unrelated to bogofilter directly but I thought I would pass
this on to my favorite spam filter.
In the last two weeks I wrote a TCP/IP based (network socket) content filter for
my postfix installation. It runs a daemonized version of spamassassin's
spamc/spamd application with a twist: it recognizes users and changes the
database accordingly. I have not been able to find this anywhere else except in
dspam, which has other problems that keep me from recommending it.
I just wanted to share some of this here as an architectural proof of concept.
I can barely get past "hello world" in C so I can't be of much direct help. You
don't want me trying to code anything. I'm just as likely to write to the first
sector on my hard drive as I am to make a database connection -- I don't know
what I'm doing and won't pretend to either.
But here's what I've done. Perhaps it can serve as a proof of concept.
I take the SMTP message from an inline content_filter and pick up the
information on who it is FROM, TO, and the DATA.
foreach username in the TO list (john at local, dick at local, jane at local):
Submit the data and the username to a daemon that runs the Bayesian stats based
on tokens for that user. spamassassin returns a marked-up message by default.
I then deliver one marked-up message for each username it's supposed to be
delivered to.
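The per-recipient loop above could be sketched roughly like this in Python. This is only an illustration, not the filter I actually run: filter_for_recipients and fake_classify are made-up names, and fake_classify stands in for the real call to the daemon.

```python
from typing import Callable, Dict, Iterable


def filter_for_recipients(
    recipients: Iterable[str],
    message: str,
    classify: Callable[[str, str], str],
) -> Dict[str, str]:
    """Return one marked-up copy of `message` per recipient."""
    marked = {}
    for username in recipients:
        # Each user has their own token statistics, so the same message
        # may be scored differently for each recipient.
        marked[username] = classify(username, message)
    return marked


def fake_classify(username: str, message: str) -> str:
    # Hypothetical stand-in for the daemon call: it only prepends a header.
    return f"X-Filtered-For: {username}\n{message}"


copies = filter_for_recipients(
    ["john@local", "dick@local", "jane@local"],
    "Subject: hi\n\nhello\n",
    fake_classify,
)
```

Each entry in copies would then be handed back to the mail system for delivery to that one user.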
If it ever comes up to try anything like this for bogofilter this might make a
suitable conceptual starting point.
I don't think it would necessarily be consistent with bogofilter to be doing all
the SMTP transactions as well. But a TCP/IP interface that accepts information in
the structure of a hash {username=>'john.doe at foo.bar', message=>$message},
along with some parameters (auto-learn and such), and returns the message as it
does today under pass_through would suffice, and someone else can worry about
how to get it working. This is conceptually what spamc/spamd do.
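To make the idea of such an interface concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the one-JSON-object-per-line framing, the names FilterHandler, submit, score_message and demo, and the header that score_message adds. It is not the spamc/spamd protocol and not anything bogofilter provides today.

```python
import json
import socket
import socketserver
import threading


def score_message(username: str, message: str) -> str:
    # Hypothetical placeholder for the per-user Bayesian classifier.
    return f"X-Filter-User: {username}\n{message}"


class FilterHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Assumed framing: one JSON object per line,
        # {"username": ..., "message": ...}
        request = json.loads(self.rfile.readline().decode("utf-8"))
        reply = score_message(request["username"], request["message"])
        self.wfile.write(json.dumps({"message": reply}).encode("utf-8") + b"\n")


def submit(host: str, port: int, username: str, message: str) -> str:
    """Send one request and return the marked-up message."""
    payload = json.dumps({"username": username, "message": message}) + "\n"
    with socket.create_connection((host, port)) as sock:
        sock.sendall(payload.encode("utf-8"))
        with sock.makefile("r", encoding="utf-8") as reader:
            return json.loads(reader.readline())["message"]


def demo() -> str:
    # Port 0 asks the OS for any free port.
    with socketserver.TCPServer(("127.0.0.1", 0), FilterHandler) as server:
        threading.Thread(target=server.serve_forever, daemon=True).start()
        try:
            host, port = server.server_address
            return submit(host, port, "john.doe@foo.bar", "Subject: hi\n\nhello\n")
        finally:
            server.shutdown()
```

The point of the design is that the caller only needs to know a username and a message; how the mail got there (SMTP, procmail, whatever) stays somebody else's problem.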
However, based on other things I've tried in the past with varying success, I
would change the database schema to something like this:
The tokens are set up using a three-table schema:
users: idx, username (email address), H_msgs, S_msgs
user_tokens: useridx, tokenidx, h_msgs, s_msgs
tokens: tokenidx, token (the actual token)
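As a rough sketch, the three tables could be declared and queried as below. Several things here are assumptions for illustration only: SQLite (through Python's sqlite3 module) stands in for PostgreSQL so the example is self-contained, the column types, constraints and the token_counts helper are invented, and h_msgs/s_msgs are read as ham and spam message counts.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (
    idx      INTEGER PRIMARY KEY,
    username TEXT NOT NULL UNIQUE,        -- email address
    h_msgs   INTEGER NOT NULL DEFAULT 0,  -- assumed: ham messages seen
    s_msgs   INTEGER NOT NULL DEFAULT 0   -- assumed: spam messages seen
);
CREATE TABLE tokens (
    tokenidx INTEGER PRIMARY KEY,
    token    TEXT NOT NULL UNIQUE         -- the actual token
);
CREATE TABLE user_tokens (
    useridx  INTEGER NOT NULL REFERENCES users(idx),
    tokenidx INTEGER NOT NULL REFERENCES tokens(tokenidx),
    h_msgs   INTEGER NOT NULL DEFAULT 0,
    s_msgs   INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (useridx, tokenidx)
);
"""


def token_counts(conn, username, words):
    """Return {token: (ham_count, spam_count)} for one user."""
    marks = ",".join("?" for _ in words)
    rows = conn.execute(
        f"""SELECT t.token, ut.h_msgs, ut.s_msgs
            FROM user_tokens ut
            JOIN users u ON u.idx = ut.useridx
            JOIN tokens t ON t.tokenidx = ut.tokenidx
            WHERE u.username = ? AND t.token IN ({marks})""",
        [username, *words],
    )
    return {token: (h, s) for token, h, s in rows}


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO users (idx, username) VALUES (1, 'john@local')")
conn.execute("INSERT INTO tokens (tokenidx, token) VALUES (1, 'viagra'), (2, 'meeting')")
conn.execute("INSERT INTO user_tokens VALUES (1, 1, 0, 7), (1, 2, 5, 0)")
```

Each token string is stored once in tokens, and the per-user counts live in the narrow user_tokens table, which is the part that gets hit on every message.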
This gives you excellent performance, and with enough resources the database
will actually keep the result set for a frequently used set of users in RAM,
making it noticeably faster through caching. At least that's what I've observed
in PostgreSQL. It's also different from the spamassassin and dspam database
schemas.
If someone has a better idea...
I don't know how much of a database guru anyone is here. I've noticed that a
lot of the database centered applications I've looked into have a moderate to
extremely poor database design.
Databases are kind of what I do...
Sorry to take up everyone's time... Thanks for listening.