rather unrelated but still...

Fri Feb 2 05:26:33 CET 2007

This is all rather unrelated to bogofilter directly but I thought I would pass 
this on to my favorite spam filter.

In the last two weeks I wrote a TCP/IP-based (network socket) content filter for 
my postfix installation that runs a daemonized version of spamassassin's 
spamc/spamd application with a twist.  It recognizes users and changes the 
database accordingly.  This is not something I have been able to find 
elsewhere, except in dspam, which has other problems that keep me from 
recommending it.

I just wanted to share some of this here as an architectural proof of concept. 
I can barely get past "hello world" in C so I can't be of much direct help.  You 
don't want me trying to code anything.  I'm just as likely to write to the first 
sector on my hard drive as I am to make a database connection -- I don't know 
what I'm doing and won't pretend to either.

But here's what I've done.  Perhaps it can serve as a proof of concept.

I take the SMTP message from an inline content_filter and pick up who it is 
FROM, who it is TO, and the DATA.

foreach username in the TO list: (john at local, dick at local, jane at local)

Submit the data and the username to a daemon that runs the Bayesian stats based 
on tokens for that user.  spamassassin returns a marked up message by default.

I then deliver one marked up message for each username it's supposed to be 
delivered to.
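
As a rough illustration of the per-recipient loop above, here is a minimal 
Python sketch.  It is not my actual filter.  The names filter_for_recipients 
and toy_scorer and the X-Example-Spam header are made up for the example, and 
the scorer is only a stand-in for the daemon that holds each user's tokens.

```python
from typing import Callable, Dict, Iterable


def filter_for_recipients(
    recipients: Iterable[str],
    message: str,
    score_for_user: Callable[[str, str], str],
) -> Dict[str, str]:
    """Return one marked-up copy of the message per recipient.

    score_for_user(username, message) is assumed to return the message
    with that user's verdict added, the way a per-user daemon would.
    """
    marked = {}
    for username in recipients:
        # Each recipient is scored against his or her own token data,
        # so the same message can come back marked differently.
        marked[username] = score_for_user(username, message)
    return marked


def toy_scorer(username: str, message: str) -> str:
    """Hypothetical stand-in for the per-user Bayesian daemon."""
    # Pretend only john has trained "lottery" as a spam token.
    is_spam = username == "john@local" and "lottery" in message.lower()
    verdict = "Yes" if is_spam else "No"
    return f"X-Example-Spam: {verdict}\n{message}"


if __name__ == "__main__":
    copies = filter_for_recipients(
        ["john@local", "dick@local", "jane@local"],
        "Subject: You won the lottery\n\nClaim your prize.",
        toy_scorer,
    )
    for user, text in copies.items():
        print(user, "->", text.splitlines()[0])
```

The point is only that the message goes through the scorer once per username 
and each user gets his or her own copy back.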

If it ever comes up to try anything like this for bogofilter this might make a 
suitable conceptual starting point.

I don't think it would necessarily be consistent with bogofilter to be doing all 
the SMTP transactions as well.  But a TCP/IP interface that accepts information 
in the structure of a hash {username=>'john.doe at foo.bar', message=>$message}, 
with some of today's parameters (auto-learn and such), and returns the message 
as it does today under pass_through would suffice, and someone else can worry 
about how to get it working.  This is conceptually what spamc/spamd do.
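
To make the hash idea a little more concrete, here is one way such a request 
could be put on the wire, sketched in Python.  Everything here is an 
assumption for illustration: bogofilter has no such interface, this is not the 
spamc/spamd protocol, and the length-prefixed JSON framing and the autolearn 
field name are just one arbitrary choice.

```python
import json
import struct


def encode_request(username: str, message: str, autolearn: bool = False) -> bytes:
    """Pack a request as a 4-byte big-endian length followed by JSON."""
    body = json.dumps(
        {"username": username, "message": message, "autolearn": autolearn}
    ).encode("utf-8")
    return struct.pack("!I", len(body)) + body


def decode_request(frame: bytes) -> dict:
    """Unpack a frame produced by encode_request."""
    if len(frame) < 4:
        raise ValueError("frame too short for length prefix")
    (length,) = struct.unpack("!I", frame[:4])
    body = frame[4:4 + length]
    if len(body) != length:
        raise ValueError("truncated frame")
    return json.loads(body.decode("utf-8"))


if __name__ == "__main__":
    frame = encode_request("john.doe@foo.bar", "Subject: hi\n\nhello", autolearn=True)
    print(decode_request(frame))
```

The daemon would read the length, read that many bytes, look up the user, and 
write the marked up message back on the same connection.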

However, from other things I've tried to do in the past with varying success, I 
would change the database schema to something like this:

The tokens are set up using a three-table schema:
users:  idx, username (email address), h_msgs, s_msgs
user_tokens: useridx, tokenidx, h_msgs, s_msgs
tokens:  tokenidx, token (the actual token)
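
Spelled out, the three tables might look like the sketch below.  I've used 
Python's built-in sqlite3 module only so the example runs without a server; 
the column types, the constraints, and the helper names record_token and 
token_counts are my assumptions for illustration, and the same layout would 
need the usual type adjustments for PostgreSQL.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (
    idx      INTEGER PRIMARY KEY,
    username TEXT NOT NULL UNIQUE,       -- email address
    h_msgs   INTEGER NOT NULL DEFAULT 0, -- ham messages seen for this user
    s_msgs   INTEGER NOT NULL DEFAULT 0  -- spam messages seen for this user
);
CREATE TABLE tokens (
    tokenidx INTEGER PRIMARY KEY,
    token    TEXT NOT NULL UNIQUE        -- the actual token, stored once
);
CREATE TABLE user_tokens (
    useridx  INTEGER NOT NULL REFERENCES users(idx),
    tokenidx INTEGER NOT NULL REFERENCES tokens(tokenidx),
    h_msgs   INTEGER NOT NULL DEFAULT 0,
    s_msgs   INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (useridx, tokenidx)
);
"""


def record_token(conn, username, token, is_spam):
    """Count one ham or spam occurrence of a token for one user."""
    conn.execute("INSERT OR IGNORE INTO users (username) VALUES (?)", (username,))
    conn.execute("INSERT OR IGNORE INTO tokens (token) VALUES (?)", (token,))
    useridx = conn.execute(
        "SELECT idx FROM users WHERE username = ?", (username,)
    ).fetchone()[0]
    tokenidx = conn.execute(
        "SELECT tokenidx FROM tokens WHERE token = ?", (token,)
    ).fetchone()[0]
    conn.execute(
        "INSERT OR IGNORE INTO user_tokens (useridx, tokenidx) VALUES (?, ?)",
        (useridx, tokenidx),
    )
    # The column name comes from a fixed pair, never from user input.
    column = "s_msgs" if is_spam else "h_msgs"
    conn.execute(
        f"UPDATE user_tokens SET {column} = {column} + 1 "
        "WHERE useridx = ? AND tokenidx = ?",
        (useridx, tokenidx),
    )


def token_counts(conn, username, token):
    """Return (h_msgs, s_msgs) for a user's token, or (0, 0) if unseen."""
    row = conn.execute(
        "SELECT ut.h_msgs, ut.s_msgs FROM user_tokens ut "
        "JOIN users u ON u.idx = ut.useridx "
        "JOIN tokens t ON t.tokenidx = ut.tokenidx "
        "WHERE u.username = ? AND t.token = ?",
        (username, token),
    ).fetchone()
    return tuple(row) if row else (0, 0)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    record_token(conn, "john@local", "lottery", True)
    print(token_counts(conn, "john@local", "lottery"))
```

Each token string is stored once in tokens, and user_tokens holds only small 
integer rows per user.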

This gives you excellent performance and, with enough resources, the database 
will actually keep the result set for a frequently used set of users in RAM, 
making it noticeably faster with caching.  At least that's what I've observed 
in PostgreSQL.  It's also different from the spamassassin and dspam database 
schemas.  If someone has a better idea...

I don't know how much of a database guru anyone is here.  I've noticed that a 
lot of the database-centered applications I've looked into have a moderately to 
extremely poor database design.
Databases are kind of what I do...

Sorry to take up everyone's time...  Thanks for listening.