Bogofilter with a specified wordlist.db

Fri Apr 7 00:52:51 CEST 2006

mouss wrote:
> Tom Anderson wrote:
>> If you don't want to train the filter, don't use Bogofilter or any 
>> other statistical filter.  You won't be happy with the result, even if 
>> you pair it with a procedural filter.
> 
> do you have arguments here or is this pure speculation?

This is based on the theory of statistical filtering and on personal 
experience.  When my users throw out misclassified messages instead of 
training on them, their accuracy declines.  For some users, this is OK.  
They let the accuracy decrease until it annoys them, then they train on 
a batch of their errors and bring the accuracy back up.  Those who 
always train as a matter of course receive virtually no spam, and thus 
they don't need to train very much.  But if you don't want to train at 
all, your accuracy will slide until the classifications are largely 
false negatives (FNs) or false positives (FPs).  If you have -u 
(autoupdate) enabled, I believe they will eventually be all one or the 
other, because the filter keeps registering its own mistakes.  
Eventually there will be no filtering at all.  If you're pairing it 
with a procedural filter for some reason, only the procedural filter 
will be doing any filtering at that point.

>> If you don't want to train, Bogofilter should be skipped altogether... 
>> just use Spam Assassin exclusively.  The idea of "boosting" Bogofilter 
>> with Spam Assassin is like hand-holding some grandma through her first 
>> time creating a Word document and then walking away and letting her 
>> loose on a Sendmail config.
> 
> ahem? when you train bogofilter, you are "boosting" it with a human 
> filter (and human filters aren't perfect). the human filter may be 
> better than an automated one, but it is also more expensive (if the 
> machine can work...) and doesn't require "education"...

The recipient is not a filter; a filter is something or someone who 
parses the input before the recipient receives it.  I would never send 
my email through a human filter and expect even remotely acceptable 
results.  Training is done by the recipient, not a filter.  If you try 
to train a statistical filter with a less accurate filter, you're just 
compounding inaccuracies.  You're shooting yourself in the foot.

>> This can certainly be a little harder to achieve in a multiuser setup,   
> 
> This is what we are talking about. if it's just for me, I have a large 
> corpus of personal mail, both ham and spam and I know how to train. but 
> the problem is with "the others":)

I have my users train their own wordlists.  They are responsible for the 
accuracy they receive, by choosing whether or not to be vigilant about 
training on errors.  I use "bfproxy" to let them train via email (it's 
in your Bogofilter contrib directory).  They simply drag their errors to 
a folder and then occasionally forward the contents of that folder to 
their bfproxy address (added to their address books for simplicity).  It 
has worked fine for me.
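
For anyone without bfproxy, here is a minimal sketch of what training on 
error from such a folder could look like.  It is not how bfproxy itself 
works.  It assumes one message per file (Maildir-style), that bogofilter 
is on the PATH, and that the wordlist directory passed to -d already 
exists; the function names and folder layout are hypothetical.  The -s 
and -n flags register a message as spam or ham, and -Ns / -Sn move a 
message that -u already registered under the wrong class.

```python
import subprocess
from pathlib import Path


def correction_command(wordlist_dir, actual_class, autoupdated=False):
    """Build the bogofilter argv that registers one misclassified message."""
    if actual_class not in ("spam", "ham"):
        raise ValueError("actual_class must be 'spam' or 'ham'")
    if autoupdated:
        # With -u the message was already registered under the wrong class,
        # so unregister it there and register it under the right one.
        flag = "-Ns" if actual_class == "spam" else "-Sn"
    else:
        flag = "-s" if actual_class == "spam" else "-n"
    return ["bogofilter", "-d", str(wordlist_dir), flag]


def train_folder(folder, wordlist_dir, actual_class, autoupdated=False,
                 run=subprocess.run):
    """Feed every file in `folder` to bogofilter on stdin; return the count."""
    cmd = correction_command(wordlist_dir, actual_class, autoupdated)
    count = 0
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            run(cmd, input=path.read_bytes(), check=True)
            count += 1
    return count
```

A user's "missed spam" folder would be trained with actual_class="spam" 
and a "false positives" folder with actual_class="ham".  The run 
parameter only exists so the function can be exercised without touching 
a real wordlist.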

>> but in that case, I would recommend training a global wordlist on the 
>> input from a honeypot (known spam) and that of your SMTP server (known 
>> ham).  You could probably even manage to use per-user wordlists 
>> trained with their own outgoing mail as ham and the honeypot as spam.  
>> It may not be quite as good as training their own actual errors, but 
>> I'd wager it'd be much better than Spam Assassin, and it would be 
>> completely automated.
> 
> can you detail this please?

I've not done it myself, but if I wanted a completely automated 
Bogofilter, I'd first create a spam honeypot address (or spamtrap as 
others may call it) and post it around the web a bit on blogs or 
websites I control or I know won't send valid traffic.  Then I'd 
register any email that arrives at this address as spam.  I know I've 
had users leave and later end up with a full quota of spam when I forgot 
to delete their account, so maybe I'd use something like that as a 
honeypot if I had it and knew it wasn't receiving any legitimate email 
at all.
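
As a rough sketch of the honeypot half, assuming the trap address is 
delivered to its own mbox file and nothing legitimate ever lands there, 
something like the following could be run from cron.  The mbox path, the 
wordlist directory and the function name are all hypothetical, and the 
locking only helps if the MTA uses a compatible locking scheme.  Each 
message is registered with bogofilter -s and then removed so it is not 
counted twice on the next run.

```python
import mailbox
import subprocess


def drain_honeypot(mbox_path, wordlist_dir, run=subprocess.run):
    """Register every message in the honeypot mbox as spam, then empty it."""
    # create=False: a mistyped path should fail loudly, not make a new file.
    box = mailbox.mbox(mbox_path, create=False)
    cmd = ["bogofilter", "-d", str(wordlist_dir), "-s"]
    registered = 0
    box.lock()
    try:
        for key in list(box.keys()):
            # get_bytes() returns the message without the mbox "From " line.
            run(cmd, input=box.get_bytes(key), check=True)
            box.remove(key)
            registered += 1
    finally:
        # close() flushes the removals to disk and releases the lock.
        box.close()
    return registered
```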

Capturing outgoing hams would be a little more difficult.  You'd have to 
set up a script that runs in Sendmail or samples the mail spool, 
registering all of those messages as hams.  I would do some research on 
Sendmail to see if direct integration would be possible.  I don't know 
if there's any milter functionality for outgoing mail.
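
However the outgoing copy gets captured, the registration step itself 
could look roughly like this.  It is only a sketch: it assumes the 
script receives the raw message bytes, that per-user wordlists live in 
one directory per local user under a common root, and that those 
directories already exist.  The root path, the domain and the function 
names are hypothetical.  It also trusts the From header, which is only 
reasonable on a path that carries genuinely local, outgoing mail.

```python
import re
import subprocess
from email import message_from_bytes
from email.utils import parseaddr
from pathlib import Path

WORDLIST_ROOT = Path("/var/lib/bogofilter")  # hypothetical layout
LOCAL_DOMAIN = "example.org"                 # hypothetical domain


def wordlist_dir_for_sender(raw, root=WORDLIST_ROOT, domain=LOCAL_DOMAIN):
    """Return the per-user wordlist directory for a local sender, else None."""
    msg = message_from_bytes(raw)
    _, addr = parseaddr(str(msg.get("From", "")))
    local, _, host = addr.lower().partition("@")
    # Only accept plain local parts so the result cannot escape `root`.
    if host != domain or not re.fullmatch(r"[a-z0-9][a-z0-9._-]*", local):
        return None
    return root / local


def register_outgoing_ham(raw, root=WORDLIST_ROOT, domain=LOCAL_DOMAIN,
                          run=subprocess.run):
    """Register one outgoing message as ham in its sender's wordlist."""
    target = wordlist_dir_for_sender(raw, root=root, domain=domain)
    if target is None:
        return None
    run(["bogofilter", "-d", str(target), "-n"], input=raw, check=True)
    return target
```

For a single global wordlist, the sender lookup can be dropped and every 
captured message registered with -n against one directory.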

If you intend to go that route, I'd love to hear about successes or 
problems.  Remember, this won't be nearly as accurate as training on 
error by the recipients.  But it should be better than procedural 
filters.

Tom