Bogofilter with a specified wordlist.db

Thu Apr 6 22:56:25 CEST 2006

Tom Anderson wrote:
> mouss wrote:
>   
>> sure but bayesian filters require training. so
>> - their accuracy is poor at start.
>> - for users who don't retrain the filter, accuracy may never be 
>> satisfactory. (using a "global" wordlist may help, but not if these 
>> users receive mail that is different from the one used to train the 
>> global db).
>>     
>
> Accuracy has generally been above 95% within 48 hours and a few dozen 
> messages when training on error from scratch.  If starting with a corpus 
> of previous messages, it's much faster.
>
> If you don't want to train the filter, don't use Bogofilter or any other 
> statistical filter.  You won't be happy with the result, even if you 
> pair it with a procedural filter.
>   

do you have arguments here or is this pure speculation?
>   
>> Chris's idea is to "shoulder" (or boost?) bogo using SA. I would love to 
>> see the results of this. (I find this better than using public corpora).
>>
>> for example, when you install bogo for the first time, you use SA too. 
>> if SA score is "sure" (<0 or >10 for instance), then train bogofilter 
>> with this email. There is still a risk of error (FN or FP) of course, 
>> but for users who don't retrain bogofilter, this is better than nothing.
>>
>> once the user's wordlist is "mature", SA can be skipped for that user.
>>     
>
> If you don't want to train, Bogofilter should be skipped altogether... 
> just use Spam Assassin exclusively.  The idea of "boosting" Bogofilter 
> with Spam Assassin is like hand-holding some grandma through her first 
> time creating a Word document and then walking away and letting her 
> loose on a Sendmail config.
>
>   
ahem? when you train bogofilter, you are "boosting" it with a human 
filter (and human filters aren't perfect). the human filter may be 
better than an automated one, but it is also more expensive (if the 
machine can work...) and doesn't require "education"...
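
to make the "boosting" idea concrete, here is a rough sketch of the decision step only. the thresholds are the "<0 or >10" examples from my earlier message; the function names are made up for illustration, and how you obtain the SA score (header parsing, spamc, ...) is left out. the -s, -n and -d options are bogofilter's registration and wordlist-directory options as I understand them, so check the man page for your version before relying on this.

```python
from typing import List, Optional


def training_label(sa_score: float,
                   ham_max: float = 0.0,
                   spam_min: float = 10.0) -> Optional[str]:
    """Return "ham" or "spam" only when the SA score is "sure".

    Scores between the two thresholds (inclusive) are treated as
    uncertain, and the message is not used for training at all.
    """
    if sa_score < ham_max:
        return "ham"
    if sa_score > spam_min:
        return "spam"
    return None


def bogofilter_train_args(label: str, wordlist_dir: str) -> List[str]:
    """Build the bogofilter command line for registering one message.

    Assumes -s registers spam, -n registers ham, and -d selects the
    wordlist directory; the message itself would be fed on stdin.
    """
    flag = {"spam": "-s", "ham": "-n"}[label]
    return ["bogofilter", flag, "-d", wordlist_dir]
```

a message scoring 15 would be registered as spam, one scoring -1.2 as ham, and one scoring 4 would be left alone. once the user's wordlist is "mature", this step can simply be dropped.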

>>> I feel that adding Spam Assassin to the mix would only introduce false 
>>> positives, of which I currently receive zero.
>>>       
>> one can reduce this by using a conservative setup (disable or lower the 
>> score of rules that generate FPs).
>>     
>
> You can't reduce false positives below zero.  Just train the 4 errors 
> per week in Bogofilter and be done with it.  Spam Assassin doesn't bring 
> anything to the table unless you want complete and total automation and 
> don't mind receiving lots of FNs and discarding FPs.  And if that's what 
> you want, then Bogofilter isn't for you.  Bogofilter provides 
> near-complete automation and unrivalled accuracy, but you have to 
> provide feedback every few days to keep it on track.
>
> This can certainly be a little harder to achieve in a multiuser setup, 
>   
This is what we are talking about. if it's just for me, I have a large 
corpus of personal mail, both ham and spam and I know how to train. but 
the problem is with "the others":)

> but in that case, I would recommend training a global wordlist on the 
> input from a honeypot (known spam) and the outgoing mail of your SMTP 
> server (known ham).  You could probably even manage to use per-user 
> wordlists trained 
> with their own outgoing mail as ham and the honeypot as spam.  It may 
> not be quite as good as training their own actual errors, but I'd wager 
> it'd be much better than Spam Assassin, and it would be completely 
> automated.
>   
can you detail this please?
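
for reference, my reading of the suggestion is roughly the sketch below. this is an assumption on my part, not a description of your actual setup: honeypot messages are registered as spam and a user's outgoing messages as ham, into a per-user wordlist directory. the function names and any paths are hypothetical, each message is assumed to be stored as one file, and the -s, -n and -d options should be checked against the bogofilter man page.

```python
import subprocess
from typing import Callable, Iterable, List


def train_command(wordlist_dir: str, is_spam: bool) -> List[str]:
    """Command that registers one message (read from stdin) as spam or ham."""
    return ["bogofilter", "-s" if is_spam else "-n", "-d", str(wordlist_dir)]


def train_from_files(paths: Iterable[str],
                     wordlist_dir: str,
                     is_spam: bool,
                     run: Callable = subprocess.run) -> int:
    """Feed each message file to bogofilter and return how many were trained.

    `run` is injectable so the loop can be exercised without bogofilter
    installed; by default it is subprocess.run.
    """
    count = 0
    for path in paths:
        with open(path, "rb") as msg:
            # check=True raises if bogofilter exits with an error status
            run(train_command(wordlist_dir, is_spam), stdin=msg, check=True)
        count += 1
    return count
```

a nightly job could then call train_from_files() once with the honeypot files and is_spam=True, and once per user with that user's sent mail and is_spam=False. whether outgoing mail is representative enough of the ham a user receives is exactly what I would like to see tested.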