Bogofilter with a specified wordlist.db

Fri Apr 7 01:50:05 CEST 2006

Tom Anderson wrote:
> mouss wrote:
>   
>
> This is based on the theory of statistical filtering and on personal 
> experience. 
As of today, I am not aware of any mathematical proof that supports
Bayesian filtering. While it has been shown that the independence
conditions are not necessary for Bayes' formula, we are still left with
the fact that practice is the only "proof".

>  When my users throw out errors instead of training them, 
> their accuracy declines as a result.  For some users, this is OK.  They 
> let the accuracy decrease until they get annoyed by it and then they 
> train a bunch of their errors and bring the accuracy back up again. 
> Those who always train as a matter of course receive virtually no spam 
> as a result, and thus they don't need to train very much.  But if you 
> don't want to train at all, then your accuracy will slide until the 
> classifications are largely FNs or FPs.  If you have -u enabled, I 
> believe they will eventually be all one or the other.  Therefore, 
> eventually, there will be no filtering at all.  If you're pairing it 
> with a procedural filter for some reason, then only the procedural 
> filter will be filtering at this point.
>
>   
The idea is to first have bogofilter learn from SA (SpamAssassin) until
the user's db is large enough. After that, user feedback will be
required, but this should happen less often than if the user starts with
just bogofilter. Of course, I have no theoretical or practical proof of
this. It may be just wrong, and even if "theoretically true", it may be
the wrong approach.
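
A minimal sketch of that bootstrapping step, in Python. It assumes SA
has already tagged the message with its usual X-Spam-Status header, and
it trains bogofilter only when SA's score is far from its threshold. The
two score cutoffs, the 2000-message limit and the function names are
made up for illustration; -s, -n and -d are bogofilter's register-spam,
register-ham and wordlist-directory options.

```python
import email
import re
import subprocess

SPAM_MIN_SCORE = 10.0   # assumption: well above SA's usual 5.0 threshold
HAM_MAX_SCORE = 0.0     # assumption: only very clean mail counts as ham
BOOTSTRAP_LIMIT = 2000  # assumption: stop once this many messages are trained

_SCORE_RE = re.compile(r"score=(-?\d+(?:\.\d+)?)")


def sa_score(raw_message: bytes):
    """Return the score found in the X-Spam-Status header, or None."""
    msg = email.message_from_bytes(raw_message)
    status = str(msg.get("X-Spam-Status", ""))
    match = _SCORE_RE.search(status)
    return float(match.group(1)) if match else None


def bootstrap_args(raw_message: bytes, wordlist_dir: str, trained_so_far: int):
    """Return the bogofilter command to train on this message, or None to skip."""
    if trained_so_far >= BOOTSTRAP_LIMIT:
        return None  # db is large enough; rely on user feedback from here on
    score = sa_score(raw_message)
    if score is None:
        return None  # message was not tagged by SA
    if score >= SPAM_MIN_SCORE:
        return ["bogofilter", "-s", "-d", wordlist_dir]  # register as spam
    if score <= HAM_MAX_SCORE:
        return ["bogofilter", "-n", "-d", wordlist_dir]  # register as ham
    return None  # SA is not confident either way: do not train


def bootstrap(raw_message: bytes, wordlist_dir: str, trained_so_far: int) -> bool:
    """Feed the message to bogofilter on stdin if it qualifies for training."""
    args = bootstrap_args(raw_message, wordlist_dir, trained_so_far)
    if args is None:
        return False
    subprocess.run(args, input=raw_message, check=True)
    return True
```

Training only on SA's most confident verdicts is meant to limit how many
of SA's mistakes end up in the user's db. The X-Spam-* headers should
probably be stripped before training so that bogofilter does not simply
learn SA's own tags; that is not shown here.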
>
> The recipient is not a filter; a filter is something or someone who 
> parses the input before the recipient receives it.  I would never send 
> my email through a human filter and expect even remotely acceptable 
> results.  Training is done by the recipient, not a filter.  If you try 
> to train a statistical filter with a less accurate filter, you're just 
> compounding inaccuracies.  You're shooting yourself in the foot.
>   
The recipient _is_ a _filter_. This is what I meant by "human filter".
The recipient looks at the mail and classifies it, and this human
filter doesn't have perfect accuracy. Some people will classify mail
from lists they subscribed to as spam. Others will classify as spam a
mail they don't understand. Some people will believe hoaxes and
phishes, etc. Even "good" human classifiers get tired, or they may
review too quickly.

So from a theoretical standpoint, user feedback/classification is a
filter. I am not saying it could be replaced by an automaton. I am
just trying to see if it is feasible to improve the accuracy of the
filter by replacing "bad" human filters (people who don't train) with a
procedural filter. I still believe recipient feedback is the way to go
... when possible!

> I have my users train their own wordlists.  They are responsible for the 
> accuracy they receive by choosing to be vigilant in training errors or 
> not.  I use "bfproxy" to allow them to train via email (it's in your 
> Bogofilter contrib directory).  They simply drag their errors to a 
> folder and then occasionally forward the contents of that folder to 
> their bfproxy address (added to their address books for simplicity).  It 
> has worked fine for me.
>
>   
I agree that people should be responsible for their accuracy, and should
thus train the filter with their mail. But there are two cases that
cause me trouble here:
- New users. Waiting until the filter matures is not acceptable. Using a
reference corpus or a reference wordlist is feasible, but I am not
certain this is the way to go. What if their mail is "too different"?
(More on this below.)
- Lazy (to stay polite :) users. I mean people who either don't train or
train inconsistently (which may pollute their db).

(*) I personally have multiple accounts, used for different purposes.
While a lot of spam is common to all of them, some accounts "attract"
spam that other accounts don't get, and the ham is completely different
(for some addresses, a non-French message is almost certainly spam; for
others, a French mail is almost certainly spam, etc.). In this
particular case (I admit it is particular), global corpora don't seem to
be the right choice. But I may be wrong.

>>> but in that case, I would recommend training a global wordlist on the 
>>> input from a honeypot (known spam) and that of your SMTP server (known 
>>> ham).  You could probably even manage to use per-user wordlists 
>>> trained with their own outgoing mail as ham and the honeypot as spam.  
>>> It may not be quite as good as training their own actual errors, but 
>>> I'd wager it'd be much better than SpamAssassin, and it would be 
>>> completely automated.
>>>       
>> Can you detail this, please?
>>     
>
> I've not done it myself, but if I wanted a completely automated 
> Bogofilter, I'd first create a spam honeypot address (or spamtrap as 
> others may call it) and post it around the web a bit on blogs or 
> websites I control or I know won't send valid traffic.  Then I'd 
> register any email that arrives at this address as spam.  I know I've 
> had users leave and later end up with a full quota of spam when I forgot 
> to delete their account, so maybe I'd use something like that as a 
> honeypot if I had it and knew it wasn't receiving any legitimate email 
> at all.
>   
I have plenty of trapped spam. Ham is the problem.
> Capturing outgoing hams would be a little more difficult.  You'd have to 
> set up a script that runs in Sendmail or samples the mail spool, 
> registering all of those as hams.  I would do some research on Sendmail 
> to see if direct integration would be possible.  I don't know if there's 
> an outgoing milter functionality.
>   
I use Postfix with 587 as the submission port, so this is trivial to do.
What I fear now is "automated replies" and undetected viruses. Assuming
that these are marginal, can I just ignore them?
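
Rather than ignoring automated replies, they could be skipped before
training. A minimal sketch in Python, assuming the responder marks its
mail with the Auto-Submitted header (RFC 3834) or a Precedence of bulk
or junk. This is a heuristic and will miss responders that set neither
header. It does nothing about viruses, which would need a scanner in
front of it. The function names are hypothetical; -n and -d are
bogofilter's register-ham and wordlist-directory options.

```python
import email

# Assumption: these Precedence values indicate non-personal mail.
AUTOMATED_PRECEDENCE = {"bulk", "junk"}


def is_automated(raw_message: bytes) -> bool:
    """Guess whether a message was generated by a program, not a person."""
    msg = email.message_from_bytes(raw_message)
    # RFC 3834: any Auto-Submitted value other than "no" means automatic.
    auto = str(msg.get("Auto-Submitted", "no")).strip().lower()
    if auto != "no":
        return True
    precedence = str(msg.get("Precedence", "")).strip().lower()
    return precedence in AUTOMATED_PRECEDENCE


def ham_training_args(raw_message: bytes, wordlist_dir: str):
    """Return the bogofilter command to register outgoing mail as ham, or None."""
    if is_automated(raw_message):
        return None  # vacation notices and the like are not the user's ham
    return ["bogofilter", "-n", "-d", wordlist_dir]
```

The script fed from the submission port would pipe the message to the
returned command and do nothing when it gets None.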
> If you intend to go that route, I'd love to hear about success/problems.  
> Remember, this won't be nearly as accurate as training on error by the 
> recipients.  But it should be better than procedural filters.
>   
What would be the easiest way to set this up? I am thinking of 
duplicating an account (or more) so that the results may be compared.
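
For the comparison itself, a minimal sketch in Python, assuming each
duplicated account yields a list of (actual, predicted) pairs, where
"actual" comes from checking the mail by hand and "predicted" is the
filter's verdict. The function name and the sample data are made up for
illustration, and "unsure" verdicts would have to be dropped or counted
separately by the caller.

```python
from collections import Counter


def error_rates(results):
    """Compute FP and FN rates from (actual, predicted) pairs.

    Each element is a pair of "spam"/"ham" labels. A false positive is
    ham classified as spam; a false negative is spam classified as ham.
    """
    counts = Counter(results)
    ham_total = counts[("ham", "ham")] + counts[("ham", "spam")]
    spam_total = counts[("spam", "spam")] + counts[("spam", "ham")]
    return {
        "fp_rate": counts[("ham", "spam")] / ham_total if ham_total else 0.0,
        "fn_rate": counts[("spam", "ham")] / spam_total if spam_total else 0.0,
    }


# Made-up results for one account: 4 hams (1 misfiled), 4 spams (2 missed).
sample = (
    [("ham", "ham")] * 3
    + [("ham", "spam")]
    + [("spam", "spam")] * 2
    + [("spam", "ham")] * 2
)
rates = error_rates(sample)
```

Running this on the same mail stream for both accounts gives two pairs
of rates that can be compared directly.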