religion

Wed Jan 22 17:29:28 CET 2003

At 11:09 AM 1/22/03, Matt Armstrong wrote:

>Matthias Andree <matthias.andree at gmx.de> writes:
>
>I'm the guy who suggested -u to ESR (I may have sent him a patch
>too...) back when bogofilter had only the Graham method.

Hi Matt,

So it's _you_ we have to blame !!!!!  Actually, I should share that blame 
with you.  At the time of Eric's last release, the status of "update" was 
"too hard to implement, given the current architecture".  I reorganized the 
code so that classification and registration (the two main activities) 
shared a routine called collect_words.  With that change, it became easy to 
make "update" work.

So, I guess we share the guilt - you had the idea and I did the dirty work.

>The idea was that it would save labor -- it gives you the benefit of
>training on your entire mail corpus with no more manual labor
>than a simple train on error approach.  I never thought of it as an
>option without drawbacks -- you must train on all errors in a timely
>fashion -- but used properly it is more convenient for me.

I agree totally with this.  My unsures get put into a mailbox of their 
own.  I then split them into files labeled "unsure-good" and "unsure-spam" 
and a cronjob runs every hour and adds them to the proper wordlist.

>I went on using -u with Robinson-Fisher without realizing that it
>wasn't training on unsures.  I think -u is much less useful with an
>'unsure' state.

Respectfully, I disagree with you on this.  I find that "unsure" adds value.

>To rectify this, I think either:
>
>     (a) There should be an option to turn Robinson-Fisher into a
>         binary algorithm (i.e. treat 'unsure' as 'good').  Call it the
>         "benefit of the doubt" option.
>
>     (b) -u should update the 'good' list with 'unsures'
>
>I vote for both!  But if (a) happens, (b) comes for free.

The current (beta) version of bogofilter has two relevant parameters, 
named "spam_cutoff" and "ham_cutoff".  Any message with a score greater 
than or equal to spam_cutoff is spam.  If ham_cutoff is zero, all other 
messages are considered ham (and none are classified as unsure).  If 
ham_cutoff is non-zero, then it's the dividing line between ham and unsure.
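
In code, the rule reads roughly like the sketch below.  This is only an 
illustration in Python, not bogofilter's actual source: the function name 
classify() is made up, the cutoff values in the usage lines are arbitrary, 
and treating a score exactly equal to ham_cutoff as ham is my assumption 
about the boundary case.

```python
def classify(score, spam_cutoff, ham_cutoff=0.0):
    """Map a message score to 'spam', 'ham' or 'unsure'.

    Hypothetical helper that mirrors the rule described above;
    it is not taken from the bogofilter code base.
    """
    # A score at or above spam_cutoff is always spam.
    if score >= spam_cutoff:
        return "spam"
    # ham_cutoff == 0 turns the "unsure" state off:
    # everything that is not spam is ham.
    if ham_cutoff == 0:
        return "ham"
    # Otherwise ham_cutoff separates ham from unsure.
    # Assumption: a score exactly at ham_cutoff counts as ham.
    return "ham" if score <= ham_cutoff else "unsure"


# Example cutoffs (0.95 and 0.10) are arbitrary illustration values.
print(classify(0.50, spam_cutoff=0.95, ham_cutoff=0.10))  # three-state mode
print(classify(0.50, spam_cutoff=0.95, ham_cutoff=0))     # two-state mode
```

With a non-zero ham_cutoff the middle-scoring message lands in "unsure"; 
with ham_cutoff=0 the same message is simply ham.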

>I vote for (a) because I don't much care if the filter is unsure or
>not, I just want to know if the message *is* SPAM.  If it is, put it
>in the SPAM folder.  If it isn't, filter it to wherever else it'll go.
>If there is a false negative, I will catch it and retrain bogofilter
>regardless of whether it was 'unsure' or 'good'.

So, if you set ham_cutoff=0 in your config file, you will have what you want.

>I vote for (b) as well.  If -u updates an 'unsure' as 'good' and the
>message really is good, I have no additional work to do.  If -u
>updates an 'unsure' as 'good' and the message is SPAM, I'll notice
>eventually and retrain.