religion
David Relson
relson at osagesoftware.com
Wed Jan 22 17:29:28 CET 2003
At 11:09 AM 1/22/03, Matt Armstrong wrote:
>Matthias Andree <matthias.andree at gmx.de> writes:
>
>I'm the guy who suggested -u to ESR (I may have sent him a patch
>too...) back when bogofilter had only the Graham method.
Hi Matt,
So it's _you_ we have to blame !!!!! Actually, I should share that blame
with you. At the time of Eric's last release, the status of "update" was
"too hard to implement, given the current architecture". I reorganized the
code so that classification and registration (the two main activities)
shared a routine called collect_words. With that change, it became easy to
make "update" work.
So, I guess we share the guilt - you had the idea and I did the dirty work.
>The idea was that it would save labor -- it gives you the benefit of
>training on your entire mail corpus with no more manual work
>than a simple train on error approach. I never thought of it as an
>option without drawbacks -- you must train on all errors in a timely
>fashion -- but used properly it is more convenient for me.
I agree totally with this. My unsures get put into a mailbox of their
own. I then split them into files labeled "unsure-good" and "unsure-spam"
and a cronjob runs every hour and adds them to the proper wordlist.
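For anyone wanting to try the same arrangement, here is a rough sketch of what the hourly job could do. It is only an illustration, not my actual script. It assumes bogofilter is on the PATH, that "-s" registers input as spam and "-n" registers it as non-spam, and that each file is fed on standard input. The helper names are made up, and how a multi-message mailbox file is split depends on the bogofilter version in use.

```python
import subprocess

# Registration flags: -s adds the input to the spam wordlist,
# -n adds it to the non-spam (good) wordlist.
FLAGS = {"spam": "-s", "good": "-n"}


def registration_command(label):
    """Build the bogofilter command line for a label ("spam" or "good")."""
    return ["bogofilter", FLAGS[label]]


def register_file(path, label, run=subprocess.run):
    """Feed one sorted file (e.g. "unsure-spam") to bogofilter on stdin.

    `run` is injectable so the function can be exercised without
    bogofilter installed; by default it is subprocess.run.
    """
    with open(path, "rb") as mailbox:
        return run(registration_command(label), stdin=mailbox, check=True)
```

A cron entry would then call register_file("unsure-good", "good") and register_file("unsure-spam", "spam") and empty the files once both calls succeed.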
>I went on using -u with Robinson-Fisher without realizing that it
>wasn't training on unsures. I think -u is much less useful with an
>'unsure' state.
Respectfully, I disagree with you on this. I find that "unsure" adds value.
>To rectify this, I think either:
>
> (a) There should be an option to turn Robinson-Fisher into a
> binary algorithm (i.e. treat 'unsure' as 'good'). Call it the
> "benefit of the doubt" option.
>
> (b) -u should update the 'good' list with 'unsures'
>
>I vote for both! But if (a) happens, (b) comes for free.
The current beta version of bogofilter has two relevant parameters,
"spam_cutoff" and "ham_cutoff". Any message with a score greater than or
equal to spam_cutoff is spam. If ham_cutoff is zero, all other messages
are considered ham (and none are classified as unsure). If ham_cutoff is
non-zero, then it's the dividing line between ham and unsure.
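The rule is perhaps easier to see as code. This is a minimal sketch of the decision as described, not the actual bogofilter source. The function name is hypothetical, the cutoff values used in the example are illustrative only, and treating a score exactly equal to ham_cutoff as ham is an assumption, since only the spam boundary is spelled out above.

```python
def classify(score, spam_cutoff, ham_cutoff=0.0):
    """Return "spam", "ham" or "unsure" for a message score.

    Assumption: a score exactly equal to ham_cutoff counts as ham.
    """
    if score >= spam_cutoff:
        return "spam"
    if ham_cutoff == 0:
        # Binary mode: everything that is not spam is ham.
        return "ham"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"
```

With illustrative cutoffs of 0.95 and 0.10, a score of 0.50 comes back "unsure". With ham_cutoff left at zero, the same score comes back "ham".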
>I vote for (a) because I don't much care if the filter is unsure or
>not, I just want to know if the message *is* SPAM. If it is, put it
>in the SPAM folder. If it isn't, filter it to wherever else it'll go.
>If there is a false negative, I will catch it and retrain bogofilter
>regardless of whether it was 'unsure' or 'good'.
So, if you set ham_cutoff=0 in your config file, you will have what you want.
>I vote for (b) as well. If -u updates an 'unsure' as 'good' and the
>message really is good, I have no additional work to do. If -u
>updates an 'unsure' as 'good' and the message is SPAM, I'll notice
>eventually and retrain.