religion

Matt Armstrong matt at lickey.com
Wed Jan 22 17:09:56 CET 2003


Matthias Andree <matthias.andree at gmx.de> writes:

> Greg Louis <glouis at dynamicro.on.ca> writes:
>
>> No, but I don't grok your reasoning here.  Using -u does nothing to
>> help with unsures; -u trains on recognized spams and nonspams only,
>> which I believe has little training value.
>
> The idea behind that AFAIR was: if we know for sure what kind of
> mail it is, we may have some tokens pretty indicative and some
> tokens that are not strong about if it's ham or spam more likely --
> but we might want to register those unsure tokens nonetheless.
>
> I haven't done any research if "just train on unsures and
> misclassifications" or "train on every mail" is more effective.

I'm the guy who suggested -u to ESR (I may have sent him a patch
too...) back when bogofilter had only the Graham method.

The idea was that it would save labor -- it gives you the benefit of
training on your entire mail corpus with no more manual work than a
simple train-on-error approach.  I never thought of it as an option
without drawbacks -- you must train on all errors in a timely fashion
-- but used properly it is more convenient for me.

I went on using -u with Robinson-Fisher without realizing that it
wasn't training on unsures.  I think -u is much less useful with an
'unsure' state.

To rectify this, I think either:

    (a) There should be an option to turn Robinson-Fisher into a
        binary algorithm (i.e. treat 'unsure' as 'good').  Call it the
        "benefit of the doubt" option.

    (b) -u should update the 'good' list with 'unsures'

I vote for both!  But if (a) happens, (b) comes for free.
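
To illustrate the two proposals, here is a minimal sketch in Python
(bogofilter itself is written in C).  The cutoff values, the function
names, and the in-memory Counters standing in for the word lists are
all assumptions made for illustration; they are not bogofilter's
actual defaults or internals.

```python
from collections import Counter

# Hypothetical cutoffs, chosen only for illustration.
SPAM_CUTOFF = 0.95
HAM_CUTOFF = 0.10


def classify(score, benefit_of_doubt=False):
    """Map a spamicity score in [0, 1] to 'spam', 'good', or 'unsure'."""
    if score >= SPAM_CUTOFF:
        return "spam"
    if score <= HAM_CUTOFF or benefit_of_doubt:
        # Option (a): anything that is not spam is treated as good.
        return "good"
    return "unsure"


def auto_update(tokens, score, good_list, spam_list, benefit_of_doubt=False):
    """Register a message's tokens according to its own classification."""
    verdict = classify(score, benefit_of_doubt)
    if verdict == "spam":
        spam_list.update(tokens)
    elif verdict == "good":
        good_list.update(tokens)
    # An 'unsure' verdict registers nothing in either list.
    return verdict
```

In this sketch, a message scoring 0.5 is 'unsure' by default and
auto_update registers nothing.  With benefit_of_doubt=True the same
message comes back 'good' and its tokens land in the good list, which
is the sense in which (b) comes for free once (a) exists.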

I vote for (a) because I don't much care if the filter is unsure or
not, I just want to know if the message *is* SPAM.  If it is, put it
in the SPAM folder.  If it isn't, filter it to wherever else it'll go.
If there is a false negative, I will catch it and retrain bogofilter
regardless of whether it was 'unsure' or 'good'.

I vote for (b) as well.  If -u updates an 'unsure' as 'good' and the
message really is good, I have no additional work to do.  If -u
updates an 'unsure' as 'good' and the message is SPAM, I'll notice
eventually and retrain.
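
The "notice eventually and retrain" step amounts to moving the
message's token counts from one list to the other.  A minimal sketch
of that correction, with a hypothetical function name and plain
Python Counters standing in for the word lists (not bogofilter's
real storage or code):

```python
from collections import Counter


def retrain_as_spam(tokens, good_list, spam_list):
    """Undo a mistaken 'good' registration and register the message as spam."""
    counts = Counter(tokens)
    good_list.subtract(counts)
    # Drop entries whose count fell to zero or below.
    for token in [t for t, n in good_list.items() if n <= 0]:
        del good_list[token]
    spam_list.update(counts)
```

The cost of a wrong automatic update is one such correction per
misfiled message, which is why the errors need to be caught in a
timely fashion.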



