Accuracy is lacking
David Relson
relson at osagesoftware.com
Mon Feb 17 14:10:15 CET 2003
At 09:15 PM 2/16/03, Tracy R Reed wrote:
>On Thu, Feb 13, 2003 at 04:29:50PM -0500, David Relson spake thusly:
> > Am I right to guess that you're using the default configuration for
> > 0.10.1.5, i.e. the robinson algorithm? It might be valuable to switch to
> > Robinson-Fisher (via "-f" on the command line or "algorithm=fisher" in the
> > config file). RF tends to polarize the results with spam shifted towards
>
>Yes, I am using the default config. I have added the -f so we'll see what
>happens. If this is normal for the default config it seems that the
>default config isn't too useful for blocking spam. :)
Tracy,
Like any Bayesian spam filter, bogofilter depends on its training to build
its wordlists (knowledge base). It does the best it can, given what's
known to it. Over time, as it's trained with an increasing number of
messages, its accuracy will improve.
Additionally, in the six months or so that bogofilter has existed, the
algorithm has been undergoing development and refinement. Initially there
was the Graham algorithm, which used the 15 words that had the highest ham
or spam scores. The Robinson-GM (geometric mean) method followed as a
refinement that used all words of the message to generate the ham/spam
score. More recently, the Fisher algorithm has added a chi-square test
that provides even better discrimination. Of course, we don't just switch
the default algorithm whenever a new variation becomes available. There's
a period of testing and experimentation to verify that a change is indeed
better. Given sufficient evidence, the default algorithm gets
changed. The transition from Robinson-GM to Robinson-Fisher is such an
event and is planned for the 0.11 release at the end of this month.
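
To make the difference between the three combining schemes concrete, here
is a rough sketch in Python. It is not bogofilter's source code: the
function names are invented for illustration, the formulas follow the
published Graham and Robinson descriptions as I understand them, and the
input is assumed to be a list of per-word spam probabilities strictly
between 0 and 1.

```python
import math


def graham(probs, keep=15):
    """Graham-style combining: use only the `keep` most extreme word scores."""
    extreme = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:keep]
    spam = math.prod(extreme)
    ham = math.prod(1.0 - p for p in extreme)
    return spam / (spam + ham)


def robinson_gm(probs):
    """Robinson geometric-mean combining: every word contributes."""
    n = len(probs)
    # Geometric means are taken in log space to avoid underflow on long messages.
    p = 1.0 - math.exp(sum(math.log(1.0 - f) for f in probs) / n)
    q = 1.0 - math.exp(sum(math.log(f) for f in probs) / n)
    return (1.0 + (p - q) / (p + q)) / 2.0


def chi2q(x2, dof):
    """Chi-square survival probability; valid only for an even `dof`."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def robinson_fisher(probs):
    """Robinson-Fisher combining: chi-square test on the summed logs."""
    n = len(probs)
    h = chi2q(-2.0 * sum(math.log(f) for f in probs), 2 * n)
    s = chi2q(-2.0 * sum(math.log(1.0 - f) for f in probs), 2 * n)
    return (1.0 + h - s) / 2.0  # near 1 means spam, near 0 means ham


if __name__ == "__main__":
    mild = [0.7] * 20  # hypothetical message of mildly spammy words
    for combine in (graham, robinson_gm, robinson_fisher):
        print(combine.__name__, round(combine(mild), 4))
```

With a message made of mildly spammy words, the Fisher variant lands
further from 0.5 than the geometric-mean variant, which is the
"polarizing" effect mentioned in the quoted text above.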
The various versions of bogofilter have each been quite successful in their
ability to separate spam from ham. So far none of them matches _my_ human
ability to do the job. However, I'm happy to have 95% of incoming spam
identified. The remaining messages I can deal with.
I know you've been unhappy with bogofilter's ability to catch _your_
spam. Can you give us the metrics of your wordlists? The numbers will
indicate whether there's a reasonable amount of information with which to work.
Here are some useful commands for displaying wordlist metrics:

To display message counts:
    bogoutil -w $BOGOFILTER_DIR .MSG_COUNT

To display wordlist sizes:
    bogoutil -d $BOGOFILTER_DIR/spamlist.db | wc -l
    bogoutil -d $BOGOFILTER_DIR/goodlist.db | wc -l
David