Accuracy is lacking

David Relson relson at osagesoftware.com
Mon Feb 17 14:10:15 CET 2003


At 09:15 PM 2/16/03, Tracy R Reed wrote:
>On Thu, Feb 13, 2003 at 04:29:50PM -0500, David Relson spake thusly:
> > Am I right to guess that you're using the default configuration for
> > 0.10.1.5, i.e. the robinson algorithm?  It might be valuable to switch to
> > Robinson-Fisher (via "-f" on the command line or "algorithm=fisher" in the
> > config file).  RF tends to polarize the results with spam shifted towards
>
>Yes, I am using the default config. I have added the -f so we'll see what
>happens. If this is normal for the default config it seems that the
>default config isn't too useful for blocking spam. :)

Tracy,

Like any Bayesian spam filter, bogofilter depends on its training to build 
its wordlists (knowledge base).  It does the best it can, given what's 
known to it.  Over time, as it's trained with an increasing number of 
messages, its accuracy will improve.

Additionally, in the six months or so that bogofilter has existed, the 
algorithm has been undergoing development and refinement.  Initially there 
was the Graham algorithm, which used the 15 words that had the highest ham 
or spam scores.  The Robinson-GM (geometric mean) method followed as a 
refinement that used all words of the message to generate the ham/spam 
score.  More recently, the Fisher algorithm has added a chi-square test 
that provides even better discrimination.  Of course, we don't just switch 
the default algorithm whenever a new variation becomes available.  There's 
a period of testing and experimentation to verify that a change is indeed 
better.  Given sufficient evidence, the default algorithm gets 
changed.  The transition from Robinson-GM to Robinson-Fisher is such an 
event and is planned for the 0.11 release at the end of this month.
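
To make the difference between the three methods concrete, here is a rough 
sketch in Python.  It is not bogofilter's code.  It follows the published 
descriptions of the Graham and Robinson methods as I understand them, and it 
assumes each word already has a spam probability between 0 and 1.  The 
function names, the clamping limits, and the sample numbers are made up for 
illustration.

```python
import math


def clamp(p, lo=0.01, hi=0.99):
    # Assumed guard: keep probabilities away from 0 and 1 so log() is safe.
    return min(max(p, lo), hi)


def graham(probs, n_tokens=15):
    # Graham-style: use only the n_tokens words furthest from neutral (0.5).
    top = sorted((clamp(p) for p in probs),
                 key=lambda p: abs(p - 0.5), reverse=True)[:n_tokens]
    spam = math.prod(top)
    ham = math.prod(1.0 - p for p in top)
    return spam / (spam + ham)


def robinson_gm(probs):
    # Robinson geometric mean: every word contributes to the score.
    ps = [clamp(p) for p in probs]
    if not ps:
        return 0.5
    n = len(ps)
    spamminess = 1.0 - math.exp(sum(math.log(1.0 - p) for p in ps) / n)
    hamminess = 1.0 - math.exp(sum(math.log(p) for p in ps) / n)
    s = (spamminess - hamminess) / (spamminess + hamminess)
    return (1.0 + s) / 2.0


def chi2q(x2, dof):
    # Probability that a chi-square variable with an even number of degrees
    # of freedom (dof) is at least x2.
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def robinson_fisher(probs):
    # Fisher's method: combine the per-word probabilities with a chi-square
    # test, once for the spam evidence and once for the ham evidence.
    ps = [clamp(p) for p in probs]
    if not ps:
        return 0.5
    n = len(ps)
    h = chi2q(-2.0 * sum(math.log(p) for p in ps), 2 * n)
    s = chi2q(-2.0 * sum(math.log(1.0 - p) for p in ps), 2 * n)
    return (1.0 + h - s) / 2.0


if __name__ == "__main__":
    spammy = [0.99, 0.95, 0.9, 0.8, 0.6]  # made-up word probabilities
    for score in (graham, robinson_gm, robinson_fisher):
        print(score.__name__, round(score(spammy), 4))
```

With the same spammy-looking words, the Fisher score in this sketch lands 
closer to 1 than the geometric mean score does, which is the polarizing 
effect mentioned earlier in the thread.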

The various versions of bogofilter have each been quite successful in their 
ability to separate spam from ham.  So far none of them matches _my_ human 
ability to do the job.  However, I'm happy to have 95% of incoming spam 
identified.  The remaining messages I can deal with.

I know you've been unhappy with bogofilter's ability to catch _your_ 
spam.  Can you give us the metrics of your wordlists?  The numbers will 
indicate whether there's a reasonable amount of information with which to work.

Here are some useful commands for displaying wordlist metrics:

To display message counts:  "bogoutil -w $BOGOFILTER_DIR .MSG_COUNT"

To display wordlist sizes:  "bogoutil -d $BOGOFILTER_DIR/spamlist.db | wc -l"
and "bogoutil -d $BOGOFILTER_DIR/goodlist.db | wc -l"


David




