accuracy [was: FAQ update]

Wed Feb 19 13:49:05 CET 2003

At 01:36 AM 2/19/03, Eric Hanchrow wrote:

>Thanks for taking the time to respond.
>
> >>>>> "David" == David Relson <relson at osagesoftware.com> writes:
>
>     David> For bogofilter to be effective, it needs to be trained on
>     David> what you consider spam and what you consider ham (good).
>
>I'm well aware of that -- I'd have thought that the 7,000 spams and
>10,000 hams on which I trained it would have been enough ...
>
>     David> The best way I know to handle it is to appreciate the 85%
>     David> that is caught and use the other 15% to train bogofilter so
>     David> that next week (or month) it can catch even more.
>
>Yeah, I've been doing that ... seriously, if an 85% snag rate is to be
>expected, then that's OK.  But everything I'd heard led me to believe
>that 99% was typical, and that therefore I'd done something wrong
>... only I could never figure out what.

Eric,

7,000 spams and 10,000 hams should be plenty.  I had no idea of your 
experience level or wordlist size, so I had to ask.

I recall only one report of 99% success.  As far as I know, 90-95% is more 
typical.

My guess is that you receive a wide variety of spam, with a vocabulary so 
large and varied that bogofilter can't recognize it as spam.  Do you have 
any sense that this might be the case?

Have you looked at the false negatives using "-vv" or "-vvv"?  "-vv" gives 
you a histogram showing how many tokens fell in each range of 0.10, and 
"-vvv" gives you the Rtable, which shows the tokens parsed from the 
message along with their total word count (good+bad), their good and bad 
probabilities, etc.  Looking at those outputs may help you understand 
why bogofilter isn't doing better.
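
To give a feel for the kind of histogram described above, here is a 
minimal Python sketch.  It is not bogofilter's code or its output format; 
it assumes each token carries a score between 0 and 1, and the function 
name, tokens and scores are all made up for illustration.

```python
def token_histogram(token_scores, n_bins=10):
    """Count how many tokens fall in each equal-width score range.

    token_scores: hypothetical mapping of token -> score in [0, 1].
    With n_bins=10, each range is 0.10 wide.
    """
    counts = [0] * n_bins
    for score in token_scores.values():
        # A score of exactly 1.0 goes into the last range.
        index = min(int(score * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


# Made-up scores for a message that slipped through as a false negative.
example_scores = {
    "meeting": 0.02,
    "the": 0.51,
    "free": 0.93,
    "click": 0.95,
    "viagra": 0.99,
}

for i, count in enumerate(token_histogram(example_scores)):
    low = i / 10
    print(f"{low:.2f}-{low + 0.10:.2f}  {count:3d}  {'#' * count}")
```

If most tokens of a missed spam pile up in the middle ranges rather than 
near 1.0, that would suggest the wordlist has not yet seen enough of that 
spam's vocabulary.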

David

P.S.  Please reply through the mailing list so that others may see and 
contribute to the conversation.