Accuracy is lacking

Sat Feb 15 13:57:11 CET 2003

At 01:13 AM 2/15/03, Nick Simicich wrote:
>At 08:30 AM 2003-02-14 -0500, David Relson wrote:
>
>>It'll be amusing to see what score bogofilter gives to _this_ message 
>>(given its mix of ham and spam subjects).
>
>X-Bogosity: No, tests=bogofilter, spamicity=0.128178, version=0.10.2
>
>My guess is that you are right, and that the numbers vary from person to 
>person. But one big change for me was when bodies were parsed, and I got 
>another shift when I repaired my databases - I believe that there were 
>things happening in the broken databases that I simply did not 
>understand.  Just as a side comment, I reorganized my databases again 
>today and the databases properly verified and properly dumped and restored.

Nick,

Right you are.  MIME parsing, in particular content decoding, has a 
significant effect on bogofilter's results.  Previously base64 and 
uuencoded text showed up as excessively long tokens, i.e. longer than 
MAXTOKENLENGTH (30), and were discarded.  Now that text is decoded and 
scored - a _big_ change.
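
To illustrate the effect described above, here is a minimal sketch in Python. 
It is not bogofilter's lexer: the tokens() helper and the sample body are 
hypothetical, and splitting on whitespace is a simplifying assumption.  Only 
the length limit of 30 comes from the message.

```python
import base64

MAXTOKENLENGTH = 30  # limit mentioned in the message above


def tokens(text, max_len=MAXTOKENLENGTH):
    """Hypothetical tokenizer: split on whitespace, drop overlong tokens."""
    return [t for t in text.split() if len(t) <= max_len]


# Made-up sample body, base64-encoded as it might appear in a MIME part.
body = "Buy cheap pills now, limited offer"
encoded = base64.b64encode(body.encode("ascii")).decode("ascii")

# Without decoding: the whole encoded run is one token longer than the
# limit, so it is discarded and contributes nothing to the score.
raw_tokens = tokens(encoded)

# With decoding: the original words become ordinary, scorable tokens.
decoded_tokens = tokens(base64.b64decode(encoded).decode("ascii"))

print(raw_tokens)
print(decoded_tokens)
```

In this sketch the undecoded body yields no tokens at all, while the decoded 
body yields every word, which is why decoding can shift scores so much.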

Fixing broken databases could also be a big factor.  Broken could just mean 
"can't look up tokens beginning with 'z'" or it could mean "can't find 'c' 
to 't'".  Either way, fixing it matters!

Glad to hear that verification/dump/restore worked properly.  I know 
Matthias put great effort into having locking work properly so that 
databases wouldn't go bad.  It's good to have confirmation of that.

David