extreme weirdness with RF & 0.10.0

David Relson relson at osagesoftware.com
Tue Jan 21 04:12:29 CET 2003


At 09:57 PM 1/20/03, Barry Gould wrote:
>I've just put in 0.10.0 using Robinson-Fisher.
>
>I sent my users a message regarding the new bogofilter, and CC'd myself.
>
>Strangely, it got tagged as Unsure with a score of 0.179478
>
>Even stranger is the output of bogofilter -vvv
>
>I get a score of 0.189874 when I run it back through bogofilter (note the 
>headers are different at this point due to the bogofilter tag and whatever 
>Eudora added... no problem).
>
>However, when looking for words that would have caused it to be Unsure, I 
>see many words with probabilities above 1.0!! Shouldn't this be impossible?!?
>
>Even ones with both probabilities above 1.0!!!
>
>example:
>                           n     pgood      pbad        fw  invfwlog     fwlog U
>"for"                 123254  3.213524  2.704072  0.456954  -0.61056  -0.78317 -
>"pennysaverusa.net"    85321  2.684474  0.121062  0.043151  -0.04411  -3.14305 +
>"from"                138213  3.374996  3.902206  0.536223  -0.76835  -0.62320 -

What are the values of .MSG_COUNT?  "bogofilter -w /path/to/wordlists 
.MSG_COUNT" will give the info.

Graham allows up to 4 points (repetitions) per token per message, so a 
token's count can easily exceed the number of messages in the word 
list.  Robinson uses a max of 1 (instead of 4).  I'll bet that "for", 
"pennysaverusa.net", and "from" are in _every_ message you have, often 
more than once.  I can generate a quick patch that'll make those big 
scores go away.  Are you up for building from source?
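
To illustrate the arithmetic, here is a minimal sketch in Python. It is not 
bogofilter's actual code: the function name token_ratio and the sample 
messages are made up, and it assumes the displayed probability is simply 
the token's capped count divided by the number of messages.

```python
from collections import Counter

def token_ratio(messages, token, max_repeats):
    """Capped count of `token` over all messages, divided by message count.

    Hypothetical helper: assumes the ratio is count / number of messages.
    """
    total = 0
    for msg in messages:
        occurrences = Counter(msg.split())[token]
        # Count at most `max_repeats` occurrences per message.
        total += min(occurrences, max_repeats)
    return total / len(messages)

# Three made-up messages; "for" is in every one, often more than once.
messages = [
    "for sale for you for free",
    "thanks for the note",
    "for what it is worth for now for ever for sure for real",
]

graham_style = token_ratio(messages, "for", max_repeats=4)    # cap of 4
robinson_style = token_ratio(messages, "for", max_repeats=1)  # cap of 1
print(graham_style, robinson_style)
```

With a cap of 4 the ratio climbs above 1.0 for a token that repeats in most 
messages; with a cap of 1 it can never exceed 1.0.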





