extreme weirdness with RF & 0.10.0
David Relson
relson at osagesoftware.com
Tue Jan 21 04:12:29 CET 2003
At 09:57 PM 1/20/03, Barry Gould wrote:
>I've just put in 0.10.0 using Robinson-Fisher.
>
>I sent my users a message regarding the new bogofilter, and CC'd myself.
>
>Strangely, it got tagged as Unsure with a score of 0.179478
>
>Even stranger is the output of bogofilter -vvv
>
>I get a score of 0.189874 when I run it back through bogofilter (note the
>headers are different at this point due to the bogofilter tag and whatever
>Eudora added... no problem).
>
>However, when looking for words that would have caused it to be Unsure, I
>see many words with probabilities above 1.0!! Shouldn't this be impossible?!?
>
>Even ones with both probabilities above 1.0!!!
>
>example:
>                      n       pgood     pbad      fw        invfwlog  fwlog     U
> "for"                123254  3.213524  2.704072  0.456954  -0.61056  -0.78317  -
> "pennysaverusa.net"  85321   2.684474  0.121062  0.043151  -0.04411  -3.14305  +
> "from"               138213  3.374996  3.902206  0.536223  -0.76835  -0.62320  -
What are the values of .MSG_COUNT? "bogofilter -w /path/to/wordlists
.MSG_COUNT" will give the info.
Graham allows up to 4 points (repetitions) per token per message, so a
token's count can easily exceed the number of messages in the word
list. Robinson uses a max of 1 (instead of 4). I'll bet that "for",
"pennysaverusa.net", and "from" are in _every_ message you have, often
more than once. I can generate a quick patch that'll make those big
scores go away. Are you up for building from source?
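
To illustrate the counting effect described above, here is a minimal Python sketch. It is not bogofilter's code: the function names, the sample messages, and the simple "token count divided by message count" ratio are assumptions made for illustration only. It shows how a per-message cap of 4 lets the ratio climb above 1.0, while a cap of 1 keeps it at or below 1.0.

```python
from collections import Counter


def token_counts(messages, cap):
    """Count each token at most `cap` times per message (hypothetical helper)."""
    totals = Counter()
    for tokens in messages:
        per_message = Counter(tokens)
        for token, n in per_message.items():
            totals[token] += min(n, cap)
    return totals


def token_ratio(messages, token, cap):
    """Token count divided by message count.

    An assumed stand-in for the pgood/pbad columns, not the exact formula
    bogofilter uses.
    """
    return token_counts(messages, cap)[token] / len(messages)


# Two made-up messages, already split into tokens.
messages = [
    ["from", "for", "for", "for", "hello"],
    ["from", "from", "for", "for", "for", "for", "for"],
]

# Cap of 4 per message: "for" counts 3 + 4 = 7 over 2 messages.
graham_style = token_ratio(messages, "for", cap=4)

# Cap of 1 per message: "for" counts 1 + 1 = 2 over 2 messages.
robinson_style = token_ratio(messages, "for", cap=1)

print(graham_style, robinson_style)
```

With these sample messages the cap-4 ratio for "for" is 3.5 and the cap-1 ratio is 1.0, so a wordlist whose counts were accumulated with the larger cap can produce "probabilities" above 1.0 when divided by the message count.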