classification error ???

Mark M. Hoffman mhoffman at lightlink.com
Fri Sep 13 22:30:04 CEST 2002


* David Relson <relson at osagesoftware.com> [2002-09-13 15:13:03 -0400]:
<snip>
> 
> 1.  The first word encountered has probability .8 and will go into the 
> stats.extrema array.
> 2.  The second word has a higher probability, e.g. .85.  It will replace 
> the first word.
> 3.  All the other words have probability less than .8.  14 of them will be 
> used to fill the array.
> 
> The problem is that the first word should _remain_ in the array, but doesn't.

I stand corrected; you're right.  What a nasty bug.  The list of extremes
should be kept sorted so that you always drop the least extreme one when you
insert a new one.  I'll code it up this weekend unless you beat me to it. :)
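
A rough sketch of that idea in Python (not bogofilter's actual C code).  It
keeps the extremes in a min-heap keyed on distance from 0.5, so the entry
dropped is always the least extreme one.  The names keep_extremes, tokens and
size, and the use of 0.5 as the neutral point, are assumptions made for
illustration only.

```python
import heapq

def keep_extremes(tokens, size=15):
    """Return the `size` (word, prob) pairs whose prob is farthest from 0.5.

    Hypothetical helper; `tokens` is an iterable of (word, prob) pairs.
    """
    heap = []  # min-heap, so heap[0] is always the least extreme entry kept
    for order, (word, prob) in enumerate(tokens):
        # -order breaks ties in favour of the token seen first
        entry = (abs(prob - 0.5), -order, word, prob)
        if len(heap) < size:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            # drop the least extreme entry, never an arbitrary one
            heapq.heapreplace(heap, entry)
    # most extreme first
    return [(word, prob) for _, _, word, prob in sorted(heap, reverse=True)]

# David's scenario: .8 first, then .85, then many less extreme words
tokens = [("first", 0.8), ("second", 0.85)] + [("w%d" % i, 0.7) for i in range(20)]
result = keep_extremes(tokens)
```

With this input the .8 word stays in the result alongside the .85 word, which
is the behaviour David says is missing.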

> I think I've spotted a second problem.  If you remember, the subject of the 
> spam message was "hello babe".  "babe" has a spam indication probability of 
> .888350.  In the final 15, are several words with probability of .879424, 
> but not "babe".

Could that be explained by the first problem?  "babe" was probably replaced
by a subsequent token with a more extreme probability, e.g. 0.99 or 0.01.
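
Here is one way that could happen, sketched in Python.  This is a guess at the
failure mode, not bogofilter's code: it assumes a new token overwrites the
first slot that is less extreme than it, rather than the least extreme slot.
The function buggy_insert, the 3-slot array and the tokens x, y and z are made
up for illustration.

```python
def buggy_insert(slots, word, prob):
    """Overwrite the FIRST slot less extreme than the new token (assumed bug)."""
    for i, (_, old) in enumerate(slots):
        if abs(prob - 0.5) > abs(old - 0.5):
            slots[i] = (word, prob)
            return

# Small array for illustration; empty slots start at the neutral value 0.5
slots = [(None, 0.5)] * 3
for word, prob in [("babe", 0.888350), ("x", 0.879424),
                   ("y", 0.879424), ("z", 0.99)]:
    buggy_insert(slots, word, prob)

# "z" lands on the first slot it beats, which holds "babe", even though
# "x" and "y" are less extreme and should have been dropped instead.
```

After the loop "babe" is gone while the two .879424 tokens remain, which
matches what David saw.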

Regards,

-- 
Mark M. Hoffman
mhoffman at lightlink.com


For summary digest subscription: bogofilter-digest-subscribe at aotto.com


