classification error ???
Mark M. Hoffman
mhoffman at lightlink.com
Fri Sep 13 20:05:34 CEST 2002
* David Relson <relson at osagesoftware.com> [2002-09-13 13:31:56 -0400]:
<snip>
> I've been reading the code, and I think the problem is much worse than
> that. Here's what I think function bogofilter() does:
>
> 1 - clear out the stats.extrema array
> 2 - loop over get_token for new words
>     - for each word:
>         get hamness and spamness counts
>         compute probability
>         see if word is in stats.extrema array
>         if not, add word to array
>         note: size of array is KEEPERS, i.e. 15
> 3 - compute probability from stats.extrema array
> 4 - return spam/nonspam status
>
> The problem is that NO determination is made about whether the word is
> interesting, i.e. a strong indicator of spamness or goodness. The loop
> simply puts the FIRST 15 words into the stats.extrema array.
As of 0.7, your (2) above is this...
for each word
    get counts
    compute prob
    compute dev(iation)
    for each entry in stats.extrema
        if entry != word
            if word's dev > this entry's dev and > that of any entry marked earlier
                mark that entry for replacement
    if an entry was marked
        put (replace) the word there
So it does replace lower-ranked words... right up until all KEEPERS entries
have the maximum deviation. At that point it may as well exit the
<for each word> loop.
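
The replacement step described above can be sketched roughly as follows. This is an illustrative Python sketch, not the bogofilter C source: the names select_extrema and deviation are hypothetical, deviation is assumed to mean distance from a neutral probability of 0.5, and the rule of overwriting the weakest entry the new word outranks is one reading of the pseudocode.

```python
KEEPERS = 15  # size of the extrema array, per the quoted message


def deviation(prob):
    """Assumed measure: distance of a probability from the neutral 0.5."""
    return abs(prob - 0.5)


def select_extrema(word_probs, keepers=KEEPERS):
    """Keep up to `keepers` words whose probabilities deviate most from 0.5.

    word_probs: iterable of (word, prob) pairs in message order.
    """
    # Empty slots start at the neutral probability, so any word with a
    # nonzero deviation outranks them.
    extrema = [(None, 0.5)] * keepers
    for word, prob in word_probs:
        dev = deviation(prob)
        marked = None      # index of the slot to overwrite, if any
        marked_dev = 1.0   # larger than any possible deviation (max 0.5)
        for i, (key, slot_prob) in enumerate(extrema):
            if key == word:
                marked = None  # word already present: do not add it twice
                break
            slot_dev = deviation(slot_prob)
            # Mark the weakest slot that the new word outranks.
            if dev > slot_dev and slot_dev < marked_dev:
                marked, marked_dev = i, slot_dev
        if marked is not None:
            extrema[marked] = (word, prob)
    # Drop slots that were never filled.
    return [(k, p) for k, p in extrema if k is not None]
```

Under these assumptions a later word with a larger deviation displaces an earlier, weaker one, so the array does not simply hold the first KEEPERS words seen.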
Regards,
--
Mark M. Hoffman
mhoffman at lightlink.com
For summary digest subscription: bogofilter-digest-subscribe at aotto.com