classification error ???

Mark M. Hoffman mhoffman at lightlink.com
Fri Sep 13 20:05:34 CEST 2002


* David Relson <relson at osagesoftware.com> [2002-09-13 13:31:56 -0400]:
<snip>

> I've been reading the code, and I think the problem is much worse than 
> that.  Here's what I think function bogofilter() does:
> 
> 1 - clear out the stats.extrema array
> 2 - loop over get_token for new words
>    - for each word:
>          get hamness and spamness counts
>          compute probability
>          see if word is in stats.extrema array
>          if not, add word to array
>          note: size of array is KEEPERS, i.e. 15
> 3 - compute probability from stats.extrema array
> 4 - return spam/nonspam status
> 
> The problem is that NO determination is made about whether the word is 
> interesting, i.e. a strong indicator of spamness or goodness.  The loop 
> simply puts the FIRST 15 words into the stats.extrema array.

As of 0.7, your (2) above is this...

for each word
    get counts
    compute prob
    compute dev(iation)
    for each entry in stats.extrema
        if entry != word
            if word's dev is > this entry's dev and > that of the one marked earlier
                mark that entry for replacement
    if an entry was marked
        put (replace) the word there

So it does replace lower-ranked words, right up until all KEEPERS entries
have the maximum deviation.  At that point it may as well exit the
<for each word> loop.
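
To make that concrete, here is a rough C sketch of the replacement step and
the early-exit test.  It is not the actual 0.7 source: slot_t, MAXWORDLEN,
EVEN_ODDS, extrema_clear(), extrema_offer() and extrema_saturated() are names
I made up for illustration, and I am assuming "deviation" means distance from
0.5.  The sketch also picks the weakest slot as the one to replace, which is
one reasonable reading of the pseudocode rather than a transcription of it.

```c
#include <assert.h>
#include <math.h>
#include <string.h>

#define KEEPERS    15     /* size of the extrema array, as in the thread */
#define MAXWORDLEN 32     /* assumed key length limit (hypothetical) */
#define EVEN_ODDS  0.5    /* assumed neutral probability (hypothetical) */

typedef struct {
    char   key[MAXWORDLEN + 1];
    double prob;
} slot_t;

/* Assumed definition: how far a probability is from neutral. */
static double deviation(double prob)
{
    return fabs(prob - EVEN_ODDS);
}

/* Step 1: reset every slot to an empty key with zero deviation. */
void extrema_clear(slot_t *extrema)
{
    for (int i = 0; i < KEEPERS; i++) {
        extrema[i].key[0] = '\0';
        extrema[i].prob = EVEN_ODDS;
    }
}

/* Offer one word.  Returns 1 if it was stored, 0 if it was a duplicate
 * or not more extreme than any slot already held. */
int extrema_offer(slot_t *extrema, const char *word, double prob)
{
    assert(word != NULL && word[0] != '\0');

    double  dev    = deviation(prob);
    slot_t *hit    = NULL;   /* slot marked for replacement, if any */
    double  hitdev = 0.0;    /* deviation of the marked slot */

    for (int i = 0; i < KEEPERS; i++) {
        slot_t *pp = &extrema[i];
        if (strncmp(pp->key, word, MAXWORDLEN) == 0)
            return 0;        /* word is already in the array */
        double slotdev = deviation(pp->prob);
        /* Mark this slot if the word beats it and it is weaker than
         * the slot marked so far. */
        if (dev > slotdev && (hit == NULL || slotdev < hitdev)) {
            hit = pp;
            hitdev = slotdev;
        }
    }
    if (hit == NULL)
        return 0;
    strncpy(hit->key, word, MAXWORDLEN);
    hit->key[MAXWORDLEN] = '\0';
    hit->prob = prob;
    return 1;
}

/* Early-exit test: nonzero once every slot has reached max_dev, so no
 * later word can displace anything.  max_dev is whatever the largest
 * deviation the probability computation can produce happens to be. */
int extrema_saturated(const slot_t *extrema, double max_dev)
{
    for (int i = 0; i < KEEPERS; i++)
        if (deviation(extrema[i].prob) < max_dev)
            return 0;
    return 1;
}
```

The token loop would call extrema_offer() for each word and could break out
as soon as extrema_saturated() returns nonzero.  Whether the extra scan per
word pays for itself depends on how often a message actually saturates the
array, which I have not measured.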

Regards,

-- 
Mark M. Hoffman
mhoffman at lightlink.com

