better Bayesian bogofilter

Wed Aug 13 14:43:44 CEST 2003

At 08:00 AM 8/13/03, Boris 'pi' Piwinger wrote:
>Greg Louis wrote:
>
> >> What I'm more interested in knowing is exactly _how_ you plan to keep track
> >> of the ham/spam ratio.  One thought that crosses my mind is having a
> >> ".SCORE" token rather like .MSG_COUNT.  If I understand your article,
> >> .SCORE needs to be updated for each ham and each spam scored.
> >
> > That was my first idea, yes.
>
>What is wrong with .MSG_COUNT, as long as you make sure you
>don't divide by zero?

Bayes' theorem calls for the real ratio of ham to spam in incoming
messages.  .MSG_COUNT gives only the training counts.  The two ratios
differ unless training is done with _all_ incoming messages.
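
To make the difference concrete, here is a rough Python sketch.  It is not
bogofilter's code and not the equation from Greg's article, just plain
Bayes' theorem for a single piece of evidence.  The function names, the
message counts and the two likelihoods are all made up for illustration.

```python
def spam_prior(n_spam, n_ham):
    """Fraction of counted messages that are spam.

    Falls back to 0.5 when nothing has been counted, so we never
    divide by zero.
    """
    total = n_spam + n_ham
    if total == 0:
        return 0.5
    return n_spam / total


def posterior_spam(p_evidence_given_spam, p_evidence_given_ham, prior_spam):
    """Bayes' theorem: P(spam | evidence) for a given prior P(spam)."""
    numerator = p_evidence_given_spam * prior_spam
    denominator = numerator + p_evidence_given_ham * (1.0 - prior_spam)
    if denominator == 0:
        return prior_spam  # evidence says nothing; keep the prior
    return numerator / denominator


# Hypothetical counts: what actually arrived vs. what was trained.
# With train on error only the misclassified messages get trained,
# so the training counts need not mirror the incoming stream.
incoming_spam, incoming_ham = 600, 400
trained_spam, trained_ham = 30, 10

actual_prior = spam_prior(incoming_spam, incoming_ham)    # 0.6
training_prior = spam_prior(trained_spam, trained_ham)    # 0.75

# Same (hypothetical) evidence, two different priors.
with_actual = posterior_spam(0.2, 0.1, actual_prior)      # about 0.75
with_training = posterior_spam(0.2, 0.1, training_prior)  # about 0.857
```

The evidence is identical in both calls; only the assumed ham/spam ratio
changes, and the result moves with it.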

> > What I intend to do first is implement
> > Eq. #5 with a single parameter that can be set manually
>
>Which would be almost impossible to use for train on error. So
>this part of the test would not work.

Remember that Greg wants to test the effect of accurate ratios vs 
inaccurate ratios.  If the test shows that accuracy doesn't matter, then 
there's no need to implement the feature.  If the test shows it _does_ 
matter, that's the time to figure out the best implementation.