better Bayesian bogofilter

Wed Aug 13 14:38:11 CEST 2003

At 07:47 AM 8/13/03, Greg Louis wrote:
>On 20030812 (Tue) at 1939:53 -0400, David Relson wrote:
>
> > _Any_ change that results in different scores being computed will "break"
> > many of the regression tests.  After all, their purpose is to raise a red
> > flag when bogofilter gets results different from those expected, i.e. when
> > bogofilter "regresses".
>
>I suspected as much.
>
> > What I'm more interested in knowing is exactly _how_ you plan to keep track
> > of the ham/spam ratio.  One thought that crosses my mind is having a
> > ".SCORE" token rather like .MSG_COUNT.  If I understand your article,
> > .SCORE needs to be updated for each ham and each spam scored.
>
>That was my first idea, yes.  What I intend to do first is implement
>Eq. #5 with a single parameter that can be set manually, in the same
>style as s and x and so on.  That will allow us to do some testing to
>see how current that ratio needs to be.  (Just keeping track of all
>messages that bogofilter classifies is fine for a while, but it will be
>slow to track changes like what we've seen this year: 30% spam in
>January, 60% in June.)  While that's going on we can devote some
>thought to how best to track the ratio without slowing bogofilter down
>too much.

KISS says that the manual parameter for testing is the right way to go.

If we're going to keep accurate info in the wordlist, it needs to be spam
and ham counts.  A ratio on its own is not maintainable, because it can't
be updated without knowing how many messages went into it.  The cost of
accuracy is, roughly: lock the database, write new counts for .SCORE,
unlock the database.  This doesn't seem excessive to me.  Of course, I use
auto-update, which costs more.
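
A minimal sketch of that lock/write/unlock cycle, in Python rather than
bogofilter's own C, and only as an illustration.  Everything named here is
an assumption: the JSON counts file stands in for the wordlist, and
record_message() and spam_ratio() are hypothetical helpers, not bogofilter
functions.  It stores the two counts and derives the ratio on demand, and it
relies on fcntl.flock(), so it is Unix-only.

```python
import fcntl
import json
import os


def record_message(path, is_spam):
    """Lock the counts file, bump the spam or ham count, then unlock."""
    # Hypothetical stand-in for the wordlist: a small JSON file of counts.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)      # lock database
        try:
            raw = f.read()
            counts = json.loads(raw) if raw else {"spam": 0, "ham": 0}
            counts["spam" if is_spam else "ham"] += 1
            f.seek(0)
            f.truncate()
            json.dump(counts, f)           # write new counts
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)  # unlock database
    return counts


def spam_ratio(counts):
    """Derive the spam fraction from the stored counts (0.0 if empty)."""
    total = counts["spam"] + counts["ham"]
    return counts["spam"] / total if total else 0.0
```

Because the counts are stored, each update is a plain increment and the
ratio never has to be reconstructed from an earlier ratio.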