better Bayesian bogofilter

Wed Aug 13 13:47:41 CEST 2003

On 20030812 (Tue) at 1939:53 -0400, David Relson wrote:

> _Any_ change that results in different scores being computed will "break" 
> many of the regression tests.  After all, their purpose is to raise a red 
> flag when bogofilter gets results different from those expected, i.e. when 
> bogofilter "regresses".

I suspected as much.

> What I'm more interested in knowing is exactly _how_ you plan to keep track 
> of the ham/spam ratio.  One thought that crosses my mind is having a 
> ".SCORE" token rather like .MSG_COUNT.  If I understand your article, 
> .SCORE needs to be updated for each ham and each spam scored.

That was my first idea, yes.  What I intend to do first is implement
Eq. #5 with a single parameter that can be set manually, in the same
style as s and x and so on.  That will allow us to do some testing to
see how current that ratio needs to be.  (Just keeping track of all
messages that bogofilter classifies is fine for a while, but it will be
slow to track changes like the one we've seen this year: 30% spam in
January, 60% in June.)  While that's going on we can devote some
thought to how best to track the ratio without slowing bogofilter down
too much.
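
To illustrate why a count over all classified messages lags, here is a
rough Python sketch comparing it with an exponentially decayed estimate
on a synthetic stream that shifts from 30% to 60% spam.  This is not
Eq. #5 and not bogofilter code: the function names, the alpha smoothing
parameter, and the synthetic stream are all assumptions made up for the
illustration.

```python
def cumulative_ratio(labels):
    """Spam fraction over every message seen so far (True means spam)."""
    spam = 0
    ratios = []
    for n, is_spam in enumerate(labels, start=1):
        spam += int(is_spam)
        ratios.append(spam / n)
    return ratios


def decayed_ratio(labels, alpha=0.01, start=0.5):
    """Exponentially weighted spam fraction.

    alpha is a hypothetical tuning knob: larger values forget old
    messages faster.  start is an assumed initial guess.
    """
    ratio = start
    ratios = []
    for is_spam in labels:
        # Move the estimate a small step toward the newest observation.
        ratio += alpha * (int(is_spam) - ratio)
        ratios.append(ratio)
    return ratios


def synthetic_stream(n, spam_per_ten):
    """Deterministic stream with spam_per_ten spams in every ten messages."""
    return [i % 10 < spam_per_ten for i in range(n)]


# 5000 messages at 30% spam followed by 5000 messages at 60% spam.
stream = synthetic_stream(5000, 3) + synthetic_stream(5000, 6)
cumulative = cumulative_ratio(stream)
decayed = decayed_ratio(stream)
print("cumulative estimate after shift: %.3f" % cumulative[-1])
print("decayed estimate after shift:    %.3f" % decayed[-1])
```

On this stream the cumulative estimate ends at 0.45, still well short of
the current 60%, while the decayed estimate ends close to 0.6.  Either
one needs very little stored state: two counts for the cumulative
version, a single number for the decayed one.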

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |