no filtering without ham

Sun Jan 18 14:16:23 CET 2009

On Sun, 18 Jan 2009 04:20:42 +0100
Matthias Andree wrote:

...[snip]...

> > - bogofilter could start working after it was trained with spam
> > emails only and get the ham stats after that, when users mark false
> > positives.
> 
> Conceivable (with a new command line option, we do not want to change
> default behaviour), but I haven't thought about the consequences yet.
> The implementation might just consider all tokens that bogofilter does
> not know "ham" - but we'll probably need to tune the default
> parameters in such a mode, and we might have to change the algorithm
> as well -- I never looked too closely at it. AFAIR from the cursory
> glances I had at the discussions, it's Bayesian estimation for each
> individual (or compound, if so configured) token with a χ²
> (chi-square) test to make up the overall spamicity.  I'm not sure this is
> all feasible or sensible.

With no training, tokens are scored at 0.52 (the 'robx' parameter of
the Robinson equations) and messages containing only such tokens will
have the same score and will be classified as 'unsure'.
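
To make the role of 'robx' concrete, here is a minimal sketch of the
Robinson token score f(w) = (s*x + n*p) / (s + n).  It is not
bogofilter's actual code: the function name and the ROBS value are
assumptions chosen for illustration, and only ROBX = 0.52 comes from
the paragraph above.

```python
ROBX = 0.52    # score given to a token with no training data (from the text)
ROBS = 0.0178  # assumed weight ("strength") given to ROBX; illustrative only


def token_score(spam_count, ham_count, spam_msgs, ham_msgs,
                robx=ROBX, robs=ROBS):
    """Robinson's f(w) for one token.

    spam_count / ham_count: messages of each kind that contained the token.
    spam_msgs / ham_msgs:   total messages of each kind seen in training.
    """
    n = spam_count + ham_count
    if n == 0:
        # No evidence at all: the formula collapses to robx.
        return robx
    spam_rate = spam_count / spam_msgs if spam_msgs else 0.0
    ham_rate = ham_count / ham_msgs if ham_msgs else 0.0
    p = spam_rate / (spam_rate + ham_rate)  # raw estimate from the counts
    return (robs * robx + n * p) / (robs + n)


untrained = token_score(0, 0, 0, 0)   # token never seen
spam_only = token_score(3, 0, 10, 0)  # seen in 3 of 10 spams, no ham trained
```

With no training every token gets the same neutral value, so nothing
separates one message from another.  With spam-only training, any token
that has been seen at all is pushed very close to 1.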

One _can_ let bogofilter score that way and no messages will be
'spam'.  Correcting the false negatives (spam not scored as spam) will
begin the process of training.  It will also result in false positives
(ham scored as spam).  Correcting the false positives will bring
scoring back into balance.
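
The correction loop described above can be sketched with a toy word
list.  TinyFilter, its methods, the sample words and the ROBS value are
all hypothetical; the sketch tracks only per-token scores, not
bogofilter's combined message score.

```python
from collections import Counter

ROBX, ROBS = 0.52, 0.0178  # ROBX from the text; ROBS is an assumed value


class TinyFilter:
    """Toy word list: per-token message counts for spam and ham."""

    def __init__(self):
        self.spam, self.ham = Counter(), Counter()
        self.spam_msgs = 0
        self.ham_msgs = 0

    def train(self, words, is_spam):
        # Count each token once per message.
        (self.spam if is_spam else self.ham).update(set(words))
        if is_spam:
            self.spam_msgs += 1
        else:
            self.ham_msgs += 1

    def score(self, word):
        s, h = self.spam[word], self.ham[word]
        n = s + h
        if n == 0:
            return ROBX  # unknown token
        spam_rate = s / self.spam_msgs if self.spam_msgs else 0.0
        ham_rate = h / self.ham_msgs if self.ham_msgs else 0.0
        p = spam_rate / (spam_rate + ham_rate)
        return (ROBS * ROBX + n * p) / (ROBS + n)


f = TinyFilter()
before = f.score("meeting")                        # untrained: ROBX

# User corrects a false negative: a spam is registered.
f.train(["cheap", "pills", "meeting"], is_spam=True)
after_spam = f.score("meeting")                    # now looks very spammy

# User corrects the resulting false positive: a ham is registered.
f.train(["meeting", "agenda"], is_spam=False)
after_ham = f.score("meeting")                     # back near neutral
```

An ordinary word that happened to appear in the first registered spam
looks spammy until a ham containing it is registered, which is why the
false positives show up and why correcting them restores the balance.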

This process works -- but requires user diligence because the
error rate (numbers of false positives and negatives) will be
significant.