Scoring Parameters - Old vs New

Fri Apr 2 01:52:59 CEST 2004

On Wed, 31 Mar 2004 14:50:32 +0100
Peter Bishop wrote:

> On 1 Apr 2004 at 7:43, David Relson wrote:
> 
> > For testing I used the 80,500 ham and 67,500 spam accumulated in the
> > 18 months I've been running bogofilter.  I was curious about the
> > effects of full training vs. a small training vs. a large training
> > set:
> > 
> >     small - trained with 10% and scored 90%
> >     large - trained with 90% and scored 10%
> >     full  - trained and scored all messages
> 
> I think there could be some bias in these tests. Really you should 
> test with *different* hams and spams to the ones you trained on.
> 
> Currently the overlap between test and training sets is 10% for the
> small test, 90% for the large test and 100% for the full test.

Peter,

Your assumption is incorrect.  The small test splits the corpora 10%/90%
and uses the 10% for training and the 90% for scoring.  The large test
splits it 90%/10%.  For both these tests, there is _no_ overlap between
training and scoring.  The full test uses the same messages for training
and scoring.  The purpose of all the tests is to see whether the three
parameter sets (old, osa, and new) generate comparable results.
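
A minimal sketch of the three setups, to make the overlap question
concrete.  It assumes each message can be referred to by an ID; the
helper names (split_corpus, overlap) and the 10-message corpus are
hypothetical and not part of bogofilter:

```python
import random

def split_corpus(messages, train_fraction, seed=0):
    """Shuffle a copy of the messages and cut it into two disjoint
    lists: (training part, scoring part)."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def overlap(train, score):
    """Number of messages that appear in both lists."""
    return len(set(train) & set(score))

# Hypothetical stand-in for the real ham or spam corpus.
corpus = [f"msg{i}" for i in range(10)]

small_train, small_score = split_corpus(corpus, 0.10)  # train 10%, score 90%
large_train, large_score = split_corpus(corpus, 0.90)  # train 90%, score 10%
full_train, full_score = corpus, corpus                # train and score all

print(overlap(small_train, small_score))  # disjoint by construction
print(overlap(large_train, large_score))  # disjoint by construction
print(overlap(full_train, full_score))    # every message is in both
```

Only the full setup scores messages that were also used for training.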

> If we assume that bogofilter scores better on emails it has already 
> been trained on, this might explain why you get much lower false 
> positive numbers with full training than 90% training,
> e.g. for "new" you get 
> 
> 3 fp with 90% training
> 0 fp with 100% training
> 
> This is a big difference for a relatively small increase in the
> training database.
> 
> But it is what you would expect if it is only the emails that *have 
> not* been used for training that are likely to appear in the false 
> positive scores.

The "full" test is an "after-the-fact" scoring, i.e. after
classification is known and the wordlist built accordingly.  The other
tests are "predictive", i.e. given some info on ham vs. spam, score
previously unseen messages and see how well bogofilter does with the
training database and the parameter sets.
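
For illustration only, counting false positives works the same way in
either mode; what differs is whether the ham being scored was also used
for training.  The function name, the toy scorer, and the 0.25 cutoff
below are hypothetical and are not bogofilter's interface or defaults:

```python
def count_false_positives(ham_messages, score, spam_cutoff):
    """Count known-ham messages whose score reaches the spam cutoff."""
    return sum(1 for msg in ham_messages if score(msg) >= spam_cutoff)

# Toy scorer (an assumption for this sketch): the fraction of words in
# the message that belong to a fixed set of "spammy" words.
SPAMMY = {"viagra", "winner"}

def toy_score(msg):
    words = msg.lower().split()
    if not words:
        return 0.0
    return sum(w in SPAMMY for w in words) / len(words)

# "Predictive" use: these ham messages were held out of training.
held_out_ham = ["meeting at noon", "you are the winner", "lunch tomorrow"]
fp = count_false_positives(held_out_ham, toy_score, spam_cutoff=0.25)
print(fp)
```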