randomtrain observation

Mon May 19 10:35:09 CEST 2003

I am not familiar with randomtrain,
but I assume that the databases start off empty.

I see that the 94 good messages are registered with 9202 tokens,
versus 3099 spams with 64195 tokens.
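
As a quick check of the arithmetic, here is a rough sketch that recomputes
the ratios from the figures quoted below. The variable names are mine, and
"tokens per message" here just means wordlist size divided by the number of
registered messages, which is only a proxy for how many new tokens each
registration adds.

```python
# Figures copied from the quoted message (message counts and wordlist sizes).
spam_total, spam_reg = 10525, 3099
good_total, good_reg = 24836, 94
spam_tokens, good_tokens = 64195, 9202

# Fraction of each corpus that ended up being registered (trained on).
spam_rate = spam_reg / spam_total
good_rate = good_reg / good_total

# Wordlist tokens per registered message (a proxy, not a true per-message count).
spam_tokens_per_msg = spam_tokens / spam_reg
good_tokens_per_msg = good_tokens / good_reg

print(f"spam: {spam_rate:.1%} registered, {spam_tokens_per_msg:.1f} tokens per registered message")
print(f"good: {good_rate:.1%} registered, {good_tokens_per_msg:.1f} tokens per registered message")
```

On these numbers the spam rate comes out near 29% and the good rate well
under 1%.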

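Maybe the difference arises because there is a built-in bias to believe
that an unknown word is good, since the robx parameter is < 0.5.
So emails with lots of unknown words are viewed as good.
Spams with lots of unknown words would be viewed as good too, and these
would then be reclassified and registered, whereas good messages would
presumably be classified correctly from the start and so rarely registered.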
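
To illustrate the bias I mean, here is a minimal sketch of a Robinson-style
token score, f(w) = (s*x + n*p(w)) / (s + n), with x = robx and s = robs.
It is a simplification and not bogofilter's actual code: p(w) is taken as a
plain count ratio without scaling by corpus size, and the robx and robs
values are illustrative assumptions, not necessarily the defaults.

```python
def robinson_f(spam_count, good_count, robx, robs):
    """Smoothed spamicity of one token (simplified Robinson-style estimate)."""
    n = spam_count + good_count          # times the token has been seen at all
    if n == 0:
        return robx                      # unknown token: score is exactly robx
    p = spam_count / n                   # raw spam ratio (no corpus-size scaling)
    return (robs * robx + n * p) / (robs + n)

ROBX = 0.4   # illustrative value below 0.5, not necessarily the real default
ROBS = 1.0   # illustrative weight given to the robx prior

unknown = robinson_f(0, 0, ROBX, ROBS)       # never seen before
spammy = robinson_f(5, 0, ROBX, ROBS)        # seen 5 times, always in spam

print(f"unknown token: {unknown:.2f}")
print(f"spam-only token: {spammy:.2f}")
```

With an empty database every token is unknown and scores robx, so with
robx < 0.5 each token leans towards good, whatever the message really is.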

On 18 May 2003 at 17:31, David Relson wrote:

> spam  reg   good reg
> 10525 3099  24836  94
> 
> I've also used bogoutil and wc to print the number of tokens in each
> wordlist:
> 
> spamlist 64195
> goodlist  9202
> 
> The training rate is approx 30% of spam and 0.4% of ham.  The wordcounts are
> approx 20 per spam message and 100 per ham message.
> 
> These numbers lead me to think that spam is much more varied in content than
> is ham, hence bogofilter needs many more spam tokens than ham tokens in
> order to classify messages correctly.
> 
> 'Tis interesting that ham appears so much easier to classify correctly...
> 

-- 
Peter Bishop 
pgb at adelard.com
pgb at csr.city.ac.uk