randomtrain observation
Peter Bishop
pgb at adelard.com
Mon May 19 10:35:09 CEST 2003
I am not familiar with randomtrain
but I assume that the databases start off being empty.
I see that the 94 good messages are registered with 9202 tokens,
versus 3099 spams with 64195 tokens.
Maybe the difference arises because there is a built-in bias to believe
that an unknown word is good, since the robx parameter is < 0.5.
So emails with lots of unknown words are viewed as good.
The same goes for spams with lots of unknown words: they are initially
viewed as good, and would then be reclassified and registered.
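
To illustrate the bias I mean, here is a minimal sketch of Robinson's
smoothing formula, f(w) = (s*x + n*p(w)) / (s + n). The function name and
the values robx=0.415 and robs=1.0 are assumptions chosen for illustration
only; they are not read from bogofilter's source or from any particular
configuration.

```python
def robinson_f(p_w, n, robx=0.415, robs=1.0):
    """Smoothed spamicity of one token (illustrative sketch).

    p_w  : raw spam probability estimated from the wordlists
    n    : number of times the token has been seen in training
    robx : score assumed for a token never seen before (assumed value)
    robs : weight given to robx relative to observed data (assumed value)
    """
    return (robs * robx + n * p_w) / (robs + n)


# A token with no training data (n = 0) scores exactly robx,
# whatever p_w would otherwise suggest.
unknown = robinson_f(p_w=0.9, n=0)

# The same token after 10 sightings is pulled towards its observed p_w.
seen = robinson_f(p_w=0.9, n=10)

print(f"unknown token: {unknown:.3f}, after 10 sightings: {seen:.3f}")
```

With robx below 0.5, every unknown token leans towards "good", so a message
made up mostly of unknown tokens leans the same way, spam or not.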
On 18 May 2003 at 17:31, David Relson wrote:
>    spam    reg    good    reg
>   10525   3099   24836     94
>
> I've also used bogoutil and wc to print the number of tokens in each
> wordlist:
>
> spamlist 64195
> goodlist 9202
>
> The training rate is approx 30% of spam and 0.4% of ham. The wordcounts are
> approx 20 per spam message and 100 per ham message.
>
> These numbers lead me to think that spam is much more varied in content than
> is ham, hence bogofilter needs many more spam tokens than ham tokens in
> order to classify messages correctly.
>
> 'Tis interesting that ham appears so much easier to classify correctly...
>
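
The rates and per-message token counts quoted above can be rechecked with a
few lines of arithmetic. This sketch assumes the four table columns mean
spam messages seen, spam messages registered, good messages seen, and good
messages registered; the variable names are mine.

```python
# Figures copied from the quoted message.
spam_seen, spam_registered = 10525, 3099
ham_seen, ham_registered = 24836, 94
spam_tokens, ham_tokens = 64195, 9202

# Fraction of each class that had to be registered during training.
spam_rate = spam_registered / spam_seen
ham_rate = ham_registered / ham_seen

# Average number of wordlist tokens per registered message.
spam_tokens_per_msg = spam_tokens / spam_registered
ham_tokens_per_msg = ham_tokens / ham_registered

print(f"training rate: spam {spam_rate:.1%}, ham {ham_rate:.1%}")
print(f"tokens per registered message: spam {spam_tokens_per_msg:.1f}, "
      f"ham {ham_tokens_per_msg:.1f}")
```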
--
Peter Bishop
pgb at adelard.com
pgb at csr.city.ac.uk