Random thought
Peter Bishop
pgb at adelard.com
Fri Aug 15 19:09:12 CEST 2003
The Better Bayesian discussion triggered the following random thought
(though it might be a really bad idea ...)
Would discrimination be improved if we somehow took
account of the *number of times* a token is used in a message?
e.g.
Free! Free!
is weighted more heavily than
Free!
One way of doing this is to apply the existing spamicity algorithm
to *every* word in the message rather than to unique tokens.
Not sure that is the whole story though; I suspect we would also need
to count every token occurrence in the training database rather than
just adding 1 when a token is seen in a message.
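A minimal sketch of the idea, using a Graham-style combining formula
(the token names, spamicity values, and function names below are all
hypothetical, not taken from bogofilter itself). Scoring over unique
tokens ignores repetition; scoring over every occurrence lets a
repeated token push the combined score further:

```python
# Sketch: compare scoring over unique tokens vs. every occurrence.
# SPAMICITY holds made-up per-token spam probabilities; unknown
# tokens get a neutral 0.5.
from math import prod

SPAMICITY = {"free": 0.95, "offer": 0.90, "meeting": 0.10}

def combined_score(tokens):
    """Combine per-token spamicities with Graham's formula:
    P = prod(p) / (prod(p) + prod(1 - p))."""
    probs = [SPAMICITY.get(t, 0.5) for t in tokens]
    p_spam = prod(probs)
    p_ham = prod(1 - p for p in probs)
    return p_spam / (p_spam + p_ham)

def score_unique(message):
    # current behaviour: each distinct token counted once
    return combined_score(set(message.lower().split()))

def score_all(message):
    # proposed behaviour: every occurrence contributes
    return combined_score(message.lower().split())
```

With these toy numbers, `score_all("free free offer")` exceeds
`score_all("free offer")`, while `score_unique` gives the same value
for both; conversely, padding a spammy message with many repetitions
of a hammy token drags `score_all` down, which is exactly the
vulnerability discussed below.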
The reason this might be a really bad idea is that it
could make the filter more vulnerable to a spam "padding ploy"
- a long sequence of a single hammy word could fool the filter into
believing the message really is ham.
--
Peter Bishop
pgb at adelard.com
pgb at csr.city.ac.uk
More information about the Bogofilter mailing list