Random thought
Peter Bishop
pgb at adelard.com
Fri Aug 15 19:09:12 CEST 2003
The Better Bayesian discussion triggered the following random thought
(though it might be a really bad idea ...)
Would discrimination be improved if we somehow took
account of the *number of times* a token is used in a message?
e.g.
Free! Free!
is weighted more heavily than
Free!
One way of doing this is to apply the existing spamicity algorithm
to *every* word in the message rather than to unique tokens.
Not sure that is the whole story though; I suspect we would also need
to count every token occurrence in the training database rather than
just adding 1 when a token is seen in a message.
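A minimal sketch of the idea, using a Graham-style combining formula
(the token names, spamicity values, and function names below are all
hypothetical, not taken from bogofilter itself). Scoring over unique
tokens ignores repetition; scoring over every occurrence lets a
repeated token push the combined score further:

```python
# Sketch: compare scoring over unique tokens vs. every occurrence.
# SPAMICITY holds made-up per-token spam probabilities; unknown
# tokens get a neutral 0.5.
from math import prod

SPAMICITY = {"free": 0.95, "offer": 0.90, "meeting": 0.10}

def combined_score(tokens):
    """Combine per-token spamicities with Graham's formula:
    P = prod(p) / (prod(p) + prod(1 - p))."""
    probs = [SPAMICITY.get(t, 0.5) for t in tokens]
    p_spam = prod(probs)
    p_ham = prod(1 - p for p in probs)
    return p_spam / (p_spam + p_ham)

def score_unique(message):
    # current behaviour: each distinct token counted once
    return combined_score(set(message.lower().split()))

def score_all(message):
    # proposed behaviour: every occurrence contributes
    return combined_score(message.lower().split())
```

With these toy numbers, `score_all("free free offer")` exceeds
`score_all("free offer")`, while `score_unique` gives the same value
for both; conversely, padding a spammy message with many repetitions
of a hammy token drags `score_all` down, which is exactly the
vulnerability discussed below.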
The reason this might be a really bad idea is that it
could make the filter more vulnerable to a spam "padding ploy"
- a long sequence of a single hammy word could fool the filter into
believing the message really is ham.
--
Peter Bishop
pgb at adelard.com
pgb at csr.city.ac.uk
More information about the Bogofilter mailing list