Random thought

David Relson relson at osagesoftware.com
Fri Aug 15 21:28:56 CEST 2003


Peter,

The early versions of bogofilter used the Graham algorithm with a max count 
of 4 per token.  When we switched to the Robinson algorithm, the max 
count per token was lowered to 1.

If I recall correctly, you just need to find and change the variable 
'max_repeats' to change the behavior.  If you decide to run an experiment, 
let us know how it goes :-)
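For illustration, here is a minimal sketch (in Python, not bogofilter's actual C code) of how a per-token repeat cap like 'max_repeats' behaves when tallying tokens; the function name and shape are assumptions for the sketch, not bogofilter internals:

```python
from collections import Counter

def count_tokens(tokens, max_repeats=1):
    """Tally token occurrences, capping each token's count at max_repeats.

    max_repeats=1 mimics the Robinson-era behavior (repeats ignored);
    max_repeats=4 mimics the old Graham-era cap.
    """
    counts = Counter()
    for tok in tokens:
        if counts[tok] < max_repeats:
            counts[tok] += 1
    return counts

# With the default cap of 1, "free" repeated twice still counts once:
# count_tokens(["free", "free", "offer"]) -> Counter({"free": 1, "offer": 1})
```

Raising the cap is what lets repetition ("Free! Free!") weigh more heavily than a single occurrence, which is essentially the experiment discussed below.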

David

At 01:09 PM 8/15/03, Peter Bishop wrote:
>The Better Bayesian discussion triggered the following random thought
>(though it might be a really bad idea ...)
>
>Would discrimination be improved if we somehow took
>account of the *number of times* a token is used in a message?
>
>e.g.
>
>Free! Free!
>
>is weighted more heavily than
>
>Free!
>
>One way of doing this is to apply the existing spamicity algorithm
>to *every* word in the message rather than just to unique tokens.
>
>Not sure that is the whole story though; I suspect we would also need
>to count every token occurrence in the training database rather than
>just adding 1 if a token is seen in a message.
>
>The reason this might be a really bad idea is that it
>could make the filter more vulnerable to a spam "padding ploy"
>- a long sequence of a single hammy word could fool the filter into
>believing it really is ham.
>--
>Peter Bishop
>pgb at adelard.com
>pgb at csr.city.ac.uk
>
>
>
>---------------------------------------------------------------------
>FAQ: http://bogofilter.sourceforge.net/bogofilter-faq.html
>To unsubscribe, e-mail: bogofilter-unsubscribe at aotto.com
>For summary digest subscription: bogofilter-digest-subscribe at aotto.com
>For more commands, e-mail: bogofilter-help at aotto.com




