singletons
Jef Poskanzer
jef at acme.com
Mon Sep 8 19:37:37 CEST 2003
>It is the random strings that are really annoying as they fill up
>the database but are unlikely ever to be seen again in other emails
Yeah. Hey, could bogofilter somehow use a large excess of singleton
tokens as a spam marker in and of itself?
Let's see, counting the singletons in my wordlist (50 megabytes,
1 million tokens) shows only a slight excess of spam singletons:
spam singletons: 437830
ham singletons: 310552
(Note that 3/4s of my wordlist is singletons!) So that doesn't help.
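A minimal sketch of that kind of count, assuming the wordlist has been
dumped to text lines of the form "token spamcount hamcount [date]" (the
exact dump layout is an assumption, as is the definition used here: a
spam singleton is a token seen once in spam and never in ham, and the
reverse for ham). The function name count_singletons is hypothetical.

```python
def count_singletons(lines):
    """Return (spam_singletons, ham_singletons, total_tokens).

    Each line is assumed to look like: token spamcount hamcount [date].
    Lines that do not fit that shape are skipped.
    """
    spam_single = ham_single = total = 0
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue
        try:
            spam, ham = int(parts[1]), int(parts[2])
        except ValueError:
            continue  # not a count line (assumed format did not match)
        total += 1
        if spam == 1 and ham == 0:
            spam_single += 1
        elif ham == 1 and spam == 0:
            ham_single += 1
    return spam_single, ham_single, total


if __name__ == "__main__":
    import sys
    # Usage: feed the dumped wordlist on standard input.
    s, h, n = count_singletons(sys.stdin)
    print("spam singletons:", s)
    print("ham singletons:", h)
    print("total tokens:", n)
```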
However it seems likely that the distribution is not even - some
varieties of spam have a whole lot of singletons, other spams (and hams)
have some lower typical number. Counting singletons in an actual corpus
of spam and ham should be the next step I guess, but I'm not really set
up to do that. Maybe someone else would like to experiment.
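
For anyone who wants to try it, here is a rough sketch of the per-message
experiment. It assumes the corpus is available as (label, text) pairs and
uses a crude regex tokenizer, which is not bogofilter's lexer. For each
message it reports the fraction of its distinct tokens that appear in no
other message in the corpus. The names tokenize and singleton_fractions
are hypothetical.

```python
import re
from collections import Counter

# Crude stand-in tokenizer (an assumption, not bogofilter's lexer).
TOKEN_RE = re.compile(r"[A-Za-z0-9$'!-]+")


def tokenize(text):
    """Return the set of distinct lowercased tokens in a message."""
    return {t.lower() for t in TOKEN_RE.findall(text)}


def singleton_fractions(messages):
    """messages: list of (label, text). Returns a list of (label, fraction),
    where fraction is the share of a message's distinct tokens that occur
    in exactly one message of the whole corpus."""
    token_sets = [tokenize(text) for _, text in messages]

    # Number of messages each token appears in.
    doc_freq = Counter()
    for tokens in token_sets:
        doc_freq.update(tokens)

    results = []
    for (label, _), tokens in zip(messages, token_sets):
        if not tokens:
            results.append((label, 0.0))
            continue
        singles = sum(1 for t in tokens if doc_freq[t] == 1)
        results.append((label, singles / len(tokens)))
    return results
```

Comparing the distribution of these fractions for spam against ham would
show whether a large excess of singletons separates the two at all. On a
small corpus nearly every token is a singleton, so the numbers only mean
something with a reasonably large sample.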
---
Jef
Jef Poskanzer jef at acme.com http://www.acme.com/jef/