singletons

Jef Poskanzer jef at acme.com
Mon Sep 8 19:37:37 CEST 2003


>It is the random strings that are really annoying, as they fill up
>the database but are unlikely ever to be seen again in other emails.

Yeah.  Hey, could bogofilter somehow use a large excess of singleton
tokens as a spam marker in and of itself?

Let's see, counting the singletons in my wordlist (50 megabytes,
1 million tokens) shows only a slight excess of spam singletons:
    spam singletons: 437830
    ham singletons:  310552
(Note that about 3/4 of my wordlist is singletons!)  So the aggregate
counts alone don't help.
However, it seems likely that the distribution is not even: some
varieties of spam have a whole lot of singletons, while other spams (and
hams) have some lower typical number.  Counting singletons in an actual corpus
of spam and ham should be the next step I guess, but I'm not really set
up to do that.  Maybe someone else would like to experiment.
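
For anyone who wants to try, here is a rough sketch of that experiment in
Python.  It is not bogofilter code.  The names tokenize() and
singleton_fractions() are made up for illustration, the tokenizer is
plain whitespace splitting rather than bogofilter's real lexer, and a
"singleton" is assumed to mean a token that occurs in exactly one
message of the corpus.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split. Bogofilter's lexer does
    # much more (headers, MIME decoding, etc.); this is only a stand-in.
    return text.lower().split()


def singleton_fractions(messages):
    """For each message, return the fraction of its distinct tokens that
    appear in no other message of the corpus."""
    token_sets = [set(tokenize(m)) for m in messages]

    # Document frequency: in how many messages does each token occur?
    doc_freq = Counter(tok for toks in token_sets for tok in toks)

    fractions = []
    for toks in token_sets:
        if not toks:
            # An empty message has no tokens, so report 0.0.
            fractions.append(0.0)
            continue
        singles = sum(1 for tok in toks if doc_freq[tok] == 1)
        fractions.append(singles / len(toks))
    return fractions
```

Run it over the combined spam and ham corpus, then compare the
per-message fractions for the two classes.  If the guess above is right,
some spams should sit well above the typical ham value even though the
totals look similar.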
---
Jef

         Jef Poskanzer  jef at acme.com  http://www.acme.com/jef/



