singletons
    Jef Poskanzer 
    jef at acme.com
       
    Mon Sep  8 19:37:37 CEST 2003
    
    
  
>It is the random strings that are really annoying as they fill up
>the database but are unlikely ever to be seen again in other emails
Yeah.  Hey, could bogofilter somehow use a large excess of singleton
tokens as a spam marker in and of itself?
Let's see, counting the singletons in my wordlist (50 megabytes,
1 million tokens) shows only a slight excess of spam singletons:
    spam singletons: 437830
    ham singletons:  310552
(Note that about 3/4 of my wordlist is singletons!)  So that doesn't help.
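
For anyone who wants to repeat the count on their own wordlist, here is a rough sketch in Python. It assumes the wordlist has been dumped to plain text with one "token spam_count ham_count" entry per line (that layout is an assumption, adjust the field positions to whatever your dump looks like), and it assumes a "singleton" means a token seen exactly once in total. The function name is made up for illustration.

```python
from typing import Iterable, Tuple


def count_singletons(dump_lines: Iterable[str]) -> Tuple[int, int, int]:
    """Return (spam_singletons, ham_singletons, total_tokens).

    Assumed input: lines of the form "token spam_count ham_count",
    optionally followed by extra fields that are ignored.
    """
    spam_single = ham_single = total = 0
    for line in dump_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        try:
            spam_count, ham_count = int(fields[1]), int(fields[2])
        except ValueError:
            continue  # skip lines whose counts are not integers
        total += 1
        # A singleton here is a token seen exactly once overall.
        if spam_count == 1 and ham_count == 0:
            spam_single += 1
        elif spam_count == 0 and ham_count == 1:
            ham_single += 1
    return spam_single, ham_single, total
```

With the figures quoted above, 437830 + 310552 = 748382 singletons, which is where the "about 3/4 of roughly 1 million tokens" comes from.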
However it seems likely that the distribution is not even - some
varieties of spam have a whole lot of singletons, other spams (and hams)
have some lower typical number.  Counting singletons in an actual corpus
of spam and ham should be the next step I guess, but I'm not really set
up to do that.  Maybe someone else would like to experiment.
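
A minimal sketch of what that per-message experiment might look like, in Python. The tokenizer is a crude regular-expression stand-in and not bogofilter's lexer, and "singleton" is taken here to mean a token that appears in exactly one message of the corpus; both are assumptions, as are the function names.

```python
import re
from collections import Counter
from typing import List, Set

# Crude stand-in tokenizer (assumption): runs of 3+ word-ish characters.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'!-]{3,}")


def tokenize(text: str) -> Set[str]:
    """Return the set of distinct lowercased tokens in one message."""
    return {t.lower() for t in TOKEN_RE.findall(text)}


def singleton_counts(messages: List[str]) -> List[int]:
    """For each message, count distinct tokens found in no other message."""
    token_sets = [tokenize(m) for m in messages]
    # Document frequency: in how many messages does each token occur?
    doc_freq = Counter(tok for toks in token_sets for tok in toks)
    return [sum(1 for tok in toks if doc_freq[tok] == 1) for toks in token_sets]
```

Run it over the spam and ham messages together, then compare the per-message counts (or the counts divided by message length) between the two groups to see whether some varieties of spam really do stand out.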
---
Jef
         Jef Poskanzer  jef at acme.com  http://www.acme.com/jef/
    
    