singletons

David Relson relson at osagesoftware.com
Mon Sep 8 20:12:37 CEST 2003


On Mon, 08 Sep 2003 10:37:37 -0700
Jef Poskanzer <jef at acme.com> wrote:

> >It is the random strings that are really annoying as they fill up
> >the database but are unlikely ever to be seen again in other
> >emails
> 
> Yeah.  Hey, could bogofilter somehow use a large excess of singleton
> tokens as a spam marker in and of itself?
> 
> Let's see, counting the singletons in my wordlist (50 megabytes,
> 1 million tokens) shows only a slight excess of spam singletons:
>     spam singletons: 437830
>     ham singletons:  310552
> (Note that 3/4s of my wordlist is singletons!)  So that doesn't help.
> However it seems likely that the distribution is not even - some
> varieties of spam have a whole lot of singletons, other spams (and
> hams) have some lower typical number.  Counting singletons in an
> actual corpus of spam and ham should be the next step I guess, but I'm
> not really set up to do that.  Maybe someone else would like to
> experiment.
> ---
> Jef

Jef,

I've noticed that one big source of singletons is message IDs. 
Bogofilter has been discarding many of them, but not all.  With the
recent revisions of the parsing rules (so that unfolded lines are better
processed), I've improved the discard rate :-)

So, if you've got the time and the corpora, an interesting experiment
would be to rebuild your wordlist and count the singletons.  If you _do_
run the experiment, be sure to use 0.15.2 and let us know how it turns
out.
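
The count itself is easy to script.  Below is a minimal sketch in
Python, assuming the wordlist has been dumped to text with one token per
line followed by its spam count and its ham count (roughly what
"bogoutil -d" prints; the exact column layout is an assumption, so
adjust the indices to match your dump).  The name count_singletons is
made up for illustration and is not part of bogofilter.

```python
from collections import Counter


def count_singletons(lines):
    """Tally tokens seen exactly once, split into spam and ham.

    Assumed line layout (hypothetical): token spam_count ham_count [date]
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or malformed line
        try:
            spam, ham = int(fields[1]), int(fields[2])
        except ValueError:
            continue  # counts were not integers; skip the line
        counts["total"] += 1
        if spam == 1 and ham == 0:
            counts["spam_singletons"] += 1
        elif spam == 0 and ham == 1:
            counts["ham_singletons"] += 1
    return counts
```

Passing an open file works, since a file iterates line by line, e.g.
count_singletons(open("wordlist.txt")) where wordlist.txt is whatever
name you gave the dump.  Running it on the wordlist before and after the
rebuild would show how much the improved message ID discarding changes
the totals.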

David



