junk test

Peter Bishop pgb at adelard.com
Wed May 28 19:47:38 CEST 2003


I think I was a bit lax with the junk test.
I think a better test is for a string of consecutive consonants.
The grep command is now:

bogoutil -d .bogofilter/spamlist.db \
| grep -P "[bcdfghjklmnpqrstvwxz]{5}.*\\b1\\b" | wc -l
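
As a rough sketch of the same test split into two reusable steps (the function names junk_tokens and singletons are made up for illustration, and it assumes each line of the bogoutil dump is "token count [date]" separated by whitespace):

```shell
# Consonant set copied from the grep command above (note: no "y").
CONSONANTS='bcdfghjklmnpqrstvwxz'

# Hypothetical helper: keep dump lines whose token (first field)
# contains a run of at least $1 consecutive consonants.
junk_tokens() {
    grep -E "^[^[:space:]]*[$CONSONANTS]{$1}" || true
}

# Hypothetical helper: keep dump lines whose count (second field,
# by assumption) is exactly 1.
singletons() {
    grep -E '^[^[:space:]]+[[:space:]]+1([[:space:]]|$)' || true
}

# Possible usage on a real wordlist:
#   bogoutil -d .bogofilter/spamlist.db | junk_tokens 5 | wc -l
#   bogoutil -d .bogofilter/spamlist.db | junk_tokens 5 | singletons | wc -l
```

Restricting the consonant match to the first field and the count match to the second avoids relying on ".*" between them, at the cost of assuming the dump layout.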

In the list below, the number of singleton tokens of the same type is shown in brackets.

any string     72568  (54987)
consonants{5}  10060   (9861)
consonants{6}   6059   (5968)
consonants{7}   3588   (3559)

Tokens selected for junkiness also have a high probability of being singletons.
Also, there are far fewer junky tokens on the goodlist, see below:

any string     16380   (7739)
consonants{5}    159    (105)
consonants{6}     61     (43)
consonants{7}     10      (7)

So junkiness could be a good discriminator for spam.
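
To put rough numbers on that, the counts from the two lists above can be turned into percentages. This is only a sketch of the arithmetic; pct is a made-up helper name:

```shell
# Hypothetical helper: pct PART WHOLE prints PART as a percentage
# of WHOLE, rounded to one decimal place.
pct() {
    awk -v p="$1" -v w="$2" 'BEGIN { printf "%.1f\n", 100 * p / w }'
}

pct 9861 10060    # singleton share of consonants{5} tokens, spamlist
pct 105 159       # singleton share of consonants{5} tokens, goodlist
pct 10060 72568   # consonants{5} tokens as a share of the spamlist
pct 159 16380     # consonants{5} tokens as a share of the goodlist
```

On these counts, tokens with five consecutive consonants make up about 14% of the spamlist against about 1% of the goodlist, and nearly all of the spamlist ones are singletons.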


-- 
Peter Bishop 
pgb at adelard.com
pgb at csr.city.ac.uk





