junk test
Peter Bishop
pgb at adelard.com
Wed May 28 19:47:38 CEST 2003
I think I was a bit lax with the junk test.
I think a better test is to look for a string of consecutive consonants.
The grep command is now:
bogoutil -d .bogofilter/spamlist.db \
| grep -P "[bcdfghjklmnpqrstvwxz]{5}.*\b1\b" | wc -l
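
For anyone who wants to try the same test outside the shell, here is a rough Python sketch of it. It assumes each line of the dump is "token count", optionally followed by other fields; that layout is my assumption, so check it against your own bogoutil output. The name count_junk is made up for this example. Unlike the grep, it looks for the consonant run in the token field only, and it treats a token as a singleton when its count field is exactly 1.

```python
import re

# Same consonant class as the grep pattern above (note that 'y' is not in it).
CONSONANTS = "bcdfghjklmnpqrstvwxz"


def count_junk(lines, run_length=5, singletons_only=False):
    """Count tokens containing a run of `run_length` consecutive consonants.

    Assumes each line looks like "token count [other fields]".
    """
    pattern = re.compile("[%s]{%d}" % (CONSONANTS, run_length))
    total = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip lines that do not fit the assumed layout
        token, count = fields[0], fields[1]
        if not pattern.search(token):
            continue
        if singletons_only and count != "1":
            continue
        total += 1
    return total


# Made-up sample lines, not real wordlist data.
sample = [
    "xkqzwplm 1",
    "hello 12",
    "strngthx 3",
    "free 1",
    "qwrtpsdf 1 20030528",
]
print(count_junk(sample))                        # 3
print(count_junk(sample, singletons_only=True))  # 2
```

To run it on a real list, feed it the lines of the bogoutil dump instead of the sample.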
In the list below (for the spamlist), singleton tokens of the same type are shown in brackets.
any string      72568 (54987)
consonants{5}   10060  (9861)
consonants{6}    6059  (5968)
consonants{7}    3588  (3559)
Tokens selected for junkiness also have a high probability of being singletons.
There are also fewer junky tokens on the goodlist, see below:
any string      16380  (7739)
consonants{5}     159   (105)
consonants{6}      61    (43)
consonants{7}      10     (7)
So junkiness could be a good discriminator for spam.
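
To put rough numbers on that, the counts in the two lists can be turned into proportions. The short Python sketch below only redoes the arithmetic on the figures quoted in this message; the helper names junk_share and singleton_share are invented for the example.

```python
# Counts copied from the two lists in this message: (all tokens, singletons).
spam = {"any": (72568, 54987), 5: (10060, 9861), 6: (6059, 5968), 7: (3588, 3559)}
good = {"any": (16380, 7739), 5: (159, 105), 6: (61, 43), 7: (10, 7)}


def junk_share(table, run_length):
    """Fraction of all tokens in a list that contain the consonant run."""
    return table[run_length][0] / table["any"][0]


def singleton_share(table, key):
    """Fraction of the tokens in one row that are singletons."""
    total, singles = table[key]
    return singles / total


for name, table in (("spamlist", spam), ("goodlist", good)):
    print(name)
    print("  junky tokens, consonants{5}: %.1f%%" % (100 * junk_share(table, 5)))
    print("  singletons among all tokens: %.1f%%" % (100 * singleton_share(table, "any")))
    print("  singletons among junky ones: %.1f%%" % (100 * singleton_share(table, 5)))
```

On these figures, about 13.9% of spamlist tokens pass the consonants{5} test against about 1.0% of goodlist tokens, and about 98.0% of the junky spamlist tokens are singletons compared with 75.8% of spamlist tokens overall.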
--
Peter Bishop
pgb at adelard.com
pgb at csr.city.ac.uk