junk test
John McCain
jmccain at layer3al.com
Wed May 28 20:48:25 CEST 2003
So will Bogofilter's natural behavior be to raise the spamminess score
as the number of junk tokens in a message increases?
On Wednesday 28 May 2003 12:47 pm, Peter Bishop wrote:
> I think I was a bit lax with the junk test.
> I think a better test is for a string of consecutive consonants.
> The grep command is now:
>
> bogoutil -d .bogofilter/spamlist.db \
>   | grep -P "[bcdfghjklmnpqrstvwxz]{5}.*\b1\b" | wc -l
>
> In the list below (from the spamlist), singleton tokens of the same type
> are shown in brackets.
>
> pattern          tokens  (singletons)
> any string        72568  (54987)
> consonants{5}     10060  (9861)
> consonants{6}      6059  (5968)
> consonants{7}      3588  (3559)
>
> Tokens selected for junkiness also have a high probability of being
> singletons. There are also far fewer junky tokens on the goodlist, see below.
>
> pattern          tokens  (singletons)
> any string        16380  (7739)
> consonants{5}       159  (105)
> consonants{6}        61  (43)
> consonants{7}        10  (7)
>
> So junkiness could be a good discriminator for spam.
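
For anyone wanting to reproduce counts like the ones quoted above, here is a
rough sketch that produces both numbers for a given run length in one pass.
It is not Peter's actual script: count_junk is a hypothetical helper name,
and it assumes each line of the dump has the token in the first field and
its count in the second. If your dump format differs, the field numbers need
adjusting.

```shell
# Hypothetical helper (not part of bogofilter): reads a wordlist dump on
# stdin, assumed to be "token count ..." per line, and prints
# "<tokens containing a run of N consonants> <singletons among them>".
count_junk() {
    awk -v n="$1" '
        BEGIN {
            # Build the run by repeating the character class N times,
            # since not every awk supports {n} interval expressions.
            for (i = 0; i < n; i++) re = re "[bcdfghjklmnpqrstvwxz]"
        }
        # Count every matching token; a singleton is one whose count is 1.
        $1 ~ re { total++; if ($2 == 1) single++ }
        END { printf "%d %d\n", total, single }
    '
}
```

Usage would be something like "bogoutil -d .bogofilter/spamlist.db | count_junk 5".
From the quoted figures, tokens with five or more consecutive consonants make
up roughly 14% of the spamlist (10060 of 72568) but only about 1% of the
goodlist (159 of 16380).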