junk test

John McCain jmccain at layer3al.com
Wed May 28 20:48:25 CEST 2003


So will Bogofilter's natural behavior be to raise the spamminess score 
as the number of junk tokens in a message increases?

On Wednesday 28 May 2003 12:47 pm, Peter Bishop wrote:
> I think I was a bit lax with the junk test.
> A better test is for a string of consecutive consonants.
> The grep command is now:
>
> bogoutil -d .bogofilter/spamlist.db \
>   | grep -P "[bcdfghjklmnpqrstvwxz]{5}.*\b1\b" | wc -l
>
> In the spamlist counts below, singleton tokens of the same type are
> shown in brackets:
>
> any string     72568  (54987)
> consonants{5}  10060   (9861)
> consonants{6}   6059   (5968)
> consonants{7}   3588   (3559)
>
> Tokens selected for junkiness also have a high probability of being
> singletons. There are also far fewer junky tokens on the goodlist, see below:
>
> any string     16380   (7739)
> consonants{5}    159    (105)
> consonants{6}     61     (43)
> consonants{7}     10      (7)
>
> So junkiness could be a good discriminator for spam
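
To put rough numbers on the quoted counts, here is a minimal sketch that is not from the original thread. The helper names `count_junk` and `pct` are hypothetical. It assumes a dump with the token as the first whitespace-separated field, and it uses `grep -E` in place of the `grep -P` from the quoted command, purely for portability.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of bogofilter or bogoutil.

# count_junk N: read a token dump on stdin (token assumed to be the first
# whitespace-separated field) and print how many tokens contain a run of
# at least N consonants. "y" is left out of the class, as in the post.
count_junk() {
    awk '{ print $1 }' | grep -Ec "[bcdfghjklmnpqrstvwxz]{$1}" || true
}

# pct PART TOTAL: print PART as a percentage of TOTAL, one decimal place.
pct() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * part / total }'
}

# Figures quoted above for consonants{5}:
echo "spamlist: $(pct 10060 72568)%"
echo "goodlist: $(pct 159 16380)%"
```

On the quoted figures, tokens with a run of five consonants make up roughly 13.9% of the spamlist but only about 1.0% of the goodlist. Something like `bogoutil -d .bogofilter/spamlist.db | count_junk 5` should reproduce the unbracketed counts, assuming the dump format matches the assumption above.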




