3 char tokens
David Relson
relson at osagesoftware.com
Wed Dec 10 17:35:29 CET 2003
My wordlist currently has 1,041,090 tokens in it. Of them, 27,285 are
short (3-character) tokens. I was curious as to their scores, i.e.
whether they were hammish, spammish, or neutral. Here's a histogram:
int   cnt    prob      spamicity  histogram
0.00  14224  0.004770  0.002779   ################################################
0.10    382  0.151431  0.005297   ##
0.20    422  0.248245  0.010070   ##
0.30    357  0.337225  0.015794   ##
0.40    536  0.433482  0.027359   ##
0.50    992  0.572266  0.057261   ####
0.60    267  0.654322  0.066722   #
0.70    527  0.742015  0.089286   ##
0.80    528  0.841987  0.117661   ##
0.90   7368  0.993478  0.459046   #########################
It was generated with (roughly):
bogoutil -d wordlist.db | egrep "^... " | awk '{print $1}' |
    bogofilter -C -d . -vv -F
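
For anyone who wants to bin scores by hand, here is a rough Python sketch of the same kind of histogram. It is not bogofilter's code. It assumes you already have one score per token as a float in [0, 1], that "prob" is the mean score of the tokens in each interval, and that the bar is scaled to 48 characters and rounded up. The function names are made up for illustration, and the spamicity column is not reproduced.

```python
import math


def score_histogram(scores, bins=10, width=48):
    """Bin scores in [0, 1] into equal intervals (hypothetical helper).

    Returns one (interval_start, count, mean_score, bar) tuple per bin.
    """
    counts = [0] * bins
    totals = [0.0] * bins
    for s in scores:
        # A score of exactly 1.0 goes into the last bin.
        i = min(int(s * bins), bins - 1)
        counts[i] += 1
        totals[i] += s

    peak = max(counts)
    rows = []
    for i in range(bins):
        mean = totals[i] / counts[i] if counts[i] else 0.0
        # Scale bars so the fullest bin is `width` characters; round up
        # so that any non-empty bin shows at least one '#'.
        bar = "#" * math.ceil(counts[i] * width / peak) if peak else ""
        rows.append((i / bins, counts[i], mean, bar))
    return rows


def format_histogram(rows):
    """Render the rows as plain text, one line per interval."""
    lines = ["int   cnt    prob      histogram"]
    for start, count, mean, bar in rows:
        lines.append(f"{start:.2f}  {count:5d}  {mean:.6f}  {bar}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up scores, only to show the output shape.
    print(format_histogram(score_histogram([0.01, 0.02, 0.55, 0.99, 1.0])))
```

Rounding the bar length up rather than to nearest is a guess. It keeps sparse intervals visible next to the two large ones at either end.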