3 char tokens

David Relson relson at osagesoftware.com
Wed Dec 10 17:35:29 CET 2003


My wordlist currently has 1,041,090 tokens in it.  Of those, 27,285 are
short (3-character) tokens.  I was curious as to their scores, i.e.
whether they were hammish, spammish, or neutral.  Here's a histogram:

   int   cnt   prob  spamicity histogram
  0.00 14224 0.004770 0.002779 ################################################
  0.10   382 0.151431 0.005297 ##
  0.20   422 0.248245 0.010070 ##
  0.30   357 0.337225 0.015794 ##
  0.40   536 0.433482 0.027359 ##
  0.50   992 0.572266 0.057261 ####
  0.60   267 0.654322 0.066722 #
  0.70   527 0.742015 0.089286 ##
  0.80   528 0.841987 0.117661 ##
  0.90  7368 0.993478 0.459046 #########################
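
For anyone wanting to reproduce the "int" and "cnt" columns by hand, here
is a minimal sketch.  It assumes input lines of the form "token score",
which is a made-up format for illustration and not bogofilter's actual
-vv output, and bin_scores is a hypothetical helper name.  The bar
lengths in the table above are consistent with ceil(cnt * 48 / max), so
the sketch scales bars that way; that is an observation from the numbers,
not something confirmed from the bogofilter source.

```shell
# Hypothetical helper: reads "token score" pairs on stdin (assumed format)
# and prints one line per 0.1-wide interval: lower bound, count, bar.
bin_scores() {
    awk '
        {
            b = int($2 * 10)      # 0.00-0.09 -> bin 0, ..., 0.90-1.00 -> bin 9
            if (b > 9) b = 9      # a score of exactly 1.0 goes in the top bin
            cnt[b]++
            if (cnt[b] > max) max = cnt[b]
        }
        END {
            for (i = 0; i < 10; i++) {
                n = cnt[i] + 0
                # ceil(n * 48 / max): the fullest bin gets a 48-character bar
                len = (max > 0) ? int((n * 48 + max - 1) / max) : 0
                bar = ""
                for (j = 0; j < len; j++) bar = bar "#"
                printf "%4.2f %5d %s\n", i / 10, n, bar
            }
        }'
}
```

The "prob" and "spamicity" columns are left out since their exact
definitions are not spelled out here.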

It was generated with (roughly):

bogoutil -d wordlist.db | egrep "^... " | awk '{print $1}' |
    bogofilter -C -d . -vv -F
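
The first stages of that pipeline just pick out the 3-character tokens
from the dump.  A small sketch for counting them follows; it assumes each
dump line starts with the token followed by whitespace-separated fields,
and count_short_tokens is a hypothetical helper name.

```shell
# Hypothetical helper: counts lines on stdin whose first field is exactly
# 3 characters long (character length depends on the current locale).
count_short_tokens() {
    awk 'length($1) == 3 { n++ } END { print n + 0 }'
}

# Usage, assuming wordlist.db is in the current directory:
#   bogoutil -d wordlist.db | count_short_tokens
```

Unlike the egrep pattern, which matches any three characters followed by
a space, this checks the length of the first whitespace-delimited field.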




