Dealing with wordlist mails

Thu Jan 29 01:16:45 CET 2004

On Wed, 28 Jan 2004, Lars Clausen wrote:

> I saw on my run-through of bogofiltered mail today that a huge number of
> mails had a bunch of random (but not nonsense) words attached.  Many of
> these had bogosity of 0.50000, which is a bad sign, as some ham mails
> come over that.  

I have found that increasing robs improves bf's ability to detect these
mails. Here are my results with the default parameters (robs 0.01):

argo:/home/peak $ bogofilter -m 0.1,0.01,0.415 -vv < /tmp/mail1.txt 
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.500141, version=0.15.13.1
   int  cnt   prob  spamicity histogram
  0.00   41 0.020182 0.005384 ####################
  0.10   29 0.149160 0.028616 ##############
  0.20   58 0.248503 0.088161 ############################
  0.30   58 0.365597 0.157703 ############################
  0.40    0 0.000000 0.157703 
  0.50    0 0.000000 0.157703 
  0.60   40 0.648987 0.243919 ####################
  0.70   22 0.742149 0.290387 ###########
  0.80   13 0.857980 0.323672 #######
  0.90  100 0.992361 0.552124 ################################################

argo:/home/peak $ bogofilter -m 0.1,0.01,0.415 -vv < /tmp/mail2.txt 
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.510041, version=0.15.13.1
   int  cnt   prob  spamicity histogram
  0.00   18 0.008210 0.002171 #################
  0.10    6 0.159286 0.015802 ######
  0.20   16 0.251414 0.064326 ###############
  0.30   16 0.354331 0.121187 ###############
  0.40    0 0.000000 0.121187 
  0.50    0 0.000000 0.121187 
  0.60   13 0.632357 0.205888 ############
  0.70    8 0.732416 0.258337 ########
  0.80    2 0.865552 0.275877 ##
  0.90   53 0.991740 0.567665 ################################################

And here are the results with robs 0.2:

argo:/home/peak $ bogofilter -m 0.1,0.2,0.415 -vv < /tmp/mail1.txt 
X-Bogosity: Yes, tests=bogofilter, spamicity=0.908748, version=0.15.13.1
   int  cnt   prob  spamicity histogram
  0.00   41 0.061124 0.024965 ####################
  0.10   26 0.150323 0.047998 #############
  0.20   61 0.253550 0.116299 ##############################
  0.30   58 0.366683 0.186674 ############################
  0.40    0 0.000000 0.186674 
  0.50    0 0.000000 0.186674 
  0.60   40 0.646784 0.275718 ####################
  0.70   22 0.739597 0.322993 ###########
  0.80   13 0.856820 0.357008 #######
  0.90  100 0.932070 0.532512 ################################################

argo:/home/peak $ bogofilter -m 0.1,0.2,0.415 -vv < /tmp/mail2.txt 
X-Bogosity: Yes, tests=bogofilter, spamicity=0.971396, version=0.15.13.1
   int  cnt   prob  spamicity histogram
  0.00   18 0.058227 0.024530 #################
  0.10    6 0.166771 0.041544 ######
  0.20   16 0.258334 0.098197 ###############
  0.30   16 0.355187 0.158477 ###############
  0.40    0 0.000000 0.158477 
  0.50    0 0.000000 0.158477 
  0.60   13 0.632079 0.250214 ############
  0.70    8 0.731810 0.305476 ########
  0.80    2 0.865455 0.323998 ##
  0.90   53 0.916327 0.560231 ################################################

It appears that a small value of robs makes bf extremely sensitive to
"low n" tokens (tokens seen only a few times in training), especially
singletons. A message full of random words is likely to hit many of
these tokens, so it will contain both strong ham indicators and strong
spam indicators, and it will get a score near 0.5. A higher value of
robs makes "low n" tokens less significant and lets bf pay more
attention to the other tokens.
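
The effect on a single token can be sketched with Robinson's smoothing
formula, f(w) = (s*x + n*p) / (s + n), which I am assuming is what bf
applies to each token. Here s is robs, x is robx, n is the number of
times the token was seen in training, and p is its raw spam probability.
The function name and the singleton inputs below are illustrative only;
they are not taken from bf's source or from the two test mails.

```python
# Minimal sketch of Robinson's smoothing, assuming bf computes per-token
# scores this way. smoothed_prob is a hypothetical helper name.
def smoothed_prob(p, n, robs, robx=0.415):
    """Pull the raw probability p toward robx; robs sets the strength
    of that pull relative to the token count n."""
    return (robs * robx + n * p) / (robs + n)


# Compare the two robs values used in the runs above on singletons (n = 1).
for robs in (0.01, 0.2):
    spam_singleton = smoothed_prob(p=1.0, n=1, robs=robs)  # seen once, in spam
    ham_singleton = smoothed_prob(p=0.0, n=1, robs=robs)   # seen once, in ham
    print(f"robs={robs}: spam singleton {spam_singleton:.4f}, "
          f"ham singleton {ham_singleton:.4f}")
```

With the larger robs the singleton scores move away from 0 and 1 toward
robx, which is consistent with the shift in the prob column of the 0.00
and 0.90 rows in the histograms above.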

YMMV.

--Pavel Kankovsky aka Peak  [ Boycott Microsoft--http://www.vcnet.com/bms ]
"Resistance is futile. Open your source code and prepare for assimilation."