Dealing with wordlist mails
Pavel Kankovsky
peak at argo.troja.mff.cuni.cz
Thu Jan 29 01:16:45 CET 2004
On Wed, 28 Jan 2004, Lars Clausen wrote:
> I saw on my run-through of bogofiltered mail today that a huge number of
> mails had a bunch of random (but not nonsense) words attached. Many of
> these had bogosity of 0.50000, which is a bad sign, as some ham mails
> come over that.
I have found that increasing robs improves bf's ability to detect these
mails. Here are my results with the default parameters (robs = 0.01):
argo:/home/peak $ bogofilter -m 0.1,0.01,0.415 -vv < /tmp/mail1.txt
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.500141, version=0.15.13.1
int   cnt  prob      spamicity  histogram
0.00   41  0.020182  0.005384   ####################
0.10   29  0.149160  0.028616   ##############
0.20   58  0.248503  0.088161   ############################
0.30   58  0.365597  0.157703   ############################
0.40    0  0.000000  0.157703
0.50    0  0.000000  0.157703
0.60   40  0.648987  0.243919   ####################
0.70   22  0.742149  0.290387   ###########
0.80   13  0.857980  0.323672   #######
0.90  100  0.992361  0.552124   ################################################
argo:/home/peak $ bogofilter -m 0.1,0.01,0.415 -vv < /tmp/mail2.txt
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.510041, version=0.15.13.1
int   cnt  prob      spamicity  histogram
0.00   18  0.008210  0.002171   #################
0.10    6  0.159286  0.015802   ######
0.20   16  0.251414  0.064326   ###############
0.30   16  0.354331  0.121187   ###############
0.40    0  0.000000  0.121187
0.50    0  0.000000  0.121187
0.60   13  0.632357  0.205888   ############
0.70    8  0.732416  0.258337   ########
0.80    2  0.865552  0.275877   ##
0.90   53  0.991740  0.567665   ################################################
And here are the results with robs 0.2:
argo:/home/peak $ bogofilter -m 0.1,0.2,0.415 -vv < /tmp/mail1.txt
X-Bogosity: Yes, tests=bogofilter, spamicity=0.908748, version=0.15.13.1
int   cnt  prob      spamicity  histogram
0.00   41  0.061124  0.024965   ####################
0.10   26  0.150323  0.047998   #############
0.20   61  0.253550  0.116299   ##############################
0.30   58  0.366683  0.186674   ############################
0.40    0  0.000000  0.186674
0.50    0  0.000000  0.186674
0.60   40  0.646784  0.275718   ####################
0.70   22  0.739597  0.322993   ###########
0.80   13  0.856820  0.357008   #######
0.90  100  0.932070  0.532512   ################################################
argo:/home/peak $ bogofilter -m 0.1,0.2,0.415 -vv < /tmp/mail2.txt
X-Bogosity: Yes, tests=bogofilter, spamicity=0.971396, version=0.15.13.1
int   cnt  prob      spamicity  histogram
0.00   18  0.058227  0.024530   #################
0.10    6  0.166771  0.041544   ######
0.20   16  0.258334  0.098197   ###############
0.30   16  0.355187  0.158477   ###############
0.40    0  0.000000  0.158477
0.50    0  0.000000  0.158477
0.60   13  0.632079  0.250214   ############
0.70    8  0.731810  0.305476   ########
0.80    2  0.865455  0.323998   ##
0.90   53  0.916327  0.560231   ################################################
It appears that a small value of robs makes bf extremely sensitive to
"low n" tokens (tokens seen only a few times in training), especially
singletons. A message full of random words is likely to hit many of these
tokens, so it ends up full of both strong ham indicators and strong spam
indicators and gets a score near 0.5. A higher value of robs makes "low n"
tokens less significant and lets bf pay more attention to the other tokens.
YMMV.
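
To make the effect of robs on "low n" tokens concrete, here is a minimal
sketch of Robinson's smoothing formula, f(w) = (s*x + n*p) / (s + n), where
s is robs, x is robx, n is the number of times the token was seen in
training and p is its raw spam probability. The function name
smoothed_prob and the sample tokens are made up for illustration, and
bogofilter's actual implementation may differ in detail.

```python
def smoothed_prob(p, n, s, x=0.415):
    """Robinson's f(w): pull the raw probability p toward the prior x.

    p: raw spam probability of the token (0.0 = only seen in ham,
       1.0 = only seen in spam)
    n: number of times the token was seen in training
    s: strength of the prior (robs)
    x: probability assumed for an unseen token (robx)
    """
    return (s * x + n * p) / (s + n)


if __name__ == "__main__":
    # Hypothetical tokens: two singletons and one frequently seen spam token.
    tokens = [
        ("spam-only singleton", 1.0, 1),
        ("ham-only singleton", 0.0, 1),
        ("spam-only, seen 100 times", 1.0, 100),
    ]
    for robs in (0.01, 0.2):
        print(f"robs = {robs}")
        for label, p, n in tokens:
            print(f"  {label:28s} f(w) = {smoothed_prob(p, n, robs):.6f}")
```

With robs = 0.01 a spam-only singleton scores about 0.994 and a ham-only
singleton about 0.004, so both look like very strong evidence. With
robs = 0.2 they move to about 0.90 and 0.07, while a token seen 100 times
barely changes. That is in line with the shift in the 0.00 and 0.90 rows of
the histograms above.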
--Pavel Kankovsky aka Peak [ Boycott Microsoft--http://www.vcnet.com/bms ]
"Resistance is futile. Open your source code and prepare for assimilation."