Including html-tag contents may be unnecessary
Greg Louis
glouis at dynamicro.on.ca
Sun May 11 19:46:07 CEST 2003
An experiment was performed to determine whether including contents of
html tags (supported in bogofilter 0.12.3) improves discrimination
between spam and nonspam.
The machine on which this experiment was run had, in a directory
referred to below as ../smindev, files t.sp and t.ns (for training,
containing spams and nonspams respectively), and r0.sp, r1.sp, r0.ns
and r1.ns (for testing).
Training was performed, without and with html tag contents:
mkdir normdb htmldb
bogofilter -d normdb -sv <../smindev/t.sp
# 3494475 words, 9462 messages
bogofilter -d normdb -nv <../smindev/t.ns
# 10766650 words, 26732 messages
bogofilter -d htmldb -Ht -sv <../smindev/t.sp
# 5427876 words, 9462 messages
bogofilter -d htmldb -Ht -nv <../smindev/t.ns
# 10805680 words, 26732 messages
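The word counts reported in the comments above can be turned into
relative increases with a little arithmetic. The following is only a
sketch; pct_increase is an illustrative helper name, not something
bogofilter provides.

```shell
# Percentage increase from a baseline count to a later count,
# printed with one decimal place.
pct_increase() {
    awk -v base="$1" -v cur="$2" \
        'BEGIN { printf "%.1f\n", 100 * (cur - base) / base }'
}

pct_increase 3494475 5427876      # spam training words, normdb vs htmldb
pct_increase 10766650 10805680    # nonspam training words, normdb vs htmldb
```

The first argument is the count without html tag contents and the
second is the count with them.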
As expected, extracting html tag contents added many more tokens to
the spam training set (about 55% more words) than to the nonspam
training set (about 0.4% more). Next the test files were classified:
for num in 0 1; do
    bogofilter -d normdb -Mv <../smindev/r$num.ns >r$num.ns.norm
    bogofilter -d normdb -Mv <../smindev/r$num.sp >r$num.sp.norm
    bogofilter -d htmldb -Ht -Mv <../smindev/r$num.ns >r$num.ns.html
    bogofilter -d htmldb -Ht -Mv <../smindev/r$num.sp >r$num.sp.html
done
These are the numbers of messages in the r?.ns and r?.sp files:
wc -l *.norm
13365 r0.ns.norm
4732 r0.sp.norm
13366 r1.ns.norm
4730 r1.sp.norm
There were some false positives in the .ns files:
grep -c '^1' *.ns.*
r0.ns.html:5
r0.ns.norm:4
r1.ns.html:9
r1.ns.norm:6
And of course there were some false negatives:
grep -c -v '^1' *.sp.*
r0.sp.html:124
r0.sp.norm:140
r1.sp.html:130
r1.sp.norm:151
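To put these counts in proportion, they can be expressed as
percentages of the messages tested. This is a minimal sketch using
the r0 counts and the message totals listed earlier; error_pct is an
illustrative helper name.

```shell
# Error count as a percentage of the messages tested,
# printed with three decimal places.
error_pct() {
    awk -v errors="$1" -v total="$2" \
        'BEGIN { printf "%.3f\n", 100 * errors / total }'
}

# r0, default cutoff: false positives among 13365 nonspams
error_pct 4 13365     # without html tag contents
error_pct 5 13365     # with html tag contents

# r0, default cutoff: false negatives among 4732 spams
error_pct 140 4732    # without html tag contents
error_pct 124 4732    # with html tag contents
```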
It seems that including contents of html tags shifts the distribution
of scores: at the default cutoff the html runs gave more false
positives (14 against 10 over both test sets) and fewer false
negatives (254 against 291). To compare the false-negative counts
fairly, the spam cutoff must be shifted so that both methods give
roughly the same number of false positives. The default spam cutoff
was 0.65, so the classification with html tag contents was repeated
with cutoff 0.75:
for num in 0 1; do
    bogofilter -d htmldb -Ht -o 0.75 -Mv <../smindev/r$num.ns >r$num.ns.html
    bogofilter -d htmldb -Ht -o 0.75 -Mv <../smindev/r$num.sp >r$num.sp.html
done
This did give comparable false-positive counts:
grep -c '^1' *.ns.*
r0.ns.html:4
r0.ns.norm:4
r1.ns.html:7
r1.ns.norm:6
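How the value 0.75 was arrived at is not recorded here. One way such
a cutoff could be searched for is sketched below, under the
assumption that the same bogofilter options and paths as above are in
use. The helper names count_flagged and sweep_cutoffs are
hypothetical, and the candidate cutoffs in the example call are
arbitrary.

```shell
# Count lines beginning with "1", the marker the grep commands above
# use for messages classified as spam. Reads classifier output on stdin.
count_flagged() {
    grep -c '^1' || true    # grep exits nonzero on a count of 0; ignore that
}

# For each candidate cutoff, print the cutoff and the number of false
# positives summed over both nonspam test files.
sweep_cutoffs() {
    for cutoff in "$@"; do
        total=0
        for num in 0 1; do
            n=$(bogofilter -d htmldb -Ht -o "$cutoff" -Mv \
                    <"../smindev/r$num.ns" | count_flagged)
            total=$((total + n))
        done
        echo "$cutoff $total"
    done
}

# Example call (not run here): sweep_cutoffs 0.65 0.70 0.75 0.80
```

The target for such a sweep would be the 10 false positives (4 + 6)
obtained without html tag contents.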
In these circumstances the false-negative counts were similar too:
grep -c -v '^1' *.sp.*
r0.sp.html:135
r0.sp.norm:140
r1.sp.html:148
r1.sp.norm:151
Once the shift in distribution is taken into account, including
contents of html tags did not significantly improve discrimination.
R was used to run an analysis of variance, which suggests that the
difference is probably insignificant statistically as well as
practically:
html <- data.frame(html=c("n","n","y","y"),
run=c(0,1,0,1), fp=c(4,6,4,7), fn=c(140,151,135,148))
html$ns <- c(13365, 13366, 13365, 13366)
html$sp <- c(4732, 4730, 4732, 4730)
html$pc <- 100*html$fp/html$ns + 100*html$fn/html$sp
print(html,digits=3)
  html run fp  fn    ns   sp   pc
1    n   0  4 140 13365 4732 2.99
2    n   1  6 151 13366 4730 3.24
3    y   0  4 135 13365 4732 2.88
4    y   1  7 148 13366 4730 3.18
html$html <- factor(html$html)
html$run <- factor(html$run)
summary(aov(pc ~ html+run, data=html))
            Df   Sum Sq  Mean Sq F value  Pr(>F)
html         1 0.006529 0.006529  10.565 0.19001
run          1 0.074874 0.074874 121.149 0.05768 .
Residuals    1 0.000618 0.000618
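The sums of squares and F values above can be cross-checked by hand.
With two factors at two levels each and one observation per cell,
each sum of squares reduces to a squared contrast of the four pc
values divided by four. The sketch below does this in awk; it is not
the method R uses internally, and combined_pct and anova_2x2 are
illustrative names.

```shell
# Combined error percentage, as in the pc column:
# 100*fp/ns + 100*fn/sp
combined_pct() {
    awk -v fpos="$1" -v fneg="$2" -v ns="$3" -v sp="$4" \
        'BEGIN { printf "%.10f\n", 100 * fpos / ns + 100 * fneg / sp }'
}

# Arguments: pc for (html=n,run=0), (n,1), (y,0), (y,1).
anova_2x2() {
    awk -v a="$1" -v b="$2" -v c="$3" -v d="$4" 'BEGIN {
        ss_html  = ((c + d) - (a + b)) ^ 2 / 4   # html main effect, 1 df
        ss_run   = ((b + d) - (a + c)) ^ 2 / 4   # run main effect, 1 df
        ss_resid = (a - b - c + d) ^ 2 / 4       # remaining contrast, 1 df
        printf "html SS=%.6f F=%.1f\n", ss_html, ss_html / ss_resid
        printf "run SS=%.6f F=%.1f\n", ss_run, ss_run / ss_resid
        printf "resid SS=%.6f\n", ss_resid
    }'
}

anova_2x2 "$(combined_pct 4 140 13365 4732)" \
          "$(combined_pct 6 151 13366 4730)" \
          "$(combined_pct 4 135 13365 4732)" \
          "$(combined_pct 7 148 13366 4730)"
```

The printed values should agree with the Sum Sq and F value columns
to the precision shown. Because each term has a single degree of
freedom, the mean squares equal the sums of squares.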
--
| G r e g L o u i s | gpg public key: finger |
| http://www.bgl.nu/~glouis | glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |