Including html-tag contents may be unnecessary

Sun May 11 19:46:07 CEST 2003

An experiment was performed to determine whether including contents of
html tags (supported in bogofilter 0.12.3) improves discrimination
between spam and nonspam.

The test corpus was kept in a directory referred to below as
../smindev.  It held the training files t.sp and t.ns (containing spams
and nonspams respectively) and the test files r0.sp, r1.sp, r0.ns and
r1.ns.

Two databases were trained: normdb without html tag contents, and
htmldb with them (option -Ht):

mkdir normdb htmldb

bogofilter -d normdb -sv <../smindev/t.sp
# 3494475 words, 9462 messages

bogofilter -d normdb -nv <../smindev/t.ns
# 10766650 words, 26732 messages

bogofilter -d htmldb -Ht -sv <../smindev/t.sp
# 5427876 words, 9462 messages

bogofilter -d htmldb -Ht -nv <../smindev/t.ns
# 10805680 words, 26732 messages
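
The growth in word counts can be put in relative terms.  A minimal
sketch with awk, where pct_increase is a hypothetical helper name (not
part of bogofilter) and the figures are the ones reported above:

```shell
# pct_increase OLD NEW: percentage growth from OLD to NEW, two decimals.
# (hypothetical helper, not part of bogofilter)
pct_increase() {
  awk -v old="$1" -v new="$2" \
    'BEGIN { printf "%.2f\n", 100 * (new - old) / old }'
}

pct_increase 3494475 5427876     # spam words, without vs with -Ht
pct_increase 10766650 10805680   # nonspam words, without vs with -Ht
```

The first call reports growth of about 55%, the second well under 1%.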

As expected, extracting html tag contents produced many more tokens
from the spam messages (5427876 against 3494475, about 55% more) and a
proportionately much smaller increase among the nonspams (10805680
against 10766650, under 0.4% more).  Next the test files were
classified:

for num in 0 1; do
  bogofilter -d normdb -Mv <../smindev/r$num.ns >r$num.ns.norm
  bogofilter -d normdb -Mv <../smindev/r$num.sp >r$num.sp.norm
  bogofilter -d htmldb -Ht -Mv <../smindev/r$num.ns >r$num.ns.html
  bogofilter -d htmldb -Ht -Mv <../smindev/r$num.sp >r$num.sp.html
done

Each output file has one line per message, so the line counts give the
numbers of messages in the r?.ns and r?.sp files:

wc -l *.norm
  13365 r0.ns.norm
   4732 r0.sp.norm
  13366 r1.ns.norm
   4730 r1.sp.norm

A leading 1 marks a message classified as spam, so such lines in the
.ns files are false positives.  There were some:

grep -c '^1' *.ns.*
r0.ns.html:5
r0.ns.norm:4
r1.ns.html:9
r1.ns.norm:6

And of course there were some false negatives (lines in the .sp files
without a leading 1):

grep -c -v '^1' *.sp.*
r0.sp.html:124
r0.sp.norm:140
r1.sp.html:130
r1.sp.norm:151
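
The two grep tallies can be gathered into one line per output file.  A
minimal sketch, assuming (as the grep commands above do) that a leading
1 marks a message classified as spam; tally is a hypothetical helper
name:

```shell
# tally FILE...: for each classifier output file, print the file name,
# the number of lines starting with "1" (flagged as spam) and the
# number of remaining lines.
tally() {
  for f in "$@"; do
    awk -v name="$f" '
      /^1/ { spam++ }
      END  { printf "%s flagged=%d other=%d\n", name, spam, NR - spam }
    ' "$f"
  done
}

# Flagged lines in .ns files are false positives; other lines in .sp
# files are false negatives.
# e.g. tally r0.ns.norm r0.sp.norm r0.ns.html r0.sp.html
```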

It seems that including contents of html tags shifts the distribution
of scores: at the same cutoff, false positives rose (5 and 9 against 4
and 6) while false negatives fell (124 and 130 against 140 and 151).
To compare the false-negative counts fairly, the spam cutoff must be
shifted so that both setups give roughly the same numbers of false
positives.  The default spam cutoff was 0.65, so the classification
with html tag contents was repeated with cutoff 0.75:

for num in 0 1; do
  bogofilter -d htmldb -Ht -o 0.75 -Mv <../smindev/r$num.ns >r$num.ns.html
  bogofilter -d htmldb -Ht -o 0.75 -Mv <../smindev/r$num.sp >r$num.sp.html
done

This did give comparable false-positive counts:

grep -c '^1' *.ns.*
r0.ns.html:4
r0.ns.norm:4
r1.ns.html:7
r1.ns.norm:6

With the false positives matched, the false-negative counts were
similar too:

grep -c -v '^1' *.sp.*
r0.sp.html:135
r0.sp.norm:140
r1.sp.html:148
r1.sp.norm:151
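
This note does not record how the 0.75 cutoff was arrived at.  One way
to look for a cutoff that matches the false-positive count is to repeat
the nonspam run over several candidates.  A minimal sketch using only
the options shown above; sweep_cutoffs and the candidate values are
illustrative assumptions:

```shell
# sweep_cutoffs FILE CUTOFF...: classify FILE against htmldb once per
# cutoff and print "cutoff count", where count is the number of
# messages flagged as spam (lines starting with "1", as in the grep
# commands above).
sweep_cutoffs() {
  testfile=$1
  shift
  for cutoff in "$@"; do
    flagged=$(bogofilter -d htmldb -Ht -o "$cutoff" -Mv <"$testfile" |
      awk '/^1/ { n++ } END { print n + 0 }')
    echo "$cutoff $flagged"
  done
}

# e.g. sweep_cutoffs ../smindev/r0.ns 0.65 0.70 0.75 0.80
```

Run on a nonspam test file, the second column is the false-positive
count at each cutoff.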

Once the shift in distribution is taken into account, including
contents of html tags did not significantly improve discrimination.  An
analysis of variance was run in R on pc, the sum of the false-positive
and false-negative percentages; it suggests that the difference is
probably insignificant statistically (p = 0.19 for the html factor), as
well as practically:

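# one row per (html setting, test run); fp and fn are the counts above
html <- data.frame(html=c("n","n","y","y"),
  run=c(0,1,0,1), fp=c(4,6,4,7), fn=c(140,151,135,148))
# numbers of nonspam and spam test messages in each run
html$ns <- c(13365, 13366, 13365, 13366)
html$sp <- c(4732, 4730, 4732, 4730)
# pc: false-positive percentage plus false-negative percentage
html$pc <- 100*html$fp/html$ns + 100*html$fn/html$sp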

print(html,digits=3)
  html run fp  fn    ns   sp   pc
1    n   0  4 140 13365 4732 2.99
2    n   1  6 151 13366 4730 3.24
3    y   0  4 135 13365 4732 2.88
4    y   1  7 148 13366 4730 3.18
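
The pc column is the sum of two percentages: false positives among the
nonspams and false negatives among the spams.  The same figure can be
cross-checked outside R; a minimal awk sketch, with combined_pc as a
hypothetical helper name:

```shell
# combined_pc FP NS FN SP: percentage of nonspams misclassified plus
# percentage of spams misclassified, mirroring html$pc in the R code.
combined_pc() {
  awk -v fp="$1" -v ns="$2" -v fn="$3" -v sp="$4" \
    'BEGIN { printf "%.2f\n", 100 * fp / ns + 100 * fn / sp }'
}

combined_pc 4 13365 140 4732   # run 0 without html tag contents
combined_pc 4 13365 135 4732   # run 0 with html tag contents
```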

html$html <- factor(html$html)
html$run <- factor(html$run)
summary(aov(pc ~ html+run, data=html))
            Df   Sum Sq  Mean Sq F value  Pr(>F)  
html         1 0.006529 0.006529  10.565 0.19001  
run          1 0.074874 0.074874 121.149 0.05768 .
Residuals    1 0.000618 0.000618                  
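
With 1 and 1 degrees of freedom, as in this table, an F variate is the
square of a Cauchy-distributed one, so Pr(>F) has the closed form
1 - (2/pi)*atan(sqrt(F)).  That shortcut holds only for this special
case; under that assumption the p-values can be cross-checked with awk
(p_f11 is a hypothetical helper name):

```shell
# p_f11 F: upper-tail probability that an F(1,1) variate exceeds F,
# computed as 1 - (2/pi) * atan(sqrt(F)).
# Valid only for 1 and 1 degrees of freedom.
p_f11() {
  awk -v f="$1" 'BEGIN {
    pi = atan2(0, -1)
    printf "%.4f\n", 1 - (2 / pi) * atan2(sqrt(f), 1)
  }'
}

p_f11 10.565    # html factor
p_f11 121.149   # run factor
```

With a single residual degree of freedom, even an F value above 100
falls short of the 0.05 level, which is why so small an experiment can
only hint at significance.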

-- 
| G r e g  L o u i s          | gpg public key: finger     |
|   http://www.bgl.nu/~glouis |   glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |