header tagging helps a bit
Greg Louis
glouis at dynamicro.on.ca
Sun Sep 28 01:27:15 CEST 2003
Hi, everybody:
Current bogofilter, if told to do so, tags tokens in From:, To:, and a
few other headers specifically. Version 0.15.4 also tags the remaining
headers with the generic prefix head:, and David and I have been working
on evaluating (1) whether this is worth its associated overhead ("le
calembour, c'est la fiente de l'esprit qui vole" -- the pun's the
dropping of the flying spirit -- Victor Hugo), and (2) how to put it
into a production system without rebuilding the training db.
Executive summary: (1) probably; (2) start tagging now but don't
actually use the tags till there are enough of them. In the meantime,
evaluate header tokens by combining tagged and untagged token counts.
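
To make "combining tagged and untagged token counts" concrete, here is a
minimal sketch in R. It is an illustration only, not bogofilter's actual
code: the counts table, the token names and the header_counts() helper
are all hypothetical.

```r
# Hypothetical token counts from a training db (made-up numbers).
counts <- data.frame(
  token   = c("head:mailer", "mailer", "head:priority", "priority"),
  spam    = c(3, 40, 0, 12),
  nonspam = c(1, 25, 0, 30),
  stringsAsFactors = FALSE
)

# Look up the counts for a word seen in a generic header.
# degen = TRUE : sum the counts of "head:<word>" and plain "<word>"
# degen = FALSE: use the "head:<word>" counts only
header_counts <- function(word, counts, degen = TRUE) {
  keys <- paste0("head:", word)
  if (degen) keys <- c(keys, word)
  hit <- counts[counts$token %in% keys, c("spam", "nonspam")]
  colSums(hit)
}

header_counts("mailer", counts, degen = TRUE)   # spam 43, nonspam 26
header_counts("mailer", counts, degen = FALSE)  # spam 3, nonspam 1
```

With few tagged messages in the db, the tagged-only lookup has very
little to go on, which is why the combined lookup is the safer choice
early on.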
The description that follows can actually be read into, and executed
within, an R session:
---8<--
# Three training db were produced from a corpus of 22283 spam and 18549
# nonspam. The db were built in directories named all, half and none,
# the names reflecting the portion of the messages that were registered
# with head: tagging.
#
# A second corpus of 16273 spam and 9659 nonspam was divided into three
# files containing spam and three containing nonspam, in the usual way.
# These were classified with what has been called "degeneration"
# (meaning, in this case, that header tokens were evaluated by combining
# tagged and untagged token counts) and without it (header tokens were
# evaluated with head: counts only). The classification was performed
# once with each of the three training databases; in factorial terms, we
# have
#
# training database: all, half, none
# degeneration: yes, no
# replication: 0, 1, 2
#
# Because any change of this kind alters the distribution of spam scores,
# it's important to take both fp and fn into account; just to keep things
# simple, I used %fn + %fp as the error variable. It's ok to do that if
# the variation in fp within a given run is small, as is generally the
# case in the present experiment. The raw numbers are given in the
# following table:
#
#   database -->          all         half         none
#                      ---------------------------------
#   degeneration        fn   fp     fn   fp     fn   fp
#        |
#        v        0     31  172     27  178     33  190
#       yes       1     39  179     37  184     41  183
#                 2     34  172     32  172     38  176
#
#                 0     33  171     31  174     49  185
#       no        1     41  176     40  183     51  178
#                 2     36  175     34  177     53  176
#
# the means of %fn + %fp are ("degen" = with degeneration, "tag" =
# head: counts only):
#   all/degen    4.29
#   all/tag      4.35
#   half/degen   4.28
#   half/tag     4.37
#   none/degen   4.53
#   none/tag     4.90
hdrtag <- read.table("/root/hdrtag.tbl")
# error variable: %fn + %fp; each run covers one third of the test corpus
hdrtag$percent <- with(hdrtag, fn*100/(9659/3) + fp*100/(16273/3))
# print(hdrtag, digits=3)
#      db degen run fn  fp percent
# 1   all   yes   0 31 172    4.13
# 2   all   yes   1 39 179    4.51
# 3   all   yes   2 34 172    4.23
# 4   all    no   0 33 171    4.18
# 5   all    no   1 41 176    4.52
# 6   all    no   2 36 175    4.34
# 7  half   yes   0 27 178    4.12
# 8  half   yes   1 37 184    4.54
# 9  half   yes   2 32 172    4.16
# 10 half    no   0 31 174    4.17
# 11 half    no   1 40 183    4.62
# 12 half    no   2 34 177    4.32
# 13 none   yes   0 33 190    4.53
# 14 none   yes   1 41 183    4.65
# 15 none   yes   2 38 176    4.42
# 16 none    no   0 49 185    4.93
# 17 none    no   1 51 178    4.87
# 18 none    no   2 53 176    4.89
hdrtag$run <- factor(hdrtag$run)
summary(aov(percent ~ db + degen + run, data=hdrtag))
#             Df  Sum Sq Mean Sq F value    Pr(>F)
# db           2 0.62235 0.31118 18.2774 0.0002279 ***
# degen        1 0.13116 0.13116  7.7039 0.0167870 *
# run          2 0.25229 0.12614  7.4092 0.0080257 **
# Residuals   12 0.20430 0.01703
# We conclude:
# 1. Tagging makes a small but statistically significant improvement in
# the error rate (%fn + %fp).
# 2. As expected, the benefit of degeneration vanishes as the training
# database acquires significant counts.
# It follows that bogoadmins should implement tagging-on-registration
# immediately and tagging-on-classification once the number of tagged
# messages in the training database has become adequate to support it.
---8<--
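
For anyone without access to /root/hdrtag.tbl, the following is a
self-contained sketch that rebuilds the data frame from the counts
printed above and reruns the analysis. The fn value for half/yes/run 0
is taken as 27, as in the printed data frame, since that is the value
that reproduces the 4.12 shown there. The last model, with a db:degen
interaction term, is my own addition and not part of the original
analysis; it is one possible way to look at whether the effect of
degeneration depends on the database.

```r
# Counts transcribed from the post: 3 databases x 2 degeneration settings x 3 runs.
hdrtag <- data.frame(
  db    = factor(rep(c("all", "half", "none"), each = 6)),
  degen = factor(rep(rep(c("yes", "no"), each = 3), times = 3)),
  run   = factor(rep(0:2, times = 6)),
  fn    = c(31, 39, 34, 33, 41, 36,
            27, 37, 32, 31, 40, 34,
            33, 41, 38, 49, 51, 53),
  fp    = c(172, 179, 172, 171, 176, 175,
            178, 184, 172, 174, 183, 177,
            190, 183, 176, 185, 178, 176)
)

# Same error variable as in the post: %fn + %fp, with each run
# covering one third of the test corpus.
hdrtag$percent <- with(hdrtag, fn * 100 / (9659 / 3) + fp * 100 / (16273 / 3))

# Mean error for each database/degeneration combination.
cell_means <- aggregate(percent ~ db + degen, data = hdrtag, FUN = mean)
print(cell_means, digits = 3)

# Additive model, as in the post.
summary(aov(percent ~ db + degen + run, data = hdrtag))

# Assumption (not in the post): add the db:degen interaction.
summary(aov(percent ~ db * degen + run, data = hdrtag))
```

The cell means should agree with the six means listed in the script
(4.29, 4.35, 4.28, 4.37, 4.53, 4.90).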
--
| G r e g L o u i s | gpg public key: 0x400B1AA86D9E3E64 |
| http://www.bgl.nu/~glouis | (on my website or any keyserver) |
| http://wecanstopspam.org in signatures helps fight junk email. |
Header information for this message:
Subject: header tagging helps a bit
To: bogofilter <bogofilter at aotto.com>
From: Greg Louis <glouis at dynamicro.on.ca>