Paul Graham's suggested refinements: recommend adoption as defaults

Sun May 18 17:05:25 CEST 2003

David and I have been testing the following three refinements to
bogofilter's message classification process, all suggested by Paul
Graham (see http://paulgraham.com/better.html):

1.  Preserve case in tokens.
2.  Tag tokens from To:, From:, Subject: and Return-Path: headers to
    distinguish them from tokens found in the body.
3.  Extract tokens from HTML A, IMG and FONT tags (Paul says these
    are the useful ones).
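
To make the three refinements concrete, here is a minimal Python
sketch of a tokenizer that applies all of them.  It is not
bogofilter's lexer (which is written in C): the prefix strings, the
token pattern and the function names are assumptions chosen for
illustration only.

```python
import re
from html.parser import HTMLParser

# Hypothetical prefixes; bogofilter's actual tag strings may differ.
HEADER_PREFIXES = {
    "to": "to:",
    "from": "from:",
    "subject": "subj:",
    "return-path": "rtrn:",
}
USEFUL_TAGS = {"a", "img", "font"}

# Simplified token pattern (an assumption, not bogofilter's grammar).
TOKEN_RE = re.compile(r"[A-Za-z0-9'$-]+")


class _TagScanner(HTMLParser):
    """Collects body text plus attribute values of A, IMG and FONT tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag names already lowercased.
        if tag in USEFUL_TAGS:
            self.chunks.extend(value for _, value in attrs if value)

    def handle_data(self, data):
        self.chunks.append(data)


def tokenize(headers, body):
    """headers: list of (name, value) pairs; body: message body text."""
    tokens = []
    for name, value in headers:
        prefix = HEADER_PREFIXES.get(name.lower(), "")
        # Refinements 1 and 2: no case folding, tagged header tokens.
        tokens.extend(prefix + word for word in TOKEN_RE.findall(value))
    scanner = _TagScanner()
    scanner.feed(body)
    scanner.close()
    # Refinement 3: contents of A, IMG and FONT tags become tokens too.
    for chunk in scanner.chunks:
        tokens.extend(TOKEN_RE.findall(chunk))
    return tokens


tokens = tokenize(
    [("Subject", "FREE offer")],
    '<a href="http://example.com/buy">Click</a> free',
)
```

With this sketch, "FREE" in the Subject: line and "free" in the body
end up as two distinct tokens, which is the point of refinements 1
and 2.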

My bogofilter web site (http://www.bgl.nu/bogofilter/graham.html)
reports four separate experiments, each applying the same factorial
design to a different message corpus.
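
As a rough illustration of what a factorial design over three on/off
refinements looks like, the sketch below enumerates every combination
of settings.  It assumes a full two-level design (2 x 2 x 2 = 8 runs
per corpus); the report itself is the authority on the layout actually
used, and the factor names here are made up for the example.

```python
from itertools import product

# Hypothetical factor names, one per refinement.
FACTORS = ("preserve_case", "tag_headers", "html_tag_contents")


def full_factorial(factors=FACTORS):
    """Return every on/off combination of the given factors."""
    return [
        dict(zip(factors, levels))
        for levels in product((False, True), repeat=len(factors))
    ]


runs = full_factorial()
for run in runs:
    print(run)
```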

To quote the conclusion of that report:

Each of Paul Graham's three suggestions (preserve case, tag headers,
use contents of A, IMG and FONT tags) was beneficial, although in some
cases not strongly so, in each of the four experiments performed.  It
would seem reasonable to make case preservation, header tagging and
html tag content extraction the defaults for future bogofilter
versions.

(end quote)

For anyone worried about database size: with all three modifications
enabled, my training database grew by 20%.  Most of that growth came
from the extra tokens created by header tagging; about a third came
from case preservation, and roughly an eighth from HTML tag contents
(YMMV, of course).
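
For a feel of what those shares mean in percentage points, here is a
small worked calculation.  It assumes the three contributions add up
to the whole 20% growth and treats "about a third" and "roughly an
eighth" as exact fractions, so the results are only approximate.

```python
from fractions import Fraction

total_growth = Fraction(20, 100)   # database grew by 20%
case_share = Fraction(1, 3)        # "about a third"
html_share = Fraction(1, 8)        # "roughly an eighth"
# Assumption: header tagging accounts for everything that is left.
header_share = 1 - case_share - html_share


def points(share):
    """Share of the growth, expressed in percentage points of db size."""
    return float(share * total_growth * 100)


for label, share in (
    ("header tags", header_share),
    ("case preservation", case_share),
    ("html tag contents", html_share),
):
    print(f"{label}: {points(share):.1f} points")
```

Under those assumptions header tagging accounts for 13/24 of the
growth, a little over half, which is consistent with "most".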

-- 
| G r e g  L o u i s          | gpg public key: finger     |
|   http://www.bgl.nu/~glouis |   glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |