testing case-folding [was: Markup.]

Peter Bishop pgb at adelard.com
Mon May 12 10:09:33 CEST 2003


I tried to evaluate case folding using a kludge: I converted
the emails so that all-upper-case words were marked with a
prefix, e.g.

WORD to ZZ.WORD

So you get extra tokens when the email is registered,

i.e.
zz.word    (if all upper case)
word       (if not all upper case)
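
For anyone who wants to repeat this, the marking step could be sketched
roughly as below. This is only an illustration in Python, not the script
I actually ran: the function names are made up, the rule that a "word" is
a run of two or more ASCII letters is an assumption, and fold_and_tokenize
is a crude stand-in for a case-folding tokenizer, not bogofilter's real lexer.

```python
import re
from typing import List

# Assumption: only runs of two or more ASCII upper-case letters, standing
# alone as a word, count as "upper case text".
UPPER_WORD = re.compile(r"\b[A-Z]{2,}\b")


def mark_upper_case(text: str) -> str:
    """Prefix each all-upper-case word with 'ZZ.' so it survives case folding."""
    return UPPER_WORD.sub(lambda m: "ZZ." + m.group(0), text)


def fold_and_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for a case-folding tokenizer (lower-cases everything)."""
    return re.findall(r"[a-z]+(?:\.[a-z]+)*", text.lower())


if __name__ == "__main__":
    marked = mark_upper_case("Buy FREE pills, Free today")
    print(marked)
    print(fold_and_tokenize(marked))
```

After folding, the marked word comes out as zz.free while the mixed-case
and lower-case forms both come out as free, which is the token split
described above.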

Tests performed on bogofilter 0.9
1907 Training spam (3.gz from spamarchive.org)
1026 Training ham (some of my own emails)

3078 Test spam (2.gz from spamarchive.org)

Results
case-folded      19 false negatives
with UC words    18 false negatives

No false positives with either.

I am not too sure about these results though, as the total number
of messages processed (using formail to split the mailbox file)
differs for my kludged emails compared to the originals. 

The tests might be atypical too, as the proportion of false negatives is
rather low at about 0.6% (19 of 3078, with no tuning).
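
The percentages can be checked with a few lines of Python. This is just
arithmetic on the counts quoted above; it assumes both runs really saw all
3078 test messages, which (as noted) I am not certain of, and the helper
name is made up for the example.

```python
def false_negative_rate(missed: int, total: int) -> float:
    """Return the false-negative rate as a percentage of the spam test set."""
    return 100.0 * missed / total


# Assumption: both runs processed the full 3078-message test set.
TEST_SPAM = 3078

if __name__ == "__main__":
    for label, missed in (("case-folded", 19), ("with UC words", 18)):
        print(f"{label}: {false_negative_rate(missed, TEST_SPAM):.2f}%")
```

The two runs differ by a single message, so the rates are very close to
each other.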

The number of tokens in the database increases by about 13%
for both spam and nonspam databases.

Basically there seems to be little obvious benefit for the increase in 
database size.

But as I said, this is all a bit of a kludge; I would be happier if someone
did a proper trial (Greg Louis seems to have got it down to a fine art).
-- 
Peter Bishop 
pgb at adelard.com
pgb at csr.city.ac.uk





