Re casefolding

Greg Louis glouis at dynamicro.on.ca
Thu May 15 13:49:03 CEST 2003


On 20030515 (Thu) at 0828:12 +0100, Peter Bishop wrote:
> On 14 May 2003 at 21:23, David Relson wrote:
> 
> > His tests and mine agree that _not_ folding case is good and that tagging 
> > header lines is good.  My tests show that parsing the innards of A, FONT, 
> > and IMG tags is _very_ good.  His tests indicate parsing those innards is 
> > moot or slightly bad.
> > 
> > Given these results, the defaults are going to be "-PfHt" for now.  If 
> > additional testing shows that 'T' is more often good than bad, this
> > is likely to change.  (Hopefully I've written what I intended.  Today I've 
> > been confusing a lot of my 0's and 1's, yes's and no's, ...)
> > 

> At first glance this seems to be the order of improvement
> in false negatives, i.e. approx:
> 
> options       fn (%)
> none          1.08
> upper         0.88
> up + head     0.73
> head          0.68
> 
> So we get a 37% decrease in false negatives just with headers,
> and an 18% decrease with upper case tokens.
> 
> The difference between "head" and "up+head" is marginal, but
> I would imagine that processing headers would add fewer tokens
> to the database than allowing upper case tokens. 
> 
> So on database size grounds maybe there is an argument for
> just enabling headers
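
For anyone who wants to recheck the percentages quoted above, here is a
minimal sketch in Python.  It uses only the four fn figures from the
table; the names fn_rates and relative_decrease are mine, chosen for
illustration, and have nothing to do with bogofilter itself.

```python
# False-negative rates (%) copied from the quoted table.
fn_rates = {"none": 1.08, "upper": 0.88, "up + head": 0.73, "head": 0.68}


def relative_decrease(baseline, value):
    """Return the reduction of value relative to baseline, in percent."""
    return (baseline - value) / baseline * 100.0


# Compare every option against the "none" baseline.
baseline = fn_rates["none"]
decreases = {
    name: relative_decrease(baseline, rate)
    for name, rate in fn_rates.items()
    if name != "none"
}

for name, pct in decreases.items():
    print(f"{name:10s} {pct:5.1f}% fewer false negatives than 'none'")
```

The "head" row comes out at about 37% and the "upper" row at between
18% and 19%, in line with the figures quoted.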

That slightly better result for header tagging with case folding
doesn't hold up in other experiments, where tagging plus folding
is less effective than tagging alone, by five to ten percent.

Here are some database sizes:

0fht:
total 19476
-rw-r--r--    1 root     root     10264576 May 14 12:24 goodlist.db
-rw-r--r--    1 root     root      9646080 May 14 12:24 spamlist.db
4Fht:
total 18424
-rw-r--r--    1 root     root      9695232 May 14 13:26 goodlist.db
-rw-r--r--    1 root     root      9138176 May 14 13:26 spamlist.db

Without folding, both lists grow by just under six percent.  If this is
consistent, it's not a major problem for most people, I should imagine.
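
The growth figure can be rechecked from the byte counts in the listings
above.  This is only a sketch: it measures the larger set (0fht)
against the smaller one (4Fht), on the assumption that those are the
unfolded and folded runs respectively, and the names sizes and
growth_percent are made up for the example.

```python
# File sizes in bytes, copied from the two ls listings.
sizes = {
    "0fht": {"goodlist.db": 10264576, "spamlist.db": 9646080},
    "4Fht": {"goodlist.db": 9695232, "spamlist.db": 9138176},
}


def growth_percent(larger, smaller):
    """Return how much bigger `larger` is than `smaller`, in percent."""
    return (larger - smaller) / smaller * 100.0


# Assumption: 0fht is the run without folding, 4Fht the run with it.
growth = {
    name: growth_percent(sizes["0fht"][name], sizes["4Fht"][name])
    for name in sizes["0fht"]
}

for name, pct in growth.items():
    print(f"{name}: {pct:.2f}% larger")
```

Both files come out between 5.5% and 6% larger in the 0fht set.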

-- 
| G r e g  L o u i s          | gpg public key: finger     |
|   http://www.bgl.nu/~glouis |   glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |