Re casefolding

Thu May 15 09:28:12 CEST 2003

On 14 May 2003 at 21:23, David Relson wrote:

> His tests and mine agree that _not_ folding case is good and that tagging 
> header lines is good.  My tests show that parsing the innards of A, FONT, 
> and IMG tags is _very_ good.  His tests indicate parsing those innards is 
> moot or slightly bad.
> 
> Given these results, the defaults are going to be "-PfHt" for now.  If 
> additional testing shows that 'T' is more often good than bad, this
> is likely to change.  (Hopefully I've written what I intended.  Today I've 
> been confusing a lot of my 0's and 1's, yes's and no's, ...)
> 

I had some trouble too.
H, T, and f are the new features,
while h, t, and F represent the previous bogofilter state.

Also, a lowercase f means uppercase tokens are kept,
and an uppercase F means tokens are folded to lowercase only.

It would have been less confusing to replace f with U, i.e.

U  uppercase tokens enabled
u  uppercase tokens disabled

Then the new features would be H, T, U
and the old features would be h, t, u.
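
To make the proposed renaming concrete, here is a minimal sketch in Python. The helper name to_proposed and the bare flag strings are hypothetical, used only for illustration; this is not bogofilter code, just the letter mapping described above (f becomes U, F becomes u, everything else is left alone).

```python
# Hypothetical helper illustrating the proposed renaming only:
#   f (uppercase tokens kept)   -> U
#   F (tokens folded to lower)  -> u
# Header (H/h) and tag (T/t) letters pass through unchanged.
RENAME = {"f": "U", "F": "u"}

def to_proposed(flags: str) -> str:
    """Rewrite a flag string from the f/F notation to the U/u notation."""
    return "".join(RENAME.get(ch, ch) for ch in flags)

print(to_proposed("fHt"))
print(to_proposed("Fht"))
```

With this mapping, an uppercase letter always means "new feature on" and a lowercase letter always means "previous behaviour".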

Anyway, recasting Greg Louis's results in that form,
we get:

Current settings:
4uht0    u    h    t   0 0.503836  3 51 4732 1.0777684
4uht1    u    h    t   1 0.503836  3 52 4730 1.0993658

Upper case tokens
0Uht0    U    h    t   0 0.500686  3 41 4732 0.8664413
0Uht1    U    h    t   1 0.500686  3 42 4730 0.8879493

Upper case + Headers
2UHt0    U    H    t   0 0.500021  3 32 4732 0.6762468
2UHt1    U    H    t   1 0.500021  3 37 4730 0.7822410

Headers
6uHt0    u    H    t   0 0.500011  3 31 4732 0.6551141
6uHt1    u    H    t   1 0.500011  3 33 4730 0.6976744

At first glance, the order of improvement in false negatives
(averaged over the two runs for each setting) is approximately:

options      fn (%)
none         1.09
upper        0.88
up + head    0.73
head         0.68

So we get roughly a 38% decrease in false negatives with headers alone,
and roughly a 19% decrease with upper case tokens alone.
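
The summary figures can be re-derived from the raw rows with a short Python sketch. It assumes that the two integer columns before the final percentage are the false-negative count and the number of messages tested, which is consistent with the last column (for example 51/4732 = 1.0777684%). The names runs and mean_fn_percent are mine. Hand-rounded figures may differ from the computed ones in the last digit.

```python
# (false negatives, messages tested) copied from the table above,
# two runs per configuration. The column meaning is an assumption
# supported by the final column being 100 * fn / total.
runs = {
    "none":      [(51, 4732), (52, 4730)],
    "upper":     [(41, 4732), (42, 4730)],
    "up + head": [(32, 4732), (37, 4730)],
    "head":      [(31, 4732), (33, 4730)],
}

def mean_fn_percent(pairs):
    """Average false-negative rate (in percent) over the given runs."""
    return sum(100.0 * fn / total for fn, total in pairs) / len(pairs)

rates = {name: mean_fn_percent(pairs) for name, pairs in runs.items()}
base = rates["none"]

for name, rate in rates.items():
    # Relative decrease in false negatives compared with the current settings.
    decrease = 100.0 * (base - rate) / base
    print(f"{name:10s} fn = {rate:.2f}%  decrease vs none = {decrease:.1f}%")
```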

The difference between "head" and "up + head" is marginal, but
I would imagine that processing headers adds fewer tokens
to the database than allowing upper case tokens.

So on database size grounds, there may be an argument for
just enabling headers.

-- 
Peter Bishop 
Adelard and Centre for Software Reliability, City University
Drysdale Building, 10 Northampton Square, London, EC1V 0HB
Tel: +44-20-7490-9467, Fax: +44-20-7490-9451
pgb at adelard.com, http://www.adelard.com/
pgb at csr.city.ac.uk, http://www.city.ac.uk/