Re casefolding
Peter Bishop
pgb at adelard.com
Thu May 15 09:28:12 CEST 2003
On 14 May 2003 at 21:23, David Relson wrote:
> His tests and mine agree that _not_ folding case is good and that tagging
> header line is good. My tests show that parsing the innards of A, FONT,
> and IMG tags is _very_ good. His tests indicate parsing those innards is
> moot or slightly bad.
>
> Given these results, the defaults are going to be "-PfHt" for now. If
> additional testing shows that 'T' is more often good than bad, this
> is likely to change. (Hopefully I've written what I intended. Today I've
> been confusing a lot of my 0's and 1's, yes's and no's, ...)
>
I had some trouble too.
H, T, and f are the new features,
while
h, t, and F represent the previous bogofilter state.
Also, a lower case f means you can have upper case tokens,
and an upper case F means you can only have lower case tokens.
It would have been less confusing to replace f by U,
i.e.
U upper case tokens enabled
u upper case tokens disabled
Then the new features would be H, T, U
and the old features would be h, t, u.
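To make the proposed renaming concrete, here is a minimal Python sketch. The helper name recast is hypothetical and is not part of bogofilter; it only swaps the f/F pair for the suggested U/u letters and leaves every other character alone.

```python
# Hypothetical helper (not a bogofilter function): recast option letters
# into the proposed notation, where upper case always means "new feature on".
#   f (upper case tokens allowed) -> U
#   F (lower case tokens only)    -> u
RECAST = str.maketrans({"f": "U", "F": "u"})

def recast(flags: str) -> str:
    """Return the option string with f/F replaced by U/u; H, h, T, t are unchanged."""
    return flags.translate(RECAST)

print(recast("fHt"))
print(recast("Fht"))
```

With this mapping, an all-upper-case string means every new feature is on and an all-lower-case string means the previous behaviour.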
Anyway, recasting Greg Louis's results in that form,
we get:
Current settings:
4uht0 u h t 0 0.503836 3 51 4732 1.0777684
4uht1 u h t 1 0.503836 3 52 4730 1.0993658
Upper case tokens
0Uht0 U h t 0 0.500686 3 41 4732 0.8664413
0Uht1 U h t 1 0.500686 3 42 4730 0.8879493
Upper case + Headers
2UHt0 U H t 0 0.500021 3 32 4732 0.6762468
2UHt1 U H t 1 0.500021 3 37 4730 0.7822410
Headers
6uHt0 u H t 0 0.500011 3 31 4732 0.6551141
6uHt1 u H t 1 0.500011 3 33 4730 0.6976744
At first glance this seems to be the order of improvement
in false negatives, i.e. approx:
options     fn (%)
none        1.08
upper       0.88
up + head   0.73
head        0.68
So we get a 37% decrease in false negatives with headers alone,
and an 18% decrease with upper case tokens alone.
The difference between "head" and "up + head" is marginal, but
I would imagine that processing headers adds fewer tokens
to the database than allowing upper case tokens does.
So, on database size grounds, there may be an argument for
enabling headers only.
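
The approximate figures above can be rechecked from the raw rows. The sketch below is only an illustration: it assumes the two integer columns before the final percentage are the false-negative count and the message total (which matches the last column, e.g. 51/4732 = 1.0778%), and it averages the two runs of each configuration. All names in it are made up for the example, and the exact results may differ by about a point from the rounded values quoted above.

```python
# (false negatives, messages) for the two runs of each configuration,
# copied from the results rows; the column meaning is an assumption.
runs = {
    "none":      [(51, 4732), (52, 4730)],
    "upper":     [(41, 4732), (42, 4730)],
    "up + head": [(32, 4732), (37, 4730)],
    "head":      [(31, 4732), (33, 4730)],
}

def mean_fn_percent(pairs):
    """Average false-negative rate, in percent, over the given runs."""
    return sum(100.0 * fn / total for fn, total in pairs) / len(pairs)

def relative_decrease(base, new):
    """Percentage by which `new` is lower than `base`."""
    return 100.0 * (base - new) / base

rates = {name: mean_fn_percent(pairs) for name, pairs in runs.items()}
for name, rate in rates.items():
    drop = relative_decrease(rates["none"], rate)
    print(f"{name:10s} fn {rate:.2f}%  decrease vs none {drop:.0f}%")
```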
--
Peter Bishop
Adelard and Centre for Software Reliability, City University
Drysdale Building, 10 Northampton Square, London, EC1V 0HB
Tel: +44-20-7490-9467, Fax: +44-20-7490-9451
pgb at adelard.com, http://www.adelard.com/
pgb at csr.city.ac.uk, http://www.city.ac.uk/