Degeneration thought

Peter Bishop pgb at adelard.com
Thu Jun 5 09:08:36 CEST 2003


Re Paul Graham's degeneration suggestion, could we have database entries
that look like this?

free 0 10 20
fRee 10
more 20
opportunity 0 50
oPPortunity 1

The idea here is:

1) We have a casefolded token for "standard" case options:
e.g.
free, Free, FREE
and the count fields in the db line correspond to these case formats
(lower, firstcap, allcap, in that order), e.g.
free 0 10 20
is equivalent to:
free 0
Free 10
FREE 20

2) If the case format is non-standard, e.g. fRee, there is a separate entry,
e.g.
fRee 10

3) To save space in the db, there is no need to include trailing zero
count fields.
If the token only occurs in lower case, there is just one count,
e.g.
more 20

while

opportunity 0 50

means that the token "Opportunity" occurs 50 times.
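
To make the entry format concrete, here is a rough sketch in Python of
writing and reading such db lines. It is only an illustration of the
format described above, not bogofilter code: the function names
encode_entry and decode_entry are made up, and I assume that a token
that is already lower case is a casefolded entry with up to three counts,
while any other token is a non-standard entry with a single count.

```python
# Hypothetical helpers illustrating the proposed entry format.
# Count order for casefolded entries: lower, firstcap, allcap.

def encode_entry(token, counts):
    """Format one db line, dropping trailing zero count fields."""
    counts = list(counts)
    # Always keep at least one count field.
    while len(counts) > 1 and counts[-1] == 0:
        counts.pop()
    return " ".join([token] + [str(c) for c in counts])


def decode_entry(line):
    """Parse one db line back into (token, counts)."""
    token, *fields = line.split()
    counts = [int(f) for f in fields]
    if token == token.lower():
        # Casefolded entry: restore the omitted trailing zeros.
        counts += [0] * (3 - len(counts))
    return token, tuple(counts)
```

With this sketch, encode_entry("opportunity", [0, 50, 0]) gives
"opportunity 0 50", and decode_entry("more 20") gives
("more", (20, 0, 0)).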

With this database structure we degenerate as follows:

1) determine the case format (lower, firstcap, allcap, other)
2) If other - look for the precise token e.g. frEE
    - if you cannot find it, casefold the token, e.g. to "free"
    - if it exists use the max (or average?) of the non-zero counts in "free"
3) if lower, firstcap, allcap,
   - casefold the token
   - look up the count for the specific format
   - if the count is non-zero use it
   - otherwise use the max (or average?) count in "free"
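
The steps above could be sketched roughly as follows in Python. This is
an illustration under assumptions, not bogofilter code: the db is
modelled as a plain dict from token to a tuple of counts, the names
case_format, fallback and degenerate_count are made up, the fallback
uses the max rather than the average, and a token with no letters or
only one upper-case letter is classified by the first test that matches.

```python
# Hypothetical sketch of the proposed degeneration lookup.
LOWER, FIRSTCAP, ALLCAP, OTHER = 0, 1, 2, 3


def case_format(token):
    """Step 1: classify the token's case format."""
    if token == token.lower():
        return LOWER
    if token == token.upper():
        return ALLCAP
    if token == token.capitalize():
        return FIRSTCAP
    return OTHER


def fallback(counts):
    """Max of the non-zero counts, or 0 if there are none."""
    nonzero = [c for c in counts if c > 0]
    return max(nonzero) if nonzero else 0


def degenerate_count(db, token):
    fmt = case_format(token)
    folded = token.lower()
    if fmt == OTHER:
        # Step 2: first lookup is the precise token.
        exact = db.get(token)
        if exact is not None:
            return exact[0]
        # Second lookup: the casefolded entry.
        return fallback(db.get(folded, ()))
    # Step 3: a single lookup of the casefolded entry.
    counts = db.get(folded, ())
    # Trailing zero fields may have been omitted from the entry.
    specific = counts[fmt] if fmt < len(counts) else 0
    return specific if specific else fallback(counts)


# The example entries from the top of this message.
db = {
    "free": (0, 10, 20),
    "fRee": (10,),
    "more": (20,),
    "opportunity": (0, 50),
    "oPPortunity": (1,),
}
```

For example, degenerate_count(db, "Free") returns 10 directly, while
degenerate_count(db, "frEE") finds no exact entry and falls back to the
max of the counts in "free", which is 20.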

I think this db format is a fairly efficient way of implementing 
degeneration.
 - for the standard case formats there is only one db lookup, 
 - for nonstandard formats there are at most two lookups
 - with only lowercase tokens the database requires no more space than the 
old casefolded database
 - with all case formats for all tokens, the size only doubles

The only question is whether db3 allows a variable number of fields in db
entries. I think it does: I accidentally used a new bogofilter to
update an old database (with no date field), and bogofilter and bogoutil
seemed to work quite happily with the db even though some entries had a
count + a date while others only had a count.


-- 
Peter Bishop 
pgb at adelard.com
pgb at csr.city.ac.uk





