Degeneration thought

Thu Jun 5 17:10:32 CEST 2003

On 5 Jun 2003 at 10:26, David Relson wrote:

> >a single count (implicitly the other two unspecified counts are zero).
> 
> The difficulty with "up to" is that the wordlists already have a timestamp
> that may (or may not) be present.

Oh yes, I forgot about that.

However, we could put the timestamp immediately after the token, then have 
the counts as the final fields.

That should be OK, as a token in the new database format will always have a 
timestamp (is that right?).
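
As a rough sketch of that layout (Python, purely illustrative): the 
timestamp is the first field of the record and the counts follow, with 
trailing zero counts dropped on write and restored on read. The function 
names, the list representation and the default of three counts are my 
assumptions, not the actual wordlist format.

```python
def encode_record(timestamp, counts):
    """Timestamp first, then the counts with trailing zeros truncated."""
    trimmed = list(counts)
    while trimmed and trimmed[-1] == 0:
        trimmed.pop()
    return [timestamp] + trimmed


def decode_record(fields, n_counts=3):
    """Inverse of encode_record: pad the missing trailing counts with zero."""
    timestamp = fields[0]
    counts = list(fields[1:])
    counts += [0] * (n_counts - len(counts))
    return timestamp, counts


record = encode_record(20030605, [7, 0, 0])   # only one count is stored
restored = decode_record(record)              # the other two come back as zero
```

Because the timestamp is always present and always first, the number of 
remaining fields is enough to tell how many counts were truncated.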

[snip]

> >The idea here is that these weird forms are fairly rare so they can be
> >stored separately with no great storage cost - degeneration is to one of
> >the standard forms only - not to any of the other weird formats.
> 
> Time will tell if the weird formats are rare, or not.  Certainly one lookup
> for the three standard forms is a winning speed strategy.

I am not sure how valid this is, but we could have a fourth count for *all* 
weird formats, so tHe, tHE, ThE etc. all count as weird.
This might result in higher pspam values, as all weird forms are counted as 
one form, but this only happens if the token has several different weird 
formats (a rare event?).
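
A minimal sketch of the pooling idea (Python, illustrative only). I am 
assuming here that the three standard forms are all-lowercase, Capitalised 
and ALL-UPPERCASE, and that records are keyed on the casefolded token; 
case_form() and add_token() are made-up names.

```python
LOWER, CAPITALISED, UPPER, WEIRD = 0, 1, 2, 3


def case_form(token):
    """Return the index of the count that this spelling contributes to."""
    if token == token.lower():
        return LOWER
    if token == token.capitalize():
        return CAPITALISED
    if token == token.upper():
        return UPPER
    return WEIRD  # tHe, tHE, ThE, ... all share one pooled count


def add_token(db, token):
    """Increment the count for this token's case form in its folded record."""
    counts = db.setdefault(token.lower(), [0, 0, 0, 0])
    counts[case_form(token)] += 1


db = {}
for tok in ["the", "The", "THE", "tHe", "ThE"]:
    add_token(db, tok)
# db["the"] now holds one count per standard form and 2 in the weird slot
```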

If there are no weird formats, the fourth count need not be stored (if you 
can do right truncation of count fields), so it might be quite space 
efficient.

Information is lost if we do this, as we cannot regenerate the actual 
weird tokens and their counts (e.g. via bogoutil); we would have to output 
the token in a specific weird format, like "tHe", rather than the actual one.
But maybe it does not matter too much. After all, we lost information 
about formats in the original casefolded database.
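
To make the loss concrete, here is a sketch of what a dump would have to 
do (Python, illustrative only; this is not how bogoutil works). It assumes 
a record of four counts in the order lowercase, Capitalised, UPPERCASE, 
weird, and the stand-in spelling chosen by representative_weird() is an 
arbitrary convention of mine.

```python
def representative_weird(folded):
    """Pick one fixed stand-in spelling for the pooled weird count.

    Hypothetical convention: upper-case the second character only,
    e.g. "the" -> "tHe". Assumes the token has at least two letters.
    """
    return folded[0] + folded[1].upper() + folded[2:]


def dump(folded, counts):
    """List (spelling, count) pairs for every non-zero count in a record."""
    forms = [
        folded,
        folded.capitalize(),
        folded.upper(),
        representative_weird(folded),
    ]
    return [(form, n) for form, n in zip(forms, counts) if n > 0]


rows = dump("the", [1, 1, 1, 2])
# The weird count of 2 comes out under "tHe", whatever the original
# spellings were (they might have been "tHE" and "ThE").
```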

-- 
Peter Bishop 
pgb at adelard.com
pgb at csr.city.ac.uk