Degeneration thought
Peter Bishop
pgb at adelard.com
Thu Jun 5 17:10:32 CEST 2003
On 5 Jun 2003 at 10:26, David Relson wrote:
> >a single count (implicitly the other two unspecified counts are zero).
>
> The difficulty with "up to" is that the wordlists already have a timestamp
> that may (or may not) be present.
Oh yes, I forgot about that.
However, we could put the timestamp immediately after the token and then have
the counts as the final fields.
That should be OK, as a token in the new database format will always have a
timestamp (is that right?).
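
To make the layout concrete, here is a rough sketch of a record value with the timestamp first and the counts last, where trailing zero counts are simply not stored. It is only an illustration, not the actual bogofilter format: the function names, the 32-bit little-endian fields and the YYYYMMDD timestamp are all my assumptions.

```python
import struct

# Hypothetical record value: timestamp first, then up to n_counts counts.
# Assumed encoding (not bogofilter's real one): unsigned 32-bit little-endian.

def pack_record(timestamp, counts):
    """Pack timestamp + counts, dropping trailing zero counts (right truncation)."""
    counts = list(counts)
    while counts and counts[-1] == 0:
        counts.pop()
    fmt = "<" + "I" * (1 + len(counts))
    return struct.pack(fmt, timestamp, *counts)

def unpack_record(data, n_counts=3):
    """Unpack a record; counts that were truncated away read back as zero."""
    n_fields = len(data) // 4
    fields = struct.unpack("<" + "I" * n_fields, data)
    timestamp, counts = fields[0], list(fields[1:])
    counts += [0] * (n_counts - len(counts))
    return timestamp, counts
```

Because the timestamp is always present and always first, a reader can tell how many counts were stored from the record length alone, so a token with a single non-zero count costs just two fields.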
[snip]
> >The idea here is that these weird forms are fairly rare so they can be
> >stored separately with no great storage cost - degeneration is to one of
> >the standard forms only - not to any of the other weird formats.
>
> Time will tell if the weird formats are rare, or not. Certainly one lookup
> for the three standard forms is a winning speed strategy.
I am not sure how valid this is, but we could have a fourth count for *all*
weird formats. So tHe, tHE, ThE etc. all count as weird.
This might result in higher pspam values, as all weird forms are counted as
one form, but this only happens if the token has several different weird
formats (rare event?).
If there are no weird formats, the fourth count need not be stored (if you
can do right truncation of count fields), so it might be quite space
efficient.
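
A minimal sketch of the fourth-count idea, assuming the three standard forms are all lower case, capitalised and all upper case (the thread does not spell them out here, so that is my reading). The names case_form and count_tokens are made up for the example.

```python
from collections import defaultdict

# Assumed slots: 0 = lower ("the"), 1 = capitalised ("The"),
# 2 = upper ("THE"), 3 = any other mix ("tHe", "tHE", "ThE", ...).

def case_form(token):
    """Return the count slot for a token's case form."""
    if token == token.lower():
        return 0
    if token == token.capitalize():
        return 1
    if token == token.upper():
        return 2
    return 3  # every weird format shares the fourth count

def count_tokens(tokens):
    """Accumulate four counts per casefolded token."""
    table = defaultdict(lambda: [0, 0, 0, 0])
    for tok in tokens:
        table[tok.lower()][case_form(tok)] += 1
    return dict(table)
```

With this, count_tokens(["the", "The", "THE", "tHe", "ThE"]) gives {"the": [1, 1, 1, 2]}: a single lookup on the casefolded key returns all four counts, and the last slot is zero (so it can be truncated) for any token never seen in a weird format.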
Information is lost if we do this, as we cannot regenerate the actual
weird tokens and their counts (e.g. via bogoutil); we would have to output the
token in a specific weird format, like "tHe", rather than the actual one
- but maybe it does not matter too much. After all, we lost information
about formats in the original casefolded database.
--
Peter Bishop
pgb at adelard.com
pgb at csr.city.ac.uk