Degeneration thought
David Relson
relson at osagesoftware.com
Thu Jun 5 17:26:25 CEST 2003
At 11:10 AM 6/5/03, Peter Bishop wrote:
>On 5 Jun 2003 at 10:26, David Relson wrote:
>
> > >a single count (implicitly the other two unspecified counts are zero).
> >
> > The difficulty with "up to" is that the wordlists already have a timestamp
> > that may (or may not) be present.
>
>Oh yes, I forgot about that.
>
>However we could put the timestamp immediately after the token, then have
>the counts as the final fields
>
>Should be OK as a token in the new database format will always have a
>timestamp (is that right?).
No. Life is never simple. Since timestamps are of form YYYYMMDD and
bogofilter came into the world on 20020820 (or so), bogofilter _could_
check the last number and decide if it's a timestamp or not.
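
As a rough illustration of that check, here is a minimal sketch in Python. It is not bogofilter's code: the function name and the upper bound are hypothetical, and the cutoff is the approximate 20020820 date mentioned above. A very large count could in principle be mistaken for a date, so this is a heuristic only.

```python
from datetime import datetime

# Assumption: approximate date bogofilter first appeared, as stated in the post.
EARLIEST = 20020820


def looks_like_timestamp(value: int, latest: int = 99991231) -> bool:
    """Return True if value is a plausible YYYYMMDD date on or after EARLIEST."""
    if not (EARLIEST <= value <= latest):
        return False
    try:
        # Rejects impossible dates such as month 13 or February 30.
        datetime.strptime(str(value), "%Y%m%d")
    except ValueError:
        return False
    return True
```

A trailing field of 20030605 would be treated as a timestamp, while a trailing field of 42 would be treated as a count.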
>[snip]
>
> > >The idea here is that these weird forms are fairly rare so they can be
> > >stored separately with no great storage cost - degeneration is to one of
> > >the standard forms only - not to any of the other weird formats.
> >
> > Time will tell if the weird formats are rare, or not. Certainly one lookup
> > for the three standard forms is a winning speed strategy.
>
>I am not sure how valid this is, but we could have a fourth count for *all*
>weird formats. So tHe, tHE, ThE etc all count as weird.
>This might result in higher pspam values as all weird forms are counted as
>one form, but this only happens if the token has several different weird
>formats (rare event?).
That's comparable to the "Anycase*foo" token of which PG writes.
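
To make the four-count idea concrete, here is a small Python sketch. It assumes the three standard forms are all-lowercase, Capitalized, and ALL-UPPERCASE (the thread does not spell them out), and the names below are hypothetical. Every other capitalization is lumped into a single "weird" slot under the case-folded token, so one lookup returns all four counts.

```python
from collections import defaultdict

# Slot indexes for the four counts kept per case-folded token.
# Assumption: these are the three "standard forms" plus one catch-all.
LOWER, CAPITALIZED, UPPER, WEIRD = range(4)


def case_form(token: str) -> int:
    """Classify a token's capitalization into one of the four slots."""
    if token == token.lower():
        return LOWER
    if token == token.upper():
        return UPPER
    if token == token.capitalize():
        return CAPITALIZED
    return WEIRD  # tHe, tHE, ThE, ... all land here


def tally(tokens):
    """Count tokens per case-folded key, one slot per case form."""
    counts = defaultdict(lambda: [0, 0, 0, 0])
    for token in tokens:
        counts[token.lower()][case_form(token)] += 1
    return dict(counts)
```

With this layout, "tHe", "tHE" and "ThE" each add one to the same fourth slot of the entry for "the", which is where the individual weird spellings are lost.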
>If there are no weird formats, the fourth count need not be stored (if you
>can do right truncation of count fields), so it might be quite space
>efficient.
>
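
The "right truncation" of count fields could look something like the following Python sketch. The space-separated text layout and the function names are assumptions made for illustration, not bogofilter's actual record format. Trailing zero counts are dropped when writing and padded back when reading.

```python
def pack_counts(counts):
    """Serialize counts as space-separated fields, dropping trailing zeros."""
    trimmed = list(counts)
    while trimmed and trimmed[-1] == 0:
        trimmed.pop()
    return " ".join(str(c) for c in trimmed)


def unpack_counts(text, width=4):
    """Parse fields written by pack_counts, padding missing counts with zero."""
    values = [int(field) for field in text.split()]
    return values + [0] * (width - len(values))
```

A token with no weird forms stores at most three fields, and a zero in the middle (for example counts of 1, 0, 0, 3) still has to be written out, since only trailing zeros can be dropped safely.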
>Information is lost if we do this, as we cannot regenerate the actual
>weird tokens and their counts (e.g. via bogoutil), e.g. we would have to
>output the token in a specific weird format, like "tHe", rather than the
>actual one - but maybe it does not matter too much - after all we lost
>information about formats in the original casefolded database.
>