Degeneration thought

David Relson relson at osagesoftware.com
Thu Jun 5 17:26:25 CEST 2003


At 11:10 AM 6/5/03, Peter Bishop wrote:
>On 5 Jun 2003 at 10:26, David Relson wrote:
>
> > >a single count (implicitly the other two unspecified counts are zero).
> >
> > The difficulty with "up to" is that the wordlists already have a timestamp
> > that may (or may not) be present.
>
>Oh yes, I forgot about that.
>
>However we could put the timestamp immediately after the token, then have
>the counts as the final fields
>
>Should be OK as a token in the new database format will always have a
>timestamp (is that right?).

No.  Life is never simple.  Since timestamps are of the form YYYYMMDD and
bogofilter came into the world on 20020820 (or so), bogofilter _could_
check the last number and decide whether it's a timestamp or a count.
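
As a rough sketch of that check (Python purely for illustration; the
function names are hypothetical and none of this is bogofilter code), the
last numeric field is treated as a timestamp only if it is a valid YYYYMMDD
date on or after the assumed birth date:

```python
from datetime import datetime

# Assumption: bogofilter's approximate birth date, as quoted in this message.
EARLIEST_TIMESTAMP = 20020820


def looks_like_timestamp(value: int) -> bool:
    """True if value is a valid YYYYMMDD date on or after EARLIEST_TIMESTAMP."""
    if value < EARLIEST_TIMESTAMP or value > 99991231:
        return False
    try:
        # Rejects impossible dates such as 20031345 or 20030230.
        datetime.strptime(str(value), "%Y%m%d")
    except ValueError:
        return False
    return True


def split_record(fields):
    """Split a record's numeric fields into (counts, timestamp or None)."""
    if fields and looks_like_timestamp(fields[-1]):
        return list(fields[:-1]), fields[-1]
    return list(fields), None
```

The heuristic is not airtight: a genuine count of twenty million or more
that happens to read as a valid date would be mistaken for a timestamp.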

>[snip]
>
> > >The idea here is that these weird forms are fairly rare so they can be
> > >stored separately with no great storage cost - degeneration is to one of
> > >the standard forms only - not to any of the other weird formats.
> >
> > Time will tell if the weird formats are rare, or not.  Certainly one lookup
> > for the three standard forms is a winning speed strategy.
>
>I am not sure how valid this is, but we could have a fourth count for *all*
>weird formats, so that tHe, tHE, ThE etc. all count as weird.
>This might result in higher pspam values, as all weird forms are counted as
>one form, but this only happens if the token has several different weird
>formats (rare event?).

That's comparable to the "Anycase*foo" token of which PG writes.
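
A minimal sketch of the four-count idea, in Python for illustration only.
It assumes the three standard forms are all-lowercase, Capitalized and
ALL-UPPERCASE (the thread does not spell them out here), keeps a single
count per form for simplicity, and uses hypothetical names throughout:

```python
from collections import defaultdict

# Assumed slot order: 0 = lower, 1 = Capitalized, 2 = UPPER, 3 = any weird mix.
LOWER, CAPITALIZED, UPPER, WEIRD = range(4)


def case_form(token: str) -> int:
    """Classify a token's capitalization into one of the four slots."""
    if token == token.lower():
        return LOWER
    if token == token.upper():
        return UPPER
    if token == token[0].upper() + token[1:].lower():
        return CAPITALIZED
    return WEIRD


class CaseCounts:
    """One record per casefolded token, holding four per-form counts."""

    def __init__(self):
        self.table = defaultdict(lambda: [0, 0, 0, 0])

    def add(self, token: str, n: int = 1) -> None:
        self.table[token.lower()][case_form(token)] += n

    def lookup(self, token: str) -> int:
        # A single lookup by casefolded key serves every form of the token.
        record = self.table.get(token.lower(), [0, 0, 0, 0])
        return record[case_form(token)]
```

With this layout, "tHe" and "ThE" both land in the fourth slot, so a lookup
of either returns their combined count.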

>If there are no weird formats, the fourth count need not be stored (if you
>can do right truncation of count fields), so it might be quite space
>efficient.
>
>Information is lost if we do this, as we cannot regenerate the actual
>weird tokens and their counts (e.g. via bogoutil); we would have to output the
>token in one specific weird format, like "tHe", rather than the actual one
>- but maybe it does not matter too much - after all, we lost information
>about formats in the original casefolded database,
>
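
For what it's worth, right truncation of the count fields could look
something like the sketch below (Python for illustration; the helper names
and the four-slot layout are assumptions, not an existing bogofilter
format). Trailing zero counts are dropped on write and padded back on read:

```python
# Assumed number of count slots: three standard forms plus one "weird" slot.
NUM_FORMS = 4


def truncate_counts(counts):
    """Drop trailing zero counts before storing a record."""
    out = list(counts)
    while out and out[-1] == 0:
        out.pop()
    return out


def expand_counts(stored):
    """Pad a stored record back out to NUM_FORMS counts with zeros."""
    return list(stored) + [0] * (NUM_FORMS - len(stored))
```

A token seen only in its first form is stored as a single number, and the
fourth count costs space only when a weird form has actually been seen.
Note that a variable number of counts is exactly what makes an optional
trailing timestamp ambiguous.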




