Degeneration thought

David Relson relson at osagesoftware.com
Thu Jun 5 16:26:59 CEST 2003


At 10:04 AM 6/5/03, Peter Bishop wrote:
>On 5 Jun 2003 at 8:04, David Relson wrote:
>
> > For a token, there are three common forms - all lower case, all upper case,
> > and the standard capitalized form.  Rather than store all three of these
> > forms in the wordlist, store the lower case form with 3 counts.
>
>Yes,
>
>Though to save space I suggest *up to* three counts (depending which forms
>have been found) so if only the lower case form has been seen you only need
>a single count (implicitly the other two unspecified counts are zero).

The difficulty with "up to" is that wordlist records already contain a
timestamp that may (or may not) be present; a variable number of counts
would add a second optional field to the record format.
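As a rough sketch of the proposed scheme (names and data layout here are
illustrative, not bogofilter's actual code): one record per casefolded
token holds three counts for the standard casings, and any other casing
gets its own exact-match record.

```python
def standard_form_index(token):
    """Return 0/1/2 if token is a standard casing of its lowercase base
    (all-lower / Capitalized / ALL-UPPER), else None for a "weird" form."""
    if token == token.lower():
        return 0          # e.g. "the"
    if token == token.upper():
        return 2          # e.g. "THE"
    if token == token.capitalize():
        return 1          # e.g. "The"
    return None

def add_token(wordlist, token):
    """Count a token: standard forms share one 3-count record keyed by
    the lowercase form; weird forms are stored exactly, with one count."""
    idx = standard_form_index(token)
    if idx is None:
        wordlist[token] = wordlist.get(token, 0) + 1
    else:
        counts = wordlist.setdefault(token.lower(), [0, 0, 0])
        counts[idx] += 1
```

So after seeing "the", "The", "THE", and "tHe", the wordlist holds the
entry "the" with counts [1, 1, 1] plus a separate entry for "tHe".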

> > For the uncommon forms, store the exact token.
> >
> > As examples, consider the eight forms of "the", i.e. THE, THe, ThE, tHE,
> > The, tHe, thE, and the.  The wordlist would contain entry "the" with 3
> > counts (for "the", "The", and "THE", respectively), plus additional entries
> > for any of the other 5 variants encountered.
>
>Yes,
>
>The idea here is that these weird forms are fairly rare so they can be
>stored separately with no great storage cost - degeneration is to one of
>the standard forms only - not to any of the other weird formats.

Time will tell whether the weird formats are rare.  Certainly one lookup 
covering the three standard forms is a winning speed strategy.
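The "one lookup" point can be sketched like this (a hypothetical helper,
not bogofilter's API): the three standard casings all map to a single
casefolded key, so one database fetch returns all three counts, while a
weird form is looked up by its exact spelling.

```python
def wordlist_key(token):
    """Map a token to its database key: the three standard casings share
    one casefolded key; any other casing is looked up exactly."""
    lower = token.lower()
    if token in (lower, lower.upper(), lower.capitalize()):
        return lower      # shared 3-count record, a single lookup
    return token          # weird form: its own exact-match record
```

All of "the", "The", and "THE" resolve to the key "the"; "tHE" resolves
to "tHE".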

> > An additional minor complication is the number of wordlists - one or two -
> > used by bogofilter.  With a single wordlist, rather than 3 counts, there
> > would be 3 pairs of counts.
>
>I think we can still have a goodlist and a spamlist database
>- degeneration might be different for the two lists depending on which
>tokens have been stored in the two lists
>e.g. fREE -> FREE               (spam list)
>     fREE -> free               (good list)
>But computation of pgood and pspam from individual format counts is a
>separate issue from the storage format - whatever you do you only need to
>look up the casefolded token in both databases
>-this provides all the relevant counts.
>
>The standard output format can be retained for dumping and restoring the
>wordlists
>
>e.g. free 0 10 20
>
>would dump as
>Free 10
>FREE 20
>
>I suppose a combined wordlist *could* be more efficient in time
>and maybe storage space (but this is not certain)

Greg and I have done some tests, and a combined wordlist _is_ 10 to 20% 
faster for finding tokens.  The difficulty is that, as database size 
increases, BerkeleyDB starts to _need_ a larger cache for good 
performance.  Also, some cache sizes can cause truly horrible performance - 
for reasons that we don't yet understand.
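The dump/restore scheme Peter describes above could look roughly like
this (a sketch; the record layout and function name are assumptions): a
combined record such as "free 0 10 20" expands to one standard dump line
per nonzero casing, which the existing restore path can reload.

```python
def dump_record(token, counts):
    """Expand one 3-count record (lower, Capitalized, UPPER) into
    standard "token count" dump lines, omitting zero counts."""
    lower, cap, upper = counts
    forms = [(token, lower), (token.capitalize(), cap), (token.upper(), upper)]
    return ["%s %d" % (form, n) for form, n in forms if n > 0]
```

With this, the record "free 0 10 20" dumps as the two lines "Free 10"
and "FREE 20", matching the example quoted above.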

>And it might involve more code and user installation changes
>(ability to use old databases, converting databases, etc).

The code to change from two wordlists to one is nearly trivial.  Dump the 
two wordlists, use an awk script to add zero counts to each record, and 
load a new combined wordlist.  To minimize the size, the new wordlist needs 
to be dumped and reloaded (due to BerkeleyDB inefficiencies in the initial 
load).
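The awk step mentioned above amounts to padding each dumped record with
a zero for the other list's count.  A rough Python equivalent (the exact
"token count" record layout is an assumption here):

```python
def pad_record(line, which):
    """Turn a dumped "token count" line into a combined "token good spam"
    line, filling in zero for the list the record did not come from.
    which is "good" or "spam", naming the source dump."""
    token, count = line.split()
    if which == "good":
        return "%s %s 0" % (token, count)
    return "%s 0 %s" % (token, count)
```

So "free 10" from the goodlist dump becomes "free 10 0", and "free 20"
from the spamlist dump becomes "free 0 20"; loading both into one
database yields the combined wordlist.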

David





More information about the Bogofilter mailing list