token degeneration

Wed Jun 4 22:21:54 CEST 2003

Greg Louis <glouis at dynamicro.on.ca> writes:

> On 20030530 (Fri) at 2134:47 -0400, David Relson wrote:
>
>> Training can be complicated when Anycase is supported.  Ideally 
>> Anycase*token only exists if there are two variants of token in the 
>> wordlist.
>
> I think of storing a key-linked list (not a pointer-linked one); I

BerkeleyDB also supports storing multiple tokens with the same name. I
haven't yet figured out whether it lets us supply our own comparison
function, but I think it should.
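
To make the idea concrete, here is a rough sketch in Python. It does
not use BerkeleyDB at all: a sorted in-memory list stands in for a
store that permits duplicate keys, and fold() is a hypothetical
comparison key that ignores case. All names are made up for
illustration.

```python
import bisect


def fold(token):
    # Hypothetical comparison key: variants that differ only in case
    # compare equal under it.
    return token.lower()


class VariantIndex:
    """In-memory stand-in for a sorted store that permits duplicate keys."""

    def __init__(self):
        # Sorted list of (folded key, original token, count).
        self._rows = []

    def _first(self, key):
        # (key,) sorts before every (key, token, count), so this finds
        # the start of the run of rows sharing the folded key.
        return bisect.bisect_left(self._rows, (key,))

    def add(self, token, count=1):
        key = fold(token)
        i = self._first(key)
        while i < len(self._rows) and self._rows[i][0] == key:
            if self._rows[i][1] == token:
                # Exact variant already stored: bump its count in place.
                self._rows[i] = (key, token, self._rows[i][2] + count)
                return
            i += 1
        bisect.insort(self._rows, (key, token, count))

    def variants(self, token):
        # Return every stored (token, count) that folds to the same key.
        key = fold(token)
        i = self._first(key)
        out = []
        while i < len(self._rows) and self._rows[i][0] == key:
            out.append((self._rows[i][1], self._rows[i][2]))
            i += 1
        return out
```

With a store like this, all variants of a token sit next to each other,
so one positioned lookup plus a short scan finds them.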

> As long as the average number of db lookups remains small, the impact
> on throughput may be tolerable; the impact on the database size might
> be quite painful, though.  (We could teach bogoutil to rebuild lists in
> decreasing order of total count, so that more common variants need
> fewer searches; running that from time to time would minimize the
> performance hit.)  The benefit of all this would have to be rather high
> to justify the complexity.
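
A quick sketch of that reordering in Python, to put a number on the
saving. The function names are made up, and it assumes that a variant
is looked up about as often as its stored count suggests, which may not
hold in practice.

```python
def rebuild_by_count(variants):
    """Return (token, count) pairs in decreasing order of count.

    Ties are broken by token so the result is deterministic.
    """
    return sorted(variants, key=lambda tc: (-tc[1], tc[0]))


def average_lookups(variants):
    """Mean number of sequential probes to reach a variant.

    Assumption: each variant is searched for in proportion to its count.
    """
    total = sum(count for _, count in variants)
    if total == 0:
        return 0.0
    probes = sum(pos * count
                 for pos, (_, count) in enumerate(variants, start=1))
    return probes / total


chain = [("FREE", 1), ("Free", 2), ("free", 7)]
before = average_lookups(chain)                   # (1*1 + 2*2 + 3*7) / 10
after = average_lookups(rebuild_by_count(chain))  # (1*7 + 2*2 + 3*1) / 10
```

For this made-up chain the average drops from 2.6 probes to 1.4.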

Isn't all this ultimately about similarity matching? For any value of
"similarity", of course, but for phonetic search or "looks similar"
(l33tsp33ch) searches, this might be the way to go.

-- 
Matthias Andree