token degeneration

David Relson relson at osagesoftware.com
Wed Jun 4 22:34:12 CEST 2003


At 04:21 PM 6/4/03, Matthias Andree wrote:
>Greg Louis <glouis at dynamicro.on.ca> writes:
>
> > On 20030530 (Fri) at 2134:47 -0400, David Relson wrote:
> >
> >> Training can be complicated when Anycase is supported.  Ideally
> >> Anycase*token only exists if there are two variants of token in the
> >> wordlist.
> >
> > I think of storing a key-linked list (not a pointer-linked one); I
>
>BerkeleyDB also supports storing multiple tokens with the same name. I
>haven't yet figured out whether it lets us supply our own comparison
>function, but I think it should.
>
> > As long as the average number of db lookups remains small, the impact
> > on throughput may be tolerable; the impact on the database size might
> > be quite painful, though.  (We could teach bogoutil to rebuild lists in
> > decreasing order of total count, so that more common variants need
> > fewer searches; running that from time to time would minimize the
> > performance hit.)  The benefit of all this would have to be rather high
> > to justify the complexity.
>
>Isn't all this ultimately about a similarity "match"? For any value of
>"similarity", of course, but looking at phonetic search or "looks like
>l33tsp33ch" searches, this might be the way to go.

Matthias,

I have code that implements Paul Graham's degeneration algorithm.  It's 
working, though I still have to add a flag so it doesn't repeat the initial 
(failed) lookup (just haven't gotten around to it :-)

I haven't added it to bogofilter because of the impending release.  I can 
make the patch available publicly or privately, as you wish.

David

More information about the Bogofilter mailing list