token degeneration

Greg Louis glouis at dynamicro.on.ca
Sat May 31 13:08:10 CEST 2003


On 20030530 (Fri) at 2134:47 -0400, David Relson wrote:

> Training can be complicated when Anycase is supported.  Ideally 
> Anycase*token only exists if there are two variants of token in the 
> wordlist.

I think of storing a key-linked list (not a pointer-linked one); I
don't see this as terribly hard, given that the list grows and does not
shrink (except during maintenance with bogoutil).  We'd store a token
case-folded, plus any variants that are encountered.  Perform the db
lookup on the case-folded key and walk the list to find the best match. 
For the large number of cases where there is only a lower-case token,
this costs MAXTOKENLENGTH unused bytes per token.  For the less large
number of cases with one mixed-case token, we've added one level of
indirection that isn't strictly necessary, but I'm shy of the
complexity of the workaround.  For the rest, we've done what's needed.
As long as the average number of db lookups remains small, the impact
on throughput may be tolerable; the impact on the database size might
be quite painful, though.  (We could teach bogoutil to rebuild lists in
decreasing order of total count, so that more common variants need
fewer searches; running that from time to time would minimize the
performance hit.)  The benefit of all this would have to be rather high
to justify the complexity.

> I interpret as Anywhere*foo as the easy fallback when there's no exact 
> match.  It will be retrieved whenever a new case-sensitive variant is 
> encountered.

Right.  We seem to be in agreement on this.

> So, the conclusion seems to be "implement the bells and whistles, test, 
> throw away what's not useful."
> 
> Before that happens, I want to be clear on what we're going to test.

Sounds eminently reasonable :)

-- 
| G r e g  L o u i s          | gpg public key: finger     |
|   http://www.bgl.nu/~glouis |   glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |




More information about the Bogofilter mailing list