Degeneration thought

David Relson relson at osagesoftware.com
Thu Jun 5 14:04:34 CEST 2003


Peter,

An interesting idea.  Let me summarize to see if I have it right.

For a token, there are three common forms - all lower case, all upper case, 
and the standard capitalized form.  Rather than store all three of these 
forms as separate wordlist entries, store only the lower case form, with 
3 counts, one per common form.

For the uncommon forms, store the exact token.

As examples, consider the eight forms of "the", i.e. THE, THe, ThE, tHE, 
The, tHe, thE, and the.  The wordlist would contain entry "the" with 3 
counts (for "the", "The", and "THE", respectively), plus additional entries 
for any of the other 5 variants encountered.
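
To make sure we mean the same thing, here is a minimal Python sketch of the
scheme as I understand it.  It is not bogofilter code: the names
(CaseFoldedWordlist, common_form, add, count) are hypothetical, the wordlist
is just an in-memory dict, and keeping the uncommon forms in a second dict is
my assumption for readability.

```python
from collections import defaultdict

# Slots for the three common case forms, in the order "the", "The", "THE".
LOWER, CAPITALIZED, UPPER = 0, 1, 2


def common_form(token):
    """Return the slot of a common case form, or None for an uncommon one.

    Tokens that fit more than one form (e.g. the single letter "A") go to
    the first slot that matches; that tie-break is an arbitrary choice.
    """
    if token == token.lower():
        return LOWER
    if token == token.capitalize():
        return CAPITALIZED
    if token == token.upper():
        return UPPER
    return None


class CaseFoldedWordlist:
    """Hypothetical wordlist: one entry per lower case token, 3 counts."""

    def __init__(self):
        # Lower case token -> [lower, capitalized, upper] counts.
        self.folded = defaultdict(lambda: [0, 0, 0])
        # Exact token -> count, for the uncommon forms such as "tHe".
        self.exact = defaultdict(int)

    def add(self, token):
        slot = common_form(token)
        if slot is None:
            self.exact[token] += 1
        else:
            self.folded[token.lower()][slot] += 1

    def count(self, token):
        slot = common_form(token)
        if slot is None:
            return self.exact.get(token, 0)
        counts = self.folded.get(token.lower())
        return counts[slot] if counts else 0
```

Adding "the", "The", "THE" and "tHe" once each would leave one folded entry
"the" with a count in each of its three slots, plus one exact entry "tHe".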

An additional minor complication is the number of wordlists - one or two - 
used by bogofilter.  With a single wordlist, rather than 3 counts, there 
would be 3 pairs of counts.
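
A rough sketch of the difference in entry layout, again in Python and again
hypothetical (empty_entry and bump are names I made up, and I am assuming
each pair is ordered spam count first, good count second):

```python
SPAM, GOOD = 0, 1  # position within a pair (assumed order)


def empty_entry(single_wordlist):
    """Build an empty entry for one lower case token."""
    if single_wordlist:
        # One combined wordlist: 3 pairs of (spam, good) counts.
        return [[0, 0] for _ in range(3)]
    # Separate spam and good wordlists: 3 plain counts per list.
    return [0, 0, 0]


def bump(entry, slot, is_spam, single_wordlist):
    """Count one occurrence of the case form in `slot` (0, 1 or 2)."""
    if single_wordlist:
        entry[slot][SPAM if is_spam else GOOD] += 1
    else:
        # The caller has already picked the spam or the good wordlist.
        entry[slot] += 1
```

With two wordlists the spam/good distinction is made by choosing which list
to update, so is_spam is only needed for the single wordlist layout.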

Have I got your idea right?

David




