Degeneration thought
David Relson
relson at osagesoftware.com
Thu Jun 5 14:04:34 CEST 2003
Peter,
An interesting idea. Let me summarize to see if I have it right.
For a token, there are three common forms - all lower case, all upper case,
and the standard capitalized form. Rather than store all three of these
forms as separate entries in the wordlist, store only the lower case form,
with 3 counts (one per common form).
For the uncommon forms, store the exact token.
As an example, consider the eight forms of "the", i.e. THE, THe, ThE, tHE,
The, tHe, thE, and the. The wordlist would contain the entry "the" with 3
counts (for "the", "The", and "THE", respectively), plus additional entries
for any of the other 5 variants encountered.
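
To check my understanding, here is a rough Python sketch of that scheme. It
is only an illustration, not bogofilter code: the names case_slot and record,
the two dictionaries, and the order of the 3 counts (lower, capitalized,
upper) are all my own assumptions.

```python
# Slot order for the 3 counts is an assumption: lower, capitalized, upper.
LOWER, CAPITALIZED, UPPER = 0, 1, 2


def case_slot(token):
    """Return the slot of a common case form, or None for an uncommon form."""
    if token == token.lower():
        return LOWER
    if token == token.upper():
        return UPPER
    if token == token.capitalize():
        return CAPITALIZED
    return None


def record(folded, exact, token):
    """Count one occurrence of token.

    folded: lower-case token -> [lower, capitalized, upper] counts
    exact:  uncommon mixed-case token -> single count
    """
    slot = case_slot(token)
    if slot is None:
        # Uncommon form: stored under the exact token.
        exact[token] = exact.get(token, 0) + 1
    else:
        # Common form: folded into the lower-case entry.
        counts = folded.setdefault(token.lower(), [0, 0, 0])
        counts[slot] += 1


folded, exact = {}, {}
for tok in ["THE", "THe", "ThE", "tHE", "The", "tHe", "thE", "the"]:
    record(folded, exact, tok)

print(folded)
print(sorted(exact))
```

Feeding in the eight forms once each leaves one folded entry, "the", with
three counts, and five exact entries. Note that a single-letter token such as
"A" is both all upper case and capitalized; the sketch arbitrarily counts it
as upper case.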
An additional minor complication is the number of wordlists - one or two -
used by bogofilter. With two wordlists, each list's entry would have 3
counts. With a single wordlist, rather than 3 counts, each entry would have
3 pairs of counts.
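
For the single wordlist case, the entry might look something like the sketch
below. Again this is only an illustration with made-up names (common_form,
record), and it assumes each pair holds a spam count and a non-spam count,
in that order.

```python
FORMS = ("lower", "capitalized", "upper")  # assumed order of the 3 pairs


def common_form(token):
    """Index into FORMS for a common case form, or None for an uncommon one."""
    if token == token.lower():
        return 0
    if token == token.upper():
        return 2
    if token == token.capitalize():
        return 1
    return None


def record(wordlist, token, is_spam):
    """Count one occurrence of token in a single combined wordlist.

    Common forms share the lower-case entry, which holds 3 [spam, nonspam]
    pairs. Uncommon forms get their own entry holding a single pair.
    """
    form = common_form(token)
    if form is None:
        pair = wordlist.setdefault(token, [[0, 0]])[0]
    else:
        entry = wordlist.setdefault(token.lower(), [[0, 0] for _ in FORMS])
        pair = entry[form]
    pair[0 if is_spam else 1] += 1


wordlist = {}
record(wordlist, "The", is_spam=True)
record(wordlist, "the", is_spam=False)
record(wordlist, "THE", is_spam=True)
record(wordlist, "tHe", is_spam=False)
print(wordlist)
```

An uncommon form can never collide with a folded key, since a folded key is
always all lower case and an uncommon form never is.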
Have I got your idea right?
David