Degeneration thought

Peter Bishop pgb at adelard.com
Thu Jun 5 16:04:19 CEST 2003


On 5 Jun 2003 at 8:04, David Relson wrote:

> For a token, there are three common forms - all lower case, all upper case,
> and the standard capitalized form.  Rather than store all three of these 
> forms in the wordlist, store the lower case form with 3 counts.

Yes,

Though to save space I suggest *up to* three counts (depending which forms 
have been found) so if only the lower case form has been seen you only need 
a single count (implicitly the other two unspecified counts are zero).

> For the uncommon forms, store the exact token.
> 
> As examples, consider the eight forms of "the", i.e. THE, THe, ThE, tHE, 
> The, tHe, thE, and the.  The wordlist would contain entry "the" with 3 
> counts (for "the", "The", and "THE", respectively), plus additional entries
> for any of the other 5 variants encountered.

Yes, 

The idea here is that these weird forms are fairly rare so they can be 
stored separately with no great storage cost - degeneration is to one of 
the standard forms only - not to any of the other weird formats.

> An additional minor complication is the number of wordlists - one or two -
> used by bogofilter.  With a single wordlist, rather than 3 counts, there 
> would be 3 pairs of counts.

I think we can still have a goodlist and a spamlist database
- degeneration might be different for the two lists depending on which 
tokens have been stored in the two lists
e.g. fREE -> FREE		 (spam list)
       fREE -> free		 (good list)
But computation of pgood and pspam from individual format counts is a 
separate issue from the storage format - whatever you do you only need to 
look up the casefolded token in both databases
-this provides all the relevant counts.

The standard output format can be retained for dumping and restoring the 
wordlists

e.g. free 0 10 20

would dump as
Free 10
FREE 20

I suppose a combined wordlist *could* be more efficient in time 
and maybe storage space (but this is not certain)

And it might involve more code and user installation changes
(ability to use old databases, converting databases, etc).


-- 
Peter Bishop 
pgb at adelard.com
pgb at csr.city.ac.uk






More information about the Bogofilter mailing list