Way to go

David Relson relson at osagesoftware.com
Thu Jun 26 14:35:29 CEST 2003


At 02:55 AM 6/26/03, Boris 'pi' Piwinger wrote:
>Hi!
>
>So what is coming next? Bogofilter does a great job by now.
>What will happen until 1.0?
>
>The one big thing I can think about is the implementation of
>charsets, i.e., unicode translation.
>
>pi

Hi pi,

There are two things still in the pipeline: the combined wordlist version 
(using wordlist.db rather than spamlist.db and goodlist.db) and the token 
degeneration code (looking for "Free" and "free" when "FRee" isn't 
matched).  It has also been suggested that, instead of having separate 
entries for the three most common forms of a word, i.e. "word", "Word", 
and "WORD", bogofilter have one entry with 3 counts (or pairs of 
counts).  For unusual capitalizations, e.g. "wOrD", "worD", etc., 
bogofilter could (a) store all such forms, (b) store only "mis:word" (for 
mis-capitalized), or (c) have a 4th count.
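
To make the single-entry idea concrete, here is a rough sketch in Python.  It 
is not bogofilter's code or its on-disk format; the names cap_class and 
CombinedWordList, and the class labels, are invented for this illustration.  
Each entry is keyed by the lowercased word and holds one pair of 
(spam, good) counts per capitalization class, with a shared fourth "mis" 
slot standing in for option (c).

```python
from collections import defaultdict

# Hypothetical capitalization classes: "word", "Word", "WORD", anything else.
CLASSES = ("lower", "cap", "upper", "mis")


def cap_class(token):
    """Classify a token by its capitalization pattern."""
    if token == token.lower():
        return "lower"
    if token == token.upper():
        return "upper"
    if token == token.capitalize():
        return "cap"
    return "mis"  # unusual forms such as "wOrD" share one slot


class CombinedWordList:
    """One entry per lowercased word; each entry holds a [spam, good]
    pair of counts for every capitalization class."""

    def __init__(self):
        self.entries = defaultdict(lambda: {c: [0, 0] for c in CLASSES})

    def register(self, token, is_spam, n=1):
        pair = self.entries[token.lower()][cap_class(token)]
        pair[0 if is_spam else 1] += n

    def counts(self, token):
        """Return (spam, good) counts for the token's own form."""
        entry = self.entries.get(token.lower())
        if entry is None:
            return (0, 0)
        return tuple(entry[cap_class(token)])


wl = CombinedWordList()
wl.register("Free", is_spam=True)
wl.register("FREE", is_spam=True, n=3)
wl.register("free", is_spam=False)
wl.register("fRee", is_spam=True)
print(len(wl.entries), wl.counts("FREE"), wl.counts("frEE"))
```

In this sketch all four spellings land in a single entry, and "frEE" reads 
the same "mis" counts that "fRee" was registered under.  Whether that saves 
space compared with separate entries would depend on the real record layout.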

At the moment, I'm working on the wordlist code so that it will work 
properly whether one or two lists are present.  There will likely need to 
be an option so that new wordlists are created in the form the user 
wants.  Chances are good that the code will be releasable within a week 
as bogofilter-0.14.

The degen code currently stores each word form and, in the worst case, will 
do 17 additional database lookups to find the extreme form (most spammish 
or most hammish).  Experiments need to be run to learn which of the 
different code variations gives the best performance, i.e. speed and 
wordlist size.  It will take time to write the code and do the testing.
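
The fallback described above can be sketched roughly as follows, again in 
Python and again only as an illustration.  The wordlist is modeled as a 
plain dict mapping a token to a spam probability, and the helper names 
degenerate_forms and lookup are made up for this example.  Only two case 
variants are tried here; the real degen code considers more forms, which is 
where the larger worst-case lookup count comes from.

```python
def degenerate_forms(token):
    """Candidate fallback spellings, e.g. "FRee" -> ["Free", "free"]."""
    forms = []
    for form in (token.capitalize(), token.lower()):
        if form != token and form not in forms:
            forms.append(form)
    return forms


def lookup(wordlist, token, neutral=0.5):
    """Return (matched_form, spam_probability, extra_lookups).

    An exact match wins.  Otherwise every degenerate form is looked up
    and the one furthest from `neutral` (most spammish or most hammish)
    is kept.  With no match at all, the result is (None, neutral, n).
    """
    if token in wordlist:
        return token, wordlist[token], 0
    best_form, best_prob = None, neutral
    extra = 0
    for form in degenerate_forms(token):
        extra += 1  # each candidate costs one more database lookup
        prob = wordlist.get(form)
        if prob is not None and abs(prob - neutral) > abs(best_prob - neutral):
            best_form, best_prob = form, prob
    return best_form, best_prob, extra


# Assumed example probabilities, not measured values.
probs = {"Free": 0.9, "free": 0.6}
print(lookup(probs, "FRee"))
```

The extra counter makes the cost visible: every form that has to be tried 
is another database read, which is why the number of variants matters for 
speed.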

David





