source file organization

Fri Jan 3 21:17:21 CET 2003

At 03:06 PM 1/3/03, Jake Di Toro wrote:

>On Fri, 3 Jan 2003, David Relson wrote:
>
> > Regarding algorithms, I think tuning will continue for a long time.  I've
> > been using Fisher since 0.9.1.2 was released and I'm very pleased with
> > it.  I wouldn't mind making it the default.
>
>What will changing the default do to those of us who are just running on
>it?  I haven't looked at the code well enough, but are the word databases
>indifferent to the algorithms or tied?

Hello Jake,

Currently the default algorithm is Robinson and the suggested change is to 
Robinson-Fisher.  Robinson and Robinson-Fisher do the same basic 
calculation to compute the spamicity score.  Robinson-Fisher adds a 
chi-square test at the end to generate a three-state result
(ham/spam/unsure).  They treat the wordlists in the same way.
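
To make the three-state idea concrete, here is a rough sketch in Python.
It is not bogofilter's code.  It uses one commonly described form of
Fisher's chi-square combining, and the function names and the two cutoff
values (0.10 and 0.95) are made up for illustration.  It assumes each
per-word probability is strictly between 0 and 1.

```python
import math


def chi2q(x2, dof):
    """Upper-tail chi-square probability for an even number of degrees of freedom."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def fisher_score(probs):
    """Combine per-word spam probabilities (each in (0, 1)) into one score."""
    n = len(probs)
    ln_spam = sum(math.log(1.0 - p) for p in probs)
    ln_ham = sum(math.log(p) for p in probs)
    s = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)  # evidence for spam
    h = 1.0 - chi2q(-2.0 * ln_ham, 2 * n)   # evidence for ham
    return (s - h + 1.0) / 2.0


def classify(score, ham_cutoff=0.10, spam_cutoff=0.95):
    """Map a score to ham/spam/unsure; the cutoff values are illustrative only."""
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"
```

With these assumed cutoffs, a message whose words are all strongly spammy
scores near 1, one whose words are all strongly hammy scores near 0, and
a message with evenly balanced evidence lands in the "unsure" band.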

The original algorithm used in bogofilter, the Graham algorithm, operated 
slightly differently.  When collecting the words from a message, each word 
was allowed up to four points (if it occurred 4 or more times in the 
message).  These "points" were added to the counts in the wordlist when the
wordlist was updated.  For R/RF, a word can get only 1 point per message.
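
The difference in counting can be sketched in a few lines of Python.
Again, this is only an illustration, not bogofilter's code: the function
names are hypothetical and the wordlist is modelled as a plain dict.

```python
from collections import Counter


def message_points(tokens, cap):
    """Points each word earns from one message, limited to `cap` per word."""
    counts = Counter(tokens)
    return {word: min(count, cap) for word, count in counts.items()}


def update_wordlist(wordlist, tokens, cap):
    """Add one message's points to the running counts in `wordlist`."""
    for word, points in message_points(tokens, cap).items():
        wordlist[word] = wordlist.get(word, 0) + points
    return wordlist


tokens = ["free"] * 6 + ["hello"] * 2
graham_style = message_points(tokens, cap=4)  # up to four points per word
rf_style = message_points(tokens, cap=1)      # one point per word
```

With cap=4 the word "free" contributes 4 points and "hello" 2, while with
cap=1 each word contributes exactly 1 no matter how often it appears.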

In actual usage, one can change algorithms without changing the 
wordlists.  I personally used Graham for a month or so, then switched to 
Robinson for a month or so, and have been using Robinson-Fisher for the 
past month.  It's only been this past week that I've taken the accumulated 
messages of the 3 months and (using the randomtrain script) built fresh, 
new wordlists.  FWIW, the new wordlists are 10% to 20% of the size of the 
old ones and are doing quite well.  The number of "unsures" seems to be 
higher, but I expect that to level out over the next few weeks.
David