source file organization
David Relson
relson at osagesoftware.com
Fri Jan 3 21:17:21 CET 2003
At 03:06 PM 1/3/03, Jake Di Toro wrote:
>On Fri, 3 Jan 2003, David Relson wrote:
>
> > Regarding algorithms, I think tuning will continue for a long time. I've
> > been using Fisher since 0.9.1.2 was released and I'm very pleased with
> > it. I wouldn't mind making it the default.
>
>What will changing the default do to those of us who are just running on
>it. I haven't looked at the code well enough, but are the word databases
>indifferent to the algorithms or tied?
Hello Jake,
Currently the default algorithm is Robinson and the suggested change is to
Robinson-Fisher. Robinson and Robinson-Fisher do the same basic
calculation to compute the spamicity score. Robinson-Fisher adds a
chi-square test at the end to generate a three-state result
(ham/spam/unsure). They treat the wordlists in the same way.
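
To illustrate the chi-square step, here is a minimal Python sketch of Fisher's method for combining per-word probabilities into a three-state result. It is not bogofilter's code: the function names and the 0.1/0.9 cutoffs are assumptions chosen for illustration, and the per-word probabilities are assumed to be already computed and strictly between 0 and 1.

```python
import math

def chi2q(x2, dof):
    """Chi-square survival function P(X >= x2), valid for even dof only."""
    assert dof % 2 == 0
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    # Closed form for even dof: exp(-m) * sum_{i < dof/2} m**i / i!
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_score(probs):
    """Combine per-word spam probabilities (each in (0, 1)) into one score."""
    n = len(probs)
    # Close to 1 when the words look spammy.
    spam_ind = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # Close to 1 when the words look hammy.
    ham_ind = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (1.0 + spam_ind - ham_ind) / 2.0

def classify(probs, ham_cutoff=0.1, spam_cutoff=0.9):
    """Hypothetical cutoffs; real deployments tune these values."""
    score = fisher_score(probs)
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"
```

A message whose words are all strongly spammy or all strongly hammy lands near 1 or 0, while mixed or neutral evidence lands near 0.5 and is reported as "unsure".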
The original algorithm used in bogofilter, the Graham algorithm, operated
slightly differently. When collecting the words from a message, each word
was allowed up to four points (if it occurred 4 or more times in the
message). These "points" were added to the counts in the wordlist when the
wordlist was updated. For R/RF, a word can get only 1 point per message.
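
The counting difference can be sketched in a few lines of Python. This is an illustration only, not bogofilter's implementation: the function names, the dict-based wordlist, and the constants are hypothetical, and tokenizing is assumed to have already happened.

```python
from collections import Counter

GRAHAM_CAP = 4    # Graham: up to 4 points per word per message
ROBINSON_CAP = 1  # Robinson / Robinson-Fisher: 1 point per word per message

def message_points(tokens, max_per_word):
    """Points each word contributes for one message, capped per word."""
    counts = Counter(tokens)
    return {word: min(count, max_per_word) for word, count in counts.items()}

def update_wordlist(wordlist, tokens, max_per_word):
    """Add one message's points to the running counts in the wordlist."""
    for word, points in message_points(tokens, max_per_word).items():
        wordlist[word] = wordlist.get(word, 0) + points
    return wordlist

tokens = ["free"] * 6 + ["money"] * 2 + ["hello"]
graham_list = update_wordlist({}, tokens, GRAHAM_CAP)
robinson_list = update_wordlist({}, tokens, ROBINSON_CAP)
```

With the sample message, "free" occurs six times and contributes 4 points under the Graham cap but only 1 under the R/RF cap, so wordlists trained under Graham carry larger counts for repeated words.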
In actual usage, one can change algorithms without changing the
wordlists. I personally used Graham for a month or so, then switched to
Robinson for a month or so, and have been using Robinson-Fisher for the
past month. It's only been this past week that I've taken the accumulated
messages of the 3 months and (using the randomtrain script) built fresh,
new wordlists. FWIW, the new wordlists are 10% to 20% of the size of the
old ones and are doing quite well. The number of "unsures" seems to be
higher, but I expect that to level out over the next few weeks.
David