Is bogofilter Bayesian?

Greg Louis glouis at dynamicro.on.ca
Tue Feb 10 18:31:43 CET 2004


On 20040210 (Tue) at 1727:27 +0100, Boris 'pi' Piwinger wrote:

> Let me rephrase: We abuse the theory right from the
> beginning (one might argue that this might in fact
> help rather than hinder discrimination). The wording in the
> FAQ suggests that we don't and only training on error adds a
> (theoretical) flaw. This gives the wrong impression. Any
> better wording is welcome.

Using Bayesian classification for email the way we do, with full
training, already violates two assumptions on which Bayesian
classification is based: independence of tokens within messages, and
uniform distribution of scores.  In discussing training on error and
training to exhaustion, we don't mention that.  We mention that
training on error violates a (third) assumption, and that training to
exhaustion violates that one and a fourth.  So they do.  It would
probably be a good idea to mention somewhere the first two assumptions,
which are violated anyway, but to conclude that it's ok to violate more
assumptions because you can't be pure anyway doesn't make much sense to
me.
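
To make the independence point concrete, here is a minimal sketch,
assuming a plain naive-Bayes combination with equal spam/ham priors.
It is not bogofilter's actual scoring code; the function name
combine_independent and the sample probabilities are made up for
illustration.  It shows how tokens that always occur together, and so
carry the same evidence, get counted as separate evidence:

```python
import math


def combine_independent(token_probs):
    """Combine per-token P(spam | token) values into one message score.

    Assumes equal priors and that tokens are independent given the
    class, which is exactly the assumption real email violates.
    """
    spam_evidence = math.prod(token_probs)
    ham_evidence = math.prod(1.0 - p for p in token_probs)
    return spam_evidence / (spam_evidence + ham_evidence)


# One mildly spammy token on its own (hypothetical probability).
single = combine_independent([0.8])

# The same evidence seen three times, e.g. three words of one fixed
# phrase: the formula treats them as three independent observations.
repeated = combine_independent([0.8, 0.8, 0.8])

print(f"single token:    {single:.3f}")
print(f"repeated tokens: {repeated:.3f}")
```

The repeated case scores much closer to 1 than the single token does,
although no new information was added; that overconfidence is the
practical effect of the violated independence assumption.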

I'm running a 30,000-of-each exhaustion test using my home corpus, and
assembling a 55,000-of-each corpus at work where the email population
is much more diverse.  No results to report as yet.

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |



