Is bogofilter Bayesian?

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Tue Feb 10 13:37:31 CET 2004


Greg Louis wrote:

>> 1) Is bogofilter Bayesian? It certainly uses Bayesian
>> methods at some point, but not throughout the computation;
>> in particular, the Fisher method (using the chi-squared
>> function) is not Bayesian. So he suggests calling such
>> filters "adaptive" or "statistical", where the latter
>> includes Bayesian.
> 
> Strictly speaking, the "Bayesian" spam filters accept the assumptions
> on which Bayesian statistical methodology is based, even though email
> grossly violates those assumptions.  IIRC Gary at one point felt (I
> don't know if he still does) that the violations in question
> (principally non-independence of tokens within messages) might in fact
> help rather than hinder discrimination.

In the sense that we check how badly the assumption fails,
the violation is actually helpful.

> At any rate, the filters
> (bogofilter included) are as Bayesian as nut bread is nuts (I happen to
> have a loaf of nut bread in the bread machine at the moment, hence the
> comparison) -- the term derives from a component that's responsible for
> the principal flavour ;)

Nice argument. But what really gives the flavor here? Isn't
much of it coming from other ingredients (all the Fisher
calculation)?
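
For illustration, the Fisher part looks roughly like this --
a minimal Python sketch of Robinson's inverse chi-square
combining, not bogofilter's actual C code; the function
names and the clamping epsilon are my own:

    import math

    def chi2Q(x2, v):
        # Survival function of the chi-squared distribution for
        # even degrees of freedom v, via the closed-form series
        # Q(x) = exp(-x/2) * sum_{i < v/2} (x/2)^i / i!
        m = x2 / 2.0
        term = prob = math.exp(-m)
        for i in range(1, v // 2):
            term *= m / i
            prob += term
        return min(prob, 1.0)

    def spamicity(probs):
        # probs: per-token spam probabilities, clamped away from
        # 0 and 1 to avoid log(0).
        eps = 1e-9
        probs = [min(max(p, eps), 1.0 - eps) for p in probs]
        n = len(probs)
        # Fisher: under the null, -2 * sum(ln p_i) ~ chi^2(2n).
        spam_ev = chi2Q(-2.0 * sum(math.log(p) for p in probs),
                        2 * n)
        ham_ev = chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs),
                       2 * n)
        # Near 1 = spam, near 0 = ham, near 0.5 = unsure.
        return (1.0 + spam_ev - ham_ev) / 2.0

    print(spamicity([0.99, 0.97, 0.9]))   # ~1: clearly spam
    print(spamicity([0.01, 0.03, 0.1]))   # ~0: clearly ham

The combining step is a classical frequentist significance
test, not a Bayesian update.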

>> In his opinion the choice of messages for
>> training on error also does no harm to this concept, hence
>> the warning would be inappropriate.
> 
> I'd be surprised if he were to confirm that your interpretation of his
> opinion is accurate here. 

He pretty much says it on his blog (in direct reply to the
bogofilter FAQ text):

"Frankly, there are fundamental theoretical violations even
in mainstream filters such as those based on the
frequently-used "naive Bayes" approach or on my own work,
because there is a theoretical assumption of statistical
independence (not the same as the randomly chosen sample
issue) which is violated by most of these techniques. But it
was long ago experimentally shown that naive Bayes is
actually robust against such a lack of independence.
Eventually proofs were created to explain it, but they came
after-the-fact. Later, my own technique was experimentally
shown to be similarly robust (although I do have a technique
"in the lab" to make it a bit more robust against that
particular violation of the rules)."

There was even more direct wording in our mail discussion.
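
For reference, the independence assumption he means is the
one baked into the "naive Bayes" combining rule. A minimal
sketch (Graham-style with equal priors assumed; the function
name is my own):

    def naive_bayes_spamicity(probs):
        # Multiply per-token probabilities as if tokens occurred
        # independently of one another -- that independence is
        # exactly the assumption being violated in real mail.
        prod_spam = 1.0
        prod_ham = 1.0
        for p in probs:
            prod_spam *= p
            prod_ham *= 1.0 - p
        return prod_spam / (prod_spam + prod_ham)

Multiplying the per-token probabilities is only justified
when the tokens occur independently, which in real messages
they clearly do not.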

> The fact that we can't avoid violating the
> assumptions on which Bayesian classification is based is no reason
> deliberately to multiply such violation, nor does it offer any
> protection from worsening accuracy by such multiplication. 

Again: we want the assumption to fail; the point is to
measure the degree of failure.

> However,
> Gary pointed out in his first paper that pristine Bayesian validity is
> unachievable, and you and I have agreed, on this list, with his
> principle that what works matters more than what's statistically pure.

That being the main point anyway.

> I prefer, however, not to (mis)lead people into thinking it doesn't
> matter what you train with, nor how many times you do so; I would be
> sorry to see the warning removed.

Well, the argument isn't correct. So a warning about an
effect nobody has observed (AFAICS), without proper
reasoning behind it, sounds funny.

Very interesting here is what Liudvikas Bukys added about
AdaBoost
(http://kiew.cs.uni-dortmund.de:8001/mlnet/instances/81d91e8d-dc15ed23e9),
which may provide a theoretical background for what I found
just by trial and error and from very unscientific ideas.
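
The connection, as I understand it: AdaBoost reweights the
training set toward exactly the examples the current
classifier gets wrong, much like training on error keeps
feeding the filter the messages it misclassified. A minimal
sketch of the reweighting loop (weak_learn is a placeholder
for any weak learner, not a real API; labels are +1/-1):

    import math

    def adaboost(xs, ys, weak_learn, rounds):
        # weak_learn(xs, ys, w) returns a hypothesis h with
        # h(x) in {+1, -1}.
        n = len(xs)
        w = [1.0 / n] * n
        ensemble = []
        for _ in range(rounds):
            h = weak_learn(xs, ys, w)
            err = sum(wi for wi, x, y in zip(w, xs, ys)
                      if h(x) != y)
            if err == 0.0 or err >= 0.5:
                break
            alpha = 0.5 * math.log((1.0 - err) / err)
            # The key step: upweight the examples the ensemble
            # got wrong, downweight the ones it got right.
            w = [wi * math.exp(alpha if h(x) != y else -alpha)
                 for wi, x, y in zip(w, xs, ys)]
            z = sum(w)
            w = [wi / z for wi in w]
            ensemble.append((alpha, h))
        return lambda x: (1 if sum(a * h(x) for a, h in ensemble)
                          >= 0 else -1)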

pi



