New new script to train bogofilter

Greg Louis glouis at dynamicro.on.ca
Fri Jul 4 13:06:42 CEST 2003


On 20030704 (Fri) at 12:37:53 +0200, Boris 'pi' Piwinger wrote:

You sent this to me directly, but I'm going to reply to the list
because there will be others interested in the answer; hope you don't
mind.

> .MSG_COUNT              329    283
> % of read               2.2    1.3
> 
> It turns out that not a single message was used twice in
> training (by accident, but it might cool down worries ;-).
> 
> I am curious to hear why this is not great and why I should
> see problems soon.
> 

OK, I'll give it a quick shot, but it's an abstruse corner of
statistics and I may not be able to explain it both briefly and
convincingly (gotta go for the former; still lots of catch-up to do at
work).  If anyone can add clarification, that would be great:

Bayes' rule, as applied in "Bayesian" spam filtering, assumes two
things that aren't true.  One is that the population is uniform; that
is, that all tokens found in spam have an equal chance of turning up
in any given spam, and all tokens found in nonspam have an equal
chance of turning up in any given nonspam.  This is obviously untrue,
but fortunately it seems that the distribution of tokens departs from
uniformity in a way that actually helps the decision-making process.
Gary Robinson, at least, has made such a claim, but I've never seen a
detailed explanation of why he thinks so.  At any rate, distorting the
counts by pretending a given spam is more prevalent than it really is
will tend to make that particular spam more recognizable, but also to
make spams that are like, but not strongly like, that particular one
less recognizable.
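
To make that concrete, here's a minimal sketch (Python, and
emphatically not bogofilter's actual code) of how registering one spam
several times skews per-token estimates.  The
p(w) = spamcount / (spamcount + hamcount) form below is the simplest
Graham-style estimate; bogofilter's real computation is more
elaborate, but the distortion works the same way:

    from collections import Counter

    spam_counts = Counter()
    ham_counts = Counter()

    def train(tokens, as_spam, times=1):
        """Register a tokenized message; times > 1 mimics overtraining."""
        target = spam_counts if as_spam else ham_counts
        for _ in range(times):
            target.update(tokens)

    def spamicity(token):
        b, g = spam_counts[token], ham_counts[token]
        return b / (b + g) if b + g else 0.5   # unseen tokens stay neutral

    train(["cheap", "pills", "online"], as_spam=True)
    train(["cheap", "flights", "online"], as_spam=False)
    print(spamicity("cheap"))    # 0.5: seen once in spam, once in nonspam

    # Now pretend that one spam had been trained five times over:
    train(["cheap", "pills", "online"], as_spam=True, times=4)
    print(spamicity("cheap"))    # ~0.83: the repeats make "cheap" look spammy

"cheap" starts out neutral at 0.5; after the repeats it scores about
0.83, so legitimate mail that happens to use it now starts with a
handicap.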

The other assumption we violate is that all tokens are independent of
one another.  In fact, the nature of language is such that the
occurrence of a given token makes it likely that certain other tokens
will appear.  These interrelations can be coped with in the analysis,
but again, overtraining on a given spam or nonspam will distort the
overall picture of the population that the Bayesian analysis is
building, so that messages similar to the overtrained one become easier
to classify, but at the expense of those that are not so similar.
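
Here's what that independence assumption looks like in the arithmetic.
The plain naive-Bayes product below (not bogofilter's Robinson/Fisher
calculation) treats every token as a separate piece of evidence, so
two strongly correlated tokens get counted as though they were
independent observations:

    import math

    def combine(probs):
        """Naive-Bayes combination of per-token spam probabilities."""
        log_s = sum(math.log(p) for p in probs)        # evidence for spam
        log_h = sum(math.log(1.0 - p) for p in probs)  # evidence for nonspam
        return 1.0 / (1.0 + math.exp(log_h - log_s))

    print(combine([0.7]))        # 0.7
    print(combine([0.7, 0.7]))   # ~0.84: two correlated tokens, each mildly
                                 # spammy, look much spammier together

If the two tokens nearly always travel together, they really carry
closer to one piece of evidence, not two; overtraining makes such
clusters of correlated tokens even more overweighted.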

So the problem you may see down the road if you train "to extinction"
on errors is that bogofilter will get very good at recognizing the
types of messages on which you train, and rather poor at recognizing
messages that are similar to, but not strongly similar to, those
training ones.  It's like trying to recognize dogs by training on
German Shepherds only: a Great Dane shares the "dog" characteristics,
but we'll have learned too many specific "German Shepherd" ones, so we
may well misclassify the Great Dane.
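
For the record, here's a sketch of what "training to extinction" means
procedurally; classify() and train() are hypothetical stand-ins, not
bogofilter's interface.  The loop hammers one message's tokens into
the wordlists until the score flips, which is exactly how the
German-Shepherd effect arises:

    def train_to_extinction(message, is_spam, classify, train,
                            max_rounds=10):
        """Keep registering a misclassified message until the filter agrees."""
        for _ in range(max_rounds):
            if classify(message) == is_spam:
                break                # the filter now gets it right; stop
            train(message, is_spam)  # register the same message yet again

Training each error once and moving on keeps the counts closer to the
real prevalence of each kind of message.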

-- 
| G r e g  L o u i s          | gpg public key: finger     |
|   http://www.bgl.nu/~glouis |   glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |



