the importance of robx

Greg Louis glouis at dynamicro.on.ca
Sun Feb 29 19:37:26 CET 2004


On 20040229 (Sun) at 11:15:32 -0500, Tom Anderson wrote:
> On Sat, 2004-02-28 at 20:04, Greg Louis wrote:
> > >  My robx is 0.48 and my min_dev is 0.2. 
> > > This means that hapaxes will have no effect on your classifications.
> > 
> > I think you mean unknowns.  If a token has been seen exactly once
> > before, it will have quite a strong influence that will be diluted by x
> > to the degree specified by the s value.  Most of us use quite small s
> > values so our hapaxes count heavily in classification.  I once removed
> > all hapaxes from my training db to see what would happen, and
> > bogofilter's accuracy worsened by an order of magnitude!
> 
> Well, I use a large robs because I don't want new words counting very
> much.  If a word has been registered just once before, or is being
> seen for the first time, its score remains within min_dev and it
> doesn't count towards classification at all.

Nothing wrong with that, if it works for you.

> An explanation of your hypothesis would be nice.

Didn't present one.  I mentioned an obvious fact (with small s, hapaxes
-- tokens whose spam and nonspam counts are 0 and 1, or 1 and 0 -- will,
when seen again, weigh in with a high deviation) and an observation
that, at the time I made it, surprised me (removing all hapaxes brought
accuracy down from 0.99 to 0.9 or so).  But no hypothesis.
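
To make the arithmetic behind that fact concrete: bogofilter smooths
each token's raw spamicity p(w) toward the prior x with weight s,

    f(w) = (s*x + n*p(w)) / (s + n)

where n is the number of messages the token has appeared in.  Here's a
toy calculation in Python (the simplified p(w) assumes equal-sized spam
and nonspam corpora, and the two robs values are picked only to show
the contrast, not anybody's production settings):

    def f_w(spam_count, good_count, robs, robx):
        """Robinson's smoothed spamicity for one token."""
        n = spam_count + good_count        # times the token has been seen
        if n == 0:
            return robx                    # unknown token: pure prior
        p = spam_count / n                 # raw estimate (simplified)
        return (robs * robx + n * p) / (robs + n)

    min_dev = 0.2
    for robs in (0.01, 2.0):               # small s versus large s
        fw = f_w(1, 0, robs, robx=0.48)    # a hapax, seen once in spam
        counted = abs(fw - 0.5) >= min_dev
        print("robs=%.2f  f(w)=%.3f  counted=%s" % (robs, fw, counted))
    # robs=0.01  f(w)=0.995  counted=True   (the hapax weighs in strongly)
    # robs=2.00  f(w)=0.653  counted=False  (discarded by the min_dev band)

With small s the single sighting dominates; with a large enough s the
prior drags f(w) back inside the min_dev band, which is exactly the
behaviour you're relying on.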

> I don't use bogotune, as I don't keep large volumes of spam kicking
> around.  Nor will I ever wish to.

Again, nothing wrong with that, if it works for you.  I can't say
devoting a couple of gig to that garbage thrills me much either, but at
least gigs on disk are cheap these days.

> Theory would be appreciated.

Well, what I didn't feel like doing last night was to try to think
through _why_ hapaxes make such a difference in my production
environment -- which is, as I pointed out above, not a conclusion but
an observation.  The first explanation that occurs to me is that
they're hapaxes because I'm now training on error, not because they are
in fact scarcely ever encountered.  As I've mentioned before, I
recommend and practise full training up to about 10-20,000 each of spam
and nonspam, and training on error (messages classified wrongly or as
unsure) thereafter.  I seem to have built up quite a large vocabulary
in the full-training period (my current wordlist at home has about
770,000 tokens in it, and at work there are 5,470,037 of them).  After
a period of reinforcement with training on error, both seem to be doing
quite a good job of classification, thereby making it unnecessary to
increase the counts with many new entries.  (By "quite a good job" I
mean, for example, that I'm getting around 0.01% of false positives --
I've only seen one so far this year, out of 60,000-odd emails -- and
0.8% or so of false negatives in my personal mail.  That's good for me,
though an ISP would probably hike the spam cutoff up so as to keep fp
well below my level.)
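
In case it helps to see that regime spelled out, here's a rough sketch
in Python (the -s and -n registration flags and the 0/1/2 exit codes
are bogofilter's; the wrapper itself, and classifying from a saved file
rather than inline in the delivery pipeline, are just for
illustration):

    import subprocess

    def verdict(path):
        """bogofilter's exit code: 0 = spam, 1 = nonspam, 2 = unsure."""
        with open(path, "rb") as msg:
            return subprocess.run(["bogofilter"], stdin=msg).returncode

    def maybe_train(path, is_spam):
        """Train on error: register only misclassified or unsure messages."""
        v = verdict(path)
        wrong = (v == 0) != is_spam        # verdict disagrees with the truth
        if wrong or v == 2:
            flag = "-s" if is_spam else "-n"
            with open(path, "rb") as msg:
                subprocess.run(["bogofilter", flag], stdin=msg)

During the full-training phase one would simply call the registration
step unconditionally instead.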

It's an obvious hazard of training on error that it prevents the
training database from becoming and staying fully representative of the
message population as a whole.  The low-count tokens that remain strong
enough to contribute to correct discrimination are an extreme example
of this.  Perhaps that's not a drawback: although it certainly seems
risky as a strategy, it appears to be working in practice for many of
us who've tried it (and the Spambayes folks drew a similar conclusion).

Another effect of the partial training-on-error scheme I follow seems
to be that optimal results are obtained when a very low min_dev is
used. When I use my production training db, bogotune typically
recommends min_dev in the range of about 0.02 to 0.05.  Yet, if I take
a portion of the messages available for tuning and build a new database
by full training, bogotune will end up recommending a min_dev well
above 0.4 when that database is used.  This would seem to imply that
the production training db has a lot of words of low significance (fw
close to 0.5) that nevertheless can be seen above the noise -- which
may not be the case when a new training db is built with full training.
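
For what it's worth, the min_dev gate is nothing more than a band
around 0.5 that a token's f(w) must escape before the token gets a
vote.  Schematically (the token spamicities here are made up):

    def select_tokens(fw_by_token, min_dev):
        """Keep tokens whose f(w) deviates from 0.5 by at least min_dev."""
        return {t: fw for t, fw in fw_by_token.items()
                if abs(fw - 0.5) >= min_dev}

    fws = {"viagra": 0.99, "meeting": 0.04, "offer": 0.55, "the": 0.51}
    print(select_tokens(fws, 0.02))   # keeps viagra, meeting and offer
    print(select_tokens(fws, 0.40))   # keeps only viagra and meeting

The surviving f(w) values are then combined (bogofilter's default is
the Robinson-Fisher inverse chi-square calculation), so a low min_dev
just admits more weakly-indicative tokens into that combination.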

I doubt that I've satisfied your quest for theory -- the fact is that I
don't entirely understand the results I'm getting.  We all know that
we're misapplying Bayesian classification in using it on email in the
way that we do; our excuse is that what we do seems to work fairly well
and cost relatively little.  Explaining why is none too easy.

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |



