the importance of robx

Greg Louis glouis at
Sun Feb 29 02:04:12 CET 2004

On 20040228 (Sat) at 1917:07 -0500, Tom Anderson wrote:
> On Sat, 2004-02-28 at 18:51, David Relson wrote:
> > And then I thought of a wordlist histogram and the large numbers of pure
> > ham/spam and the almost as large numbers of hapaxes.  Given that hapaxes
> > are so numerous, one can conclude that many words never get beyond their
> > hapax/robx value.
> I thought that too, but then I had a conversation with pi about it which
> made a lot of sense... in general, your robx value is usually closer to
> 0.5 than your min_dev value.

Not necessarily.  For the last year or so my x values have been
around 0.51 to 0.6 and my min_dev (as recommended by bogotune and
predecessors) 0.02 to 0.0316.

>  My robx is 0.48 and my min_dev is 0.2. 
> This means that hapaxes will have no effect on your classifications.

I think you mean unknowns, i.e. tokens that have never been seen at
all.  If a token has been seen exactly once before, it will have quite
a strong influence, pulled toward x only to the degree specified by the
s value.  Most of us use quite small s values, so our hapaxes count
heavily in classification.  I once removed all hapaxes from my training
db to see what would happen, and bogofilter's accuracy worsened by an
order of magnitude!
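
To make the unknown-versus-hapax distinction concrete, here is a minimal
sketch using Robinson's smoothing formula as it is commonly stated,
f(w) = (s*x + n*p) / (s + n).  The function names, the s value of 0.01,
and the inclusive min_dev comparison are assumptions for illustration,
not bogofilter's actual code; p is simplified to the raw spam fraction
for the token.

```python
def robinson_f(p, n, s, x):
    """Smoothed token score: pulls the raw estimate p toward x.

    p: raw spam probability for the token (simplified, assumed)
    n: number of times the token has been seen
    s: strength given to the prior (robs)
    x: score assumed for a never-seen token (robx)
    """
    return (s * x + n * p) / (s + n)


def is_scored(f, min_dev):
    """True if the token is far enough from 0.5 to be used.

    Whether the boundary is inclusive is an assumption here.
    """
    return abs(f - 0.5) >= min_dev


robx, min_dev = 0.48, 0.2  # Tom's values from the quoted message
robs = 0.01                # assumed "quite small" s value

# Unknown token: n = 0, so the score collapses to x itself.
unknown = robinson_f(p=0.0, n=0, s=robs, x=robx)

# Hapax seen once, in spam only: n = 1, p = 1.
hapax_spam = robinson_f(p=1.0, n=1, s=robs, x=robx)

print("unknown:", round(unknown, 4), "scored:", is_scored(unknown, min_dev))
print("hapax:  ", round(hapax_spam, 4), "scored:", is_scored(hapax_spam, min_dev))
```

Under these assumptions the unknown token lands inside the min_dev zone
and is ignored, while the hapax scores close to 1 and is used on its
first registration.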

> Not until they are registered several times (how many depending on robs)
> and pull out of that min_dev zone.  Therefore, robx has very little
> effect.  In fact, I'd wager it is almost equivalent anywhere arbitrarily
> between 0.5-min_dev and 0.5+min_dev.

Running bogotune will contradict this apparently reasonable prediction. 
Even when min_dev is above 0.4, varying x between 0.3 and 0.7 is seen
to make a significant difference to the classification accuracy.  I am
not willing at the moment to try to explain this theoretically (it's
Saturday night and there was a dam' fine bottle of Beaujolais with
dinner) but I'll vouch for its reproducibility with message corpora
from several different sources.

David, I forgot to reply to the list as you requested; perhaps you
could quote me in full if you produce a rejoinder to that reply :)

| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
| |   (on my website or any keyserver) |
| in signatures helps fight junk email. |
