the importance of robx

Tom Anderson tanderso at oac-design.com
Sun Feb 29 01:17:07 CET 2004


On Sat, 2004-02-28 at 18:51, David Relson wrote:
> And then I thought of a wordlist histogram and the large numbers of pure
> ham/spam and the almost as large numbers of hapaxes.  Given that hapaxes
> are so numerous, one can conclude that many words never get beyond their
> hapax/robx value.

I thought that too, but then I had a conversation with pi about it which
made a lot of sense.  In general, robx sits closer to 0.5 than min_dev
allows, i.e. |robx - 0.5| < min_dev.  My robx is 0.48 and my min_dev is
0.2, so robx is only 0.02 away from 0.5.  This means that hapaxes will
have no effect on your classifications until they are registered several
times (how many depends on robs) and pull out of that min_dev zone.
Therefore, robx has very little effect.  In fact, I'd wager the results
are almost equivalent for any robx between 0.5-min_dev and 0.5+min_dev.
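
To make the arithmetic concrete, here is a rough Python sketch.  It
assumes Robinson's smoothing formula f(w) = (robs*robx + n*p(w)) / (robs + n)
and assumes a token is ignored whenever |f(w) - 0.5| < min_dev.  The
function names are made up for illustration and are not bogofilter's
actual code, and the robs values tried in the usage note are arbitrary.

```python
def robinson_f(n, p, robs, robx):
    """Smoothed token score, assuming f(w) = (robs*robx + n*p) / (robs + n).

    n is how many times the token has been registered, p is its raw
    spam probability from the wordlist counts.
    """
    return (robs * robx + n * p) / (robs + n)


def is_scored(f, min_dev):
    """Assumed rule: a token counts only if it is at least min_dev from 0.5."""
    return abs(f - 0.5) >= min_dev


def sightings_needed(p, robs, robx, min_dev, limit=1000):
    """Smallest n at which the token leaves the min_dev zone, or None."""
    for n in range(limit + 1):
        if is_scored(robinson_f(n, p, robs, robx), min_dev):
            return n
    return None


if __name__ == "__main__":
    robx, min_dev = 0.48, 0.2      # the values quoted in this message
    for robs in (2.0, 5.0):        # illustrative robs values only
        n = sightings_needed(1.0, robs, robx, min_dev)
        print(f"robs={robs}: pure-spam token is scored after {n} sightings")
```

Under these assumptions a word that has never been registered (n = 0)
gets exactly robx, which is inside the min_dev zone for my settings.
How many sightings it takes to escape the zone grows with robs, and
with a small enough robs a single sighting is already enough.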

That brings up the question... what happens if you have a message
composed entirely of new words?

Tom




