the importance of robx
relson at osagesoftware.com
Sat Feb 28 19:32:45 EST 2004
On 28 Feb 2004 19:17:07 -0500
Tom Anderson wrote:
> On Sat, 2004-02-28 at 18:51, David Relson wrote:
> > And then I thought of a wordlist histogram and the large numbers of
> > pure ham/spam and the almost as large numbers of hapaxes. Given
> > that hapaxes are so numerous, one can conclude that many words never
> > get beyond their hapax/robx value.
> I thought that too, but then I had a conversation with pi about it
> which made a lot of sense... in general, your robx value is usually
> closer to 0.5 than your min_dev value. My robx is 0.48 and my min_dev
> is 0.2. This means that hapaxes will have no effect on your
> classifications. Not until they are registered several times (how many
> depending on robs) and pull out of that min_dev zone. Therefore, robx
> has very little effect. In fact, I'd wager it is almost equivalent
> anywhere arbitrarily between 0.5-min_dev and 0.5+min_dev.
Thanks for joining in on this conversation. Greg & I have run bogotune
many times and have spent much time tuning and tweaking and thinking
about the whys and the wherefores.
Here are the first few lines of "bogotune -vv" output with 8000 each
ham/spam in the wordlist and 32000 each ham/spam used in the tuning. The
columns show cnt (iteration number), rs (robs), md (min_dev), rx (robx),
cutoff (spam_cutoff), fp (false positive count), and fn (false negative
count). Note: the fp values are "engineered" to be a certain percentage
of the message count, and the spam_cutoff is picked to make this fp
happen.
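As a rough illustration of picking a cutoff to hit a target fp count, here is a small sketch. It is not bogotune's actual code: the function names, the made-up scores, and the rule that a message counts as spam when its score is at or above the cutoff are all assumptions for the example.

```python
def pick_cutoff(ham_scores, target_fp):
    """Return a cutoff that classifies the target_fp highest-scoring hams as spam.

    Assumes a message is called spam when score >= cutoff. Tied scores at the
    cutoff can push the real fp count above target_fp.
    """
    if not 0 < target_fp <= len(ham_scores):
        raise ValueError("target_fp must be between 1 and the number of hams")
    ranked = sorted(ham_scores, reverse=True)
    return ranked[target_fp - 1]


def count_errors(ham_scores, spam_scores, cutoff):
    """Count false positives (ham >= cutoff) and false negatives (spam < cutoff)."""
    fp = sum(1 for s in ham_scores if s >= cutoff)
    fn = sum(1 for s in spam_scores if s < cutoff)
    return fp, fn


# Made-up scores, purely for illustration.
ham = [0.10, 0.20, 0.95, 0.60, 0.30]
spam = [0.99, 0.50, 0.70]

cutoff = pick_cutoff(ham, target_fp=2)
print(cutoff, count_errors(ham, spam, cutoff))
```

With fp pinned this way, the fn count is the only error figure left free to vary between runs.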
Part of what initiated this thread was seeing the big changes in fn
that correspond to a change in robx alone. Notice how the rx changes (in
increments of 0.05) produce major differences in the fn counts, from
2034 at rx=0.339 down to 1463 at rx=0.539.
cnt  rs      md     rx     cutoff    fp  fn
  1  1.0000  0.050  0.439  0.929204  23  1868
  2  1.0000  0.050  0.389  0.906199  23  1971
  3  1.0000  0.050  0.489  0.973740  23  1501
  4  1.0000  0.050  0.339  0.867208  23  2034
  5  1.0000  0.050  0.539  0.977234  23  1463
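One way to look at how robx enters the picture at these settings is Robinson's smoothed token score, f(w) = (robs*robx + n*p(w)) / (robs + n), applied to a hapax (n = 1). The sketch below is only an illustration: the function names and the exact form of the min_dev comparison are assumptions, not bogofilter's source.

```python
def robinson_f(p, n, robs, robx):
    """Robinson's smoothed score: f(w) = (robs*robx + n*p) / (robs + n).

    p is the raw spamminess of the token, n the number of times it was seen.
    """
    return (robs * robx + n * p) / (robs + n)


def outside_min_dev(f, min_dev):
    """Assumed rule: a token counts only if it is at least min_dev from 0.5."""
    return abs(f - 0.5) >= min_dev


robs, min_dev = 1.0, 0.05  # rs and md from the bogotune lines
for robx in (0.339, 0.389, 0.439, 0.489, 0.539):
    spam_hapax = robinson_f(1.0, 1, robs, robx)  # seen once, only in spam
    ham_hapax = robinson_f(0.0, 1, robs, robx)   # seen once, only in ham
    print(f"robx={robx:.3f}  spam hapax={spam_hapax:.4f}  "
          f"ham hapax={ham_hapax:.4f}  "
          f"counted={outside_min_dev(spam_hapax, min_dev)}/"
          f"{outside_min_dev(ham_hapax, min_dev)}")
```

With rs = 1.0, a hapax seen only in spam scores (robx + 1)/2 and one seen only in ham scores robx/2, so each 0.05 step in robx shifts every such token by 0.025, and at md = 0.05 all of them fall outside the min_dev zone.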
> That brings up the question... what happens if you have a message
> composed entirely of new words?
Its score is robx. Robx should be less than spam_cutoff so that the
message is considered ham, not spam. This is desirable because it's
better to have false negatives (spam getting through) than false
positives (ham getting canned).
It's virtually impossible to have a message composed entirely of new
words. After all, your email address appears in several tokens in
bogofilter's parsing :-)