"unknown" classification?

Fri Jul 26 00:42:14 CEST 2013

On 25.07.2013 23:00, Tamer Yousef wrote:
> I have bogofilter version 1.2.4, with the following number in the training
> set:
> spam: 82,799 & good:101,798
> 
> I ran the following command through the filter and I got results as Unknown
> with 0.52 score.
> 
> bogofilter -t -vvv <<< "
> 
> I work in games and simulations. And over the years games have become
> increasingly more prevalent as a kind of touchstone for design more
> broadly. So the project that I’m working on now that I’m going to talk
> about is an attempt to generalize from my experience in game design to
> almost anything else."
> 
> here is the output of the command:
> 
>                                         n    pgood     pbad      fw     U
>   "going"                           34609  0.247117  0.114168  0.316006 -
>   "anything"                        15831  0.109992  0.055967  0.337233 -
...
>   "increasingly"                     1865  0.005982  0.015169  0.717163 -
>   "simulations"                        62  0.000138  0.000580  0.808173 -
>   N_P_Q_S_s_x_md                        0  0.000000  0.000000  0.520000
>                                            0.017800  0.520000  0.375000
> 
> 
> I do not really understand the meaning of the headers at the top ("pgood
> pbad      fw"), and hence these values do not make sense to me. I'm
> wondering why this text, which is non-spam, was not identified as such and
> fell into the unknown bucket. The training set I have is mostly well
> categorized, except for some cases, but the text I'm examining does not
> have any spam-like tokens.
> Any help on this is appreciated!

Tamer,

pgood and pbad are the probabilities that a given token (on that same
line) is found in a ham or a spam message, respectively.  From those,
f(w) (printed as fw) is calculated; it is the degree of belief that a
message containing that token is spam.  The details are in
<http://www.linuxjournal.com/article/6467>; the calculations are made in
the file src/prob.c (and the chi-squared distribution is taken from the
GNU Scientific Library, GSL).
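
To make the fw column less of a black box, here is a rough Python sketch
of the smoothing formula from the article above, f(w) = (s*x + n*p) /
(s + n) with p = pbad / (pgood + pbad).  It is not bogofilter's actual
code.  I am assuming that the last line of your -vvv output lists s, x
and min-dev in that order (0.0178, 0.52, 0.375), which is what the
N_P_Q_S_s_x_md label suggests; the function and variable names are made
up for this illustration.

```python
# Sketch of Robinson's f(w) smoothing as described in the Linux Journal
# article; NOT bogofilter's actual implementation.
# ASSUMPTION: s and x are read off the last line of the -vvv output.
ROBS = 0.0178  # s: weight given to the prior
ROBX = 0.52    # x: f(w) assumed for a token with no history


def robinson_fw(n, pgood, pbad, s=ROBS, x=ROBX):
    """Return f(w) for a token seen n times with the given pgood/pbad."""
    if pgood + pbad == 0.0:
        return x  # no data at all: fall back to the prior
    p = pbad / (pgood + pbad)  # raw spamminess of the token
    return (s * x + n * p) / (s + n)


# Values copied from the "going" and "anything" rows of the output.
fw_going = robinson_fw(34609, 0.247117, 0.114168)
fw_anything = robinson_fw(15831, 0.109992, 0.055967)
print(f"going    {fw_going:.4f}")
print(f"anything {fw_anything:.4f}")
```

Both results should land close to the fw values bogofilter printed for
those rows; small differences are expected because pgood and pbad are
rounded in the output.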

Now, the text apparently lacks tokens that are clearly and predominantly
found in good messages (those would have f(w) close to 0, whereas your
tokens start at 0.316); and ultimately, given your parameter set (or the
default parameters), all tokens are deemed too indiscriminate and are
ignored (U means "used": - in that column means unused, + means used).

Given there are no tokens deemed significant enough, you end up with the
default score of 0.52 which is "unknown".

You can tweak parameters a bit (toy with bogofilter's -m option); the
parameters can be viewed with bogofilter -Q, or -QQ, and the defaults
with bogofilter -QQC.  Tokens are taken into account if |f(w) - 0.5| >
min-dev, and the min-dev default is 0.375, so only tokens with f(w) <
0.125 or f(w) > 0.875 are used for the calculation.

You might get away with a min-dev of 0.1, which means taking tokens with
f(w) < 0.4 or f(w) > 0.6 into account, but check carefully whether the
result is plausible.
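
To illustrate the min-dev rule, here is a small Python sketch that
applies |f(w) - 0.5| > min-dev to the four fw values visible in your
quoted output (the rows hidden behind "..." are not reconstructed).  It
only mimics the selection step, not the final score calculation, and
the helper name is made up for this illustration.

```python
# f(w) values copied from the quoted -vvv output; only the four rows
# that are visible in the quote are included.
FW = {
    "going": 0.316006,
    "anything": 0.337233,
    "increasingly": 0.717163,
    "simulations": 0.808173,
}


def used_tokens(fw_by_token, min_dev):
    """Return the tokens whose f(w) lies further than min_dev from 0.5."""
    return sorted(t for t, fw in fw_by_token.items() if abs(fw - 0.5) > min_dev)


default_used = used_tokens(FW, 0.375)  # the min-dev from your output
relaxed_used = used_tokens(FW, 0.1)    # the relaxed value suggested above
print("min_dev=0.375:", default_used)
print("min_dev=0.1:  ", relaxed_used)
```

With 0.375 none of these four tokens qualifies, which matches the "-"
in the U column; with 0.1 all four would be used.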

There are more parameters to tweak, but remember that statistics like
these, even with a decent training set like yours, say little about
isolated cases.  That is to say, do not tweak the parameters for just a
few messages, but instead see to it that the overall outcome seems right
to you.  There are tools such as bogotune to help you optimize
parameters, but again, since this is statistics, it is not guaranteed to
work out for individual experiments (messages), only asymptotically over
many inputs.

Hope that helps!

Best regards
Matthias