"unknown" classification?
David Relson
relson at osagesoftware.com
Fri Jul 26 02:39:29 CEST 2013
On Fri, 26 Jul 2013 00:42:14 +0200
Matthias Andree wrote:
> Am 25.07.2013 23:00, schrieb Tamer Yousef:
> > I have bogofilter version 1.2.4, with the following number in the
> > training set:
> > spam: 82,799 & good:101,798
> >
> > I ran the following command through the filter and I got results as
> > Unknown with 0.52 score.
> >
> > bogofilter -t -vvv <<< "
> >
> > I work in games and simulations. And over the years games have
> > become increasingly more prevalent as a kind of touchstone for
> > design more broadly. So the project that I’m working on now that
> > I’m going to talk about is an attempt to generalize from my
> > experience in game design to almost anything else."
> >
> > here is the output of the command:
> >
> >                     n     pgood      pbad        fw  U
> > "going"         34609  0.247117  0.114168  0.316006  -
> > "anything"      15831  0.109992  0.055967  0.337233  -
> ...
> > "increasingly"   1865  0.005982  0.015169  0.717163  -
> > "simulations"      62  0.000138  0.000580  0.808173  -
> > N_P_Q_S_s_x_md 0 0.000000 0.000000 0.520000 0.017800 0.520000 0.375000
> >
> >
> > I do not really understand the meaning of the headers at the top
> > ("pgood", "pbad", "fw"), so these values do not make sense to me.
> > I'm wondering why this text, which is non-spam, was not identified
> > as such and instead fell into the unknown bucket. The training set
> > I have is mostly well categorized, with a few exceptions, but the
> > text I'm examining does not have any spam-like tokens.
> > Any help on this is appreciated!
>
> Tamer,
>
> pgood and pbad are the probabilities that a given token (on that same
> line) was found in a ham or spam message. Based on those, f(w)
> (printed as fw) is calculated; it is the degree of belief that a
> token is spam. The details are in
> <http://www.linuxjournal.com/article/6467>; the calculations are made
> in the file src/prob.c (and the chi-squared distribution is taken
> from the GNU Scientific Library, GSL).
>
> Now, the text apparently lacks tokens that are clearly and
> predominantly found in good messages (f(w) close to 0; your lowest
> token is only at 0.316), and ultimately, given your parameter set (or
> the default parameters), all tokens are deemed too indiscriminate and
> are ignored (in the U column, "-" means unused and "+" means used).
>
> Given there are no tokens deemed significant enough, you end up with
> the default score of 0.52 which is "unknown".
>
> You can tweak parameters a bit (toy with bogofilter's -m option); the
> parameters can be viewed with bogofilter -Q, or -QQ, and the defaults
> with bogofilter -QQC. Tokens are taken into account if
> |f(w) - 0.5| > min-dev, and the min-dev default is 0.375, so only
> tokens with f(w) < 0.125 or f(w) > 0.875 are used for the calculation.
>
> You might get away with a min-dev of 0.1, which takes tokens with
> f(w) < 0.4 or f(w) > 0.6 into account, but carefully check whether
> the result is plausible.
>
> There are more parameters to tweak, but remember that statistics like
> these, even with a decent training set like yours, say little about
> isolated cases. That is, do not tweak the parameters for just a few
> messages, but instead see to it that the overall outcome seems right
> to you. There are tools such as bogotune to help you optimize
> parameters, but again, since this is statistics, they are not
> guaranteed to work out for individual experiments (messages), only
> asymptotically over many inputs.
>
> Hope that helps!
>
> Best regards
> Matthias
Hello Tamer,

As additional information, look at bogofilter's config file. Typically
it's in /etc/bogofilter.cf, though your installation may have placed
it elsewhere.

In the config file is the following:

########### Classification Constants Settings #######################
#
# See man page for a more detailled description of the parameters.
#### MINIMUM DEVIATION
#
# if token spamicity closer to EVEN_ODDS (0.5)
# than MIN_DEV, don't use the word in the
# spamicity calculation
#
#min_dev=0.375 # default

This parameter is also settable on the command line with the "-m"
option (as Matthias has already described to you).

HTH,
David
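
To make the min_dev cutoff concrete, here is a minimal Python sketch that applies the rule Matthias describes, |f(w) - 0.5| > min_dev, to the four fw values visible in Tamer's output. It is only an illustration of the arithmetic, not bogofilter's actual code (that lives in src/prob.c); the helper name significant_tokens is hypothetical, and treating the comparison as a strict ">" is an assumption taken from the wording of the thread.

```python
# fw values copied from the bogofilter -vvv output quoted above.
TOKENS = [
    ("going", 0.316006),
    ("anything", 0.337233),
    ("increasingly", 0.717163),
    ("simulations", 0.808173),
]

EVEN_ODDS = 0.5  # neutral spamicity, as named in the config comment


def significant_tokens(tokens, min_dev):
    """Hypothetical helper: names of tokens with |fw - 0.5| > min_dev.

    Assumes a strict comparison, as described in the thread.
    """
    return [name for name, fw in tokens if abs(fw - EVEN_ODDS) > min_dev]


if __name__ == "__main__":
    # Compare the default cutoff with the looser one suggested above.
    for min_dev in (0.375, 0.1):
        print(min_dev, significant_tokens(TOKENS, min_dev))
```

With the default of 0.375 none of these four tokens qualifies, which matches the "-" in the U column; with 0.1 all four would be taken into account.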