"unknown" classification?

Fri Jul 26 02:39:29 CEST 2013

On Fri, 26 Jul 2013 00:42:14 +0200
Matthias Andree wrote:

> On 25.07.2013 23:00, Tamer Yousef wrote:
> > I have bogofilter version 1.2.4, with the following counts in the
> > training set:
> > spam: 82,799 & good: 101,798
> > 
> > I ran the following command through the filter and the result was
> > Unknown with a score of 0.52.
> > 
> > bogofilter -t -vvv <<< "
> > 
> > I work in games and simulations. And over the years games have
> > become increasingly more prevalent as a kind of touchstone for
> > design more broadly. So the project that I’m working on now that
> > I’m going to talk about is an attempt to generalize from my
> > experience in game design to almost anything else."
> > 
> > here is the output of the command:
> > 
> >                          n     pgood      pbad        fw  U
> > "going"              34609  0.247117  0.114168  0.316006  -
> > "anything"           15831  0.109992  0.055967  0.337233  -
> ...
> > "increasingly"        1865  0.005982  0.015169  0.717163  -
> > "simulations"           62  0.000138  0.000580  0.808173  -
> > N_P_Q_S_s_x_md           0  0.000000  0.000000  0.520000  0.017800  0.520000  0.375000
> > 
> > 
> > I do not really understand the meaning of the headers at the top
> > ("pgood", "pbad", "fw"), so these values do not make sense to me.
> > I'm wondering why this text, which is non-spam, was not identified
> > as such and instead fell into the unknown bucket. The training set
> > I have is mostly well categorized, with a few exceptions, but the
> > text I'm examining does not have any spam-like tokens.
> > Any help on this is appreciated!
> 
> Tamer,
> 
> pgood and pbad are the fractions of ham and spam training messages,
> respectively, in which the token on that same line was found.  Based
> on those, f(w) (printed as fw) is calculated; it is the degree of
> belief that a token is spam. The details are in
> <http://www.linuxjournal.com/article/6467>; the calculations are made
> in the file src/prob.c (and the chi-squared distribution is taken
> from the GNU Scientific Library, GSL).
> 
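
As a rough illustration of the f(w) calculation described above, here is
a minimal Python sketch of Robinson's formula from the Linux Journal
article. It is not taken from src/prob.c; the function name f_w is made
up for this example, and the values of s and x are assumed to be the
0.017800 and 0.520000 shown in the N_P_Q_S_s_x_md row of the output.

```python
# Sketch of Robinson's f(w); NOT copied from bogofilter's src/prob.c.
ROBS = 0.0178  # s: strength given to the prior (assumed from the summary row)
ROBX = 0.52    # x: spamicity assumed for a token with no history (assumed)


def f_w(pgood, pbad, n, s=ROBS, x=ROBX):
    """Degree of belief that a message containing the token is spam."""
    if pgood + pbad == 0:
        return x  # token never seen in training: fall back to the prior
    p = pbad / (pgood + pbad)  # raw spam probability of the token
    return (s * x + n * p) / (s + n)  # pull p towards x when n is small


# Values from the "going" row of the quoted output.
print(round(f_w(0.247117, 0.114168, 34609), 6))
```

With n in the tens of thousands the prior has almost no influence, so the
result should land very close to the fw column printed by bogofilter.
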
> Now, the text apparently lacks tokens that are clearly and
> predominantly found in good messages (f(w) close to 0; your tokens
> start out at 0.316 only), and ultimately, given your parameter set
> (or the default parameters), all tokens are deemed too indiscriminate
> and are ignored (U means "used": - in that column means unused,
> + means used).
> 
> Given there are no tokens deemed significant enough, you end up with
> the default score of 0.52 which is "unknown".
> 
> You can tweak parameters a bit (toy with bogofilter's -m option); the
> parameters can be viewed with bogofilter -Q, or -QQ, and the defaults
> with bogofilter -QQC.  Tokens are taken into account if
> |f(w) - 0.5| > min-dev, and the min-dev default is 0.375, so only
> tokens with f(w) < 0.125 or f(w) > 0.875 are used for the calculation.
> 
> You might get away with a min-dev of 0.1, which takes tokens with
> f(w) < 0.4 or f(w) > 0.6 into account, but carefully check whether
> the result is plausible.
> 
> There are more parameters to tweak, but remember that statistics,
> even with a decent training set like yours, are less meaningful for
> isolated cases.  That is to say, do not tweak the parameters for just
> a few messages, but instead see to it that the overall outcome seems
> right to you.  There are tools such as bogotune to help you optimize
> parameters, but again, since this is statistics, it is not
> guaranteed to work out for individual experiments (messages), but
> only asymptotically over many inputs.
> 
> Hope that helps!
> 
> Best regards
> Matthias

Hello Tamer,

As additional information, look at bogofilter's config file.  Typically
it's /etc/bogofilter.cf, though your installation may have placed
it elsewhere.

In the config file is the following:

########### Classification Constants Settings #######################
#
# See man page for a more detailled description of the parameters.

#### MINIMUM DEVIATION
#
#	if token spamicity closer to EVEN_ODDS (0.5)
#	than MIN_DEV, don't use the word in the
#	spamicity calculation
#
#min_dev=0.375				# default

This parameter is also settable on the command line with the "-m"
option (as Matthias has already described to you).
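
To make the cutoff concrete, here is a small Python sketch of the min_dev
test applied to the four fw values visible in the quoted output (the
other tokens were elided there, so this is not the full message). The
helper name used_tokens is made up for illustration, and treating a
deviation exactly equal to min_dev as "used" is an assumption based on
the wording of the config comment.

```python
# Illustrative only: fw values copied from the quoted bogofilter -vvv output.
EVEN_ODDS = 0.5

tokens = {
    "going": 0.316006,
    "anything": 0.337233,
    "increasingly": 0.717163,
    "simulations": 0.808173,
}


def used_tokens(fw_by_token, min_dev):
    """Return the tokens whose fw is at least min_dev away from EVEN_ODDS."""
    return [
        token
        for token, fw in fw_by_token.items()
        if abs(fw - EVEN_ODDS) >= min_dev  # boundary handling is an assumption
    ]


print(used_tokens(tokens, 0.375))  # default min_dev: no token qualifies
print(used_tokens(tokens, 0.1))    # relaxed min_dev: all four qualify
```

With the default of 0.375 none of these tokens is far enough from 0.5,
which matches the "-" in the U column; with 0.1 all four would count.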

HTH,

David