Bogofilter for general filesystem classification

Sun Sep 14 06:13:22 CEST 2003

Ben,

I'm sure Greg (the writer of bogotune) will comment on this thread
within a day or two and will be able to give you a statistically more
valid response than I can.  Until then ...

Bogotune will do the best it can with the data given to it.  I think
it's worth running it even if you can't provide enough data (files) to
please it.  The numbers generated will be informative, if not
definitive.

As regards linearity of the results, algorithm choice is relevant.  The
algorithm recommended for classifying messages is the Robinson-Fisher
algorithm (selectable by the '-f' flag, or algorithm=fisher in the
config file).  It is bogofilter's default algorithm.  However, it is not
linear.  There's also the Robinson-GeometricMean algorithm ('-r',
algorithm=robinson), which is linear (more or less).  Computing the R-F
result first involves computing the R-GM result and then applying a
chi-square calculation (Fisher's modification) to evaluate the likelihood
that the message is spam or ham (given the number of tokens involved).
This process tends to clump ham scores down near zero and spam scores up
near one.  It also has the "side effect" that messages whose ham-ness
and spam-ness aren't clear are put in the middle (near 1/2).  This is the
"Unsure" group.  Basically you have a choice: linearity (from R-GM)
with ham/spam classification, or non-linearity (from R-F) with
ham/spam/unsure classification.
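
To make the difference concrete, here is a rough Python sketch of the two
ways of combining per-token probabilities.  It follows one common textbook
formulation of Robinson's geometric-mean and Fisher (chi-square) methods.
It is not bogofilter's actual code, and the function names are made up for
illustration.  It assumes every token probability lies strictly between 0
and 1.

```python
import math

def chi2q(x, dof):
    """Upper-tail chi-square probability; valid for even dof only."""
    m = x / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def geometric_mean_score(probs):
    """Robinson geometric-mean style combination (illustrative)."""
    n = len(probs)
    # P is large when tokens look spammy, Q is large when they look hammy.
    p = 1.0 - math.exp(sum(math.log(1.0 - f) for f in probs) / n)
    q = 1.0 - math.exp(sum(math.log(f) for f in probs) / n)
    return (1.0 + (p - q) / (p + q)) / 2.0

def fisher_score(probs):
    """Fisher (chi-square) style combination (illustrative)."""
    n = len(probs)
    h = chi2q(-2.0 * sum(math.log(f) for f in probs), 2 * n)
    s = chi2q(-2.0 * sum(math.log(1.0 - f) for f in probs), 2 * n)
    return (1.0 + h - s) / 2.0

spammy = [0.9] * 10
print(geometric_mean_score(spammy), fisher_score(spammy))
```

With ten tokens at 0.9 each, the geometric-mean score stays at about 0.9,
while the Fisher score is pushed much closer to 1.  An evenly mixed input
such as [0.1, 0.9] lands at 1/2 under both.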

Regarding long option names, you are correct.  Bogofilter doesn't
support them.

David

On 14 Sep 2003 13:27:02 +1000
Ben Martin <monkeyiq at users.sourceforge.net> wrote:

> On Sun, 2003-09-14 at 01:57, David Relson wrote:
> > Hello Ben,
> > 
> > You raise some interesting questions.  It seems probable that a
> > Bayesian filter can assign emblems, given the needed data, i.e.
> > wordlists.  That's a matter of training, which is simple enough.
> > Returning values in range -100::+100 is easy too.  Determining
> > optimal values for ham_cutoff and spam_cutoff is harder.  There are
> > two main methods - the empirical method (a.k.a. trial and error)
> > and bogotune.  I would recommend creating your test corpora and
> > running bogotune (which is in the bogofilter/tuning subdirectory) to
> > determine parameters that fit _your_ mix of messages (files).
> 
> Hmm, does bogotune do very bad things if there are fewer than 2000 spam
> and 2000 ham in the database? If there were a way to override this,
> I could run bogotune after I have added all test cases in a
> train(). But as it is, I can only do that if I have 4000 trained
> examples and 1000 additional examples to pass to bogotune. Obviously the
> results of bogofilter will get much worse as the number of training
> examples drops, but allowing training + tuning on fewer examples may
> give me the ability to get a ham_cutoff and spam_cutoff from bogotune.
> Obviously having an --allow-much-less-optimal-tuning option for
> bogotune would be the way to go, so that the user knows they are
> holding a large gun at their foot by using the option. Apart from that,
> I think a rough stab at values for the cutoffs would be the only other
> option.
> 
> What is the range of the value for a -T -3 classification run? Given
> the range (and if it's a linear scale) I should be able to nicely scale
> it to -100:0:+100. (See also my later reply to the config file option).
> 
> > 
> > FWIW, as a minor shortcut, "-PH -Pi -Pt" can be shortened to
> > "-PHit". 
> 
> I might combine -PHi but I'm looking to set -Pt/-PT depending on whether
> the input file's mimetype is HTML-like.

Fair enough.

> 
> > Keep us posted on your project.  It sounds interesting.
> 
> :) The main issue I have now, project-wise, is whether folks using a
> filesystem/file manager are willing to provide a large enough
> training sample for SVM/Bayes stuff to work acceptably.
> 
> > 
> > Also, it'd be quite easy to add a "scoring_range={min},{max}" option
> > to the config file.  Let me know if you need it.
> 
> Well, assuming that folks will want to take the -T -3 output in other
> apps as well, it would be handy to have. I presume that, given there
> are no long option names in the man page for bogofilter, having a
> --scoring-range="" would be much more hassle.
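
Until such an option exists, the rescaling can be done outside bogofilter.
A minimal Python sketch, assuming the score lies between 0 and 1 (ham near
zero, spam near one); scale_score is a hypothetical helper, not part of
bogofilter:

```python
def scale_score(score, lo=-100.0, hi=100.0):
    """Linearly map a score in [0, 1] onto [lo, hi]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    return lo + score * (hi - lo)

print(scale_score(0.5))
```

A linear map like this only preserves the spacing of the original scores,
so it is most meaningful with the (more or less) linear R-GM output.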