Bogofilter for general filesystem classification

Ben Martin monkeyiq at users.sourceforge.net
Sun Sep 14 05:27:02 CEST 2003


On Sun, 2003-09-14 at 01:57, David Relson wrote:
> Hello Ben,
> 
> You raise some interesting questions.  It seems probable that a bayesian
> filter can assign emblems, given the needed data, i.e. wordlists. 
> That's a matter of training, which is simple enough.  Returning values
> in range -100::+100 is easy too.  Determining optimal values for
> ham_cutoff and spam_cutoff is harder.  There are two main methods - the
> empirical methods (a.k.a. trial and error) and bogotune.  I would
> recommend creating your test corpora and running bogotune (which is in
> the bogofilter/tuning subdirectory) to determine parameters that fit
> _your_ mix of messages (files).

Hmm, does bogotune do very bad things if there are fewer than 2000 spam
and 2000 ham in the database? If there were a way to override this
limit, I could run bogotune once I have added all the test cases in a
train(). As it stands, I can only do that if I have 4000 trained
examples plus 1000 additional examples to pass to bogotune. Obviously
the results of bogofilter will get much worse as the number of training
examples drops, but allowing training + tuning on fewer examples may
still let me get a ham_cutoff and spam_cutoff from bogotune. Something
like a --allow-much-less-optimal-tuning option for bogotune would make
it clear to users that they are holding a large gun to their foot by
using it. Failing that, I think a rough stab at values for the cutoffs
is the only other option.

What is the range of the value for a -T -3 classification run? Given the
range (and if it's a linear scale) I should be able to scale it nicely to
-100:0:+100. (See also my later reply about the config file option.)
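
A minimal sketch of that scaling, in Python. It assumes the score is a
spamicity value between 0.0 and 1.0 and that the scale is linear; neither
is confirmed in this thread, so both bounds are parameters. The name
scale_score is made up for illustration.

```python
def scale_score(score, lo=0.0, hi=1.0):
    """Linearly map a score in [lo, hi] onto [-100, +100].

    lo maps to -100 (HAM), the midpoint maps to 0 (unsure) and
    hi maps to +100 (SPAM). The default bounds are an assumption,
    not something taken from the bogofilter documentation.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    scaled = (score - lo) / (hi - lo) * 200.0 - 100.0
    # Clamp so that out-of-range input cannot leave [-100, +100].
    return max(-100.0, min(100.0, scaled))
```

If the real range turns out to be different, only lo and hi need to
change.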

> 
> FWIW, as a minor shortcut, "-PH -Pi -Pt" can be shortened to "-PHit". 

I might combine -PHi, but I'm looking to set -Pt/-PT depending on
whether the input file's mimetype is HTML-like.
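
Roughly what I have in mind, as a Python sketch. The helper names are
hypothetical, the mimetype is guessed from the file name only, and
which of -Pt/-PT belongs with HTML-like input is an assumption here
(swap the two if it is the other way around). The remaining flags are
the ones from my original command line.

```python
import mimetypes


def html_like(path):
    """Return True if the file name suggests an HTML-like mimetype."""
    mime, _encoding = mimetypes.guess_type(path)
    return mime is not None and "html" in mime


def bogofilter_args(path, db_dir="/tmp/my-new-agent"):
    """Build the bogofilter argument list for classifying one file."""
    # Assumption: -Pt for HTML-like input, -PT for everything else.
    tag_flag = "-Pt" if html_like(path) else "-PT"
    return ["bogofilter", "-d", db_dir, "-W", "-PHi", tag_flag, "-T", "-3"]
```

The list can be handed to subprocess.run() with the file contents on
stdin.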

>  
> 
> Keep us posted on your project.  It sounds interesting.

:) The main issue I have now, project-wise, is whether folks using a
filesystem/file manager are willing to provide a large enough
training sample for the SVM/Bayes stuff to work acceptably.

> 
> Also, it'd be quite easy to add a "scoring_range={min},{max}" option to
> the config file.  Let me know if you need it.

Well, assuming that folks will want to use the -T -3 output in other
apps as well, it would be handy to have. I presume that, given there are
no long option names in the man page for bogofilter, having a
--scoring-range="" would be much more hassle.


> David
> 
> On 14 Sep 2003 01:26:24 +1000
> Ben Martin <monkeyiq at users.sourceforge.net> wrote:
> 
> > Hi,
> >   Although it seems that most Bayesian classifiers are being used for
> > SPAM detection I'm looking to use them to help assign emblems to files
> > in a filesystem [1] (a classification by any other name).
> > 
> > The top screenshot on [2] shows some of the new agent stuff I've added
> > to libferris to go about allowing SVM and Bayesian stuff to interact
> > with general filesystems.
> > 
> > Which brings me to my main question from RTFMing on bogofilter.
> > You'll notice that the command I am using to get bogofilter to give
> > its classification uses -T -3:
> > "bogofilter  -d /tmp/my-new-agent -W  -PH -Pi  -PT  -T -3"
> > 
> > From the man page for the -3 option:
> >   "This option is effective only if ham_cutoff is non-zero"
> > 
> > What would folks recommend for the spam/ham cutoffs here? From the
> > point of view of libferris I want to turn the result value into a
> > double from -100 to 100, with 0 meaning unsure, 100 being SPAM and -100
> > being HAM. Any values in between are captured as a fuzzy assertion
> > toward that classification (this assumes that the training cases are
> > treating an emblem being assigned as SPAM).
> > 
> > Thoughts on this would be great, and we should add some more comment
> > to the man page for -o [v][v] from this thread.
> > 
> > Thanks.
> > 
> > [1] http://witme.sourceforge.net/libferris.web/
> > [2] http://witme.sourceforge.net/libferris.web/research/shots.html
> > 
> > 
> 
> 
> -- 
> David Relson                   Osage Software Systems, Inc.
> relson at osagesoftware.com       Ann Arbor, MI 48103
> www.osagesoftware.com          tel:  734.821.8800
> 
> 
> -- 
> David Relson                   Osage Software Systems, Inc.
> relson at osagesoftware.com       Ann Arbor, MI 48103
> www.osagesoftware.com          tel:  734.821.8800
> 
