can I do this with bogofilter

David Relson relson at osagesoftware.com
Thu Sep 1 04:30:19 CEST 2005


On Wed, 31 Aug 2005 19:33:59 +0200
Matthias Andree wrote:

> Tom Allison wrote:
> > I don't know if I can do this or not and the answer would almost require
> > a code review....
> > 
> > Can I use bogofilter to score in a binary fashion (ham/spam) a generic
> > text string to classify it into one of two pools?
> > 
> > It's definitely not email.
> > it's typically only one line, but very long.
> > I only need a binary classification.
> > 
> > Would this still work?
> > 
> > I've looked at some of the code available in CPAN perl modules and they
> > all tend to assume you are using email...
> 
> If the long line looks like text and can be broken up into words, then yes.
> You may, however, want to make the cutoff and robx settings symmetric:
> bogofilter's default settings are deliberately "lopsided", i.e. tilted
> towards "ham" to keep the false positive count near zero.
> 
> I hope I'm not forgetting anything: you'd set robx, spam_cutoff and ham_cutoff
> all to 0.5.
> 
> An alternative might otherwise be CRM114 <http://crm114.sourceforge.net/>
> 
> -- 
> Matthias Andree

Additionally, since the input isn't email and has no email headers, you'll
want the "-H" flag, which turns off the special processing of header tokens.

Setting both spam_cutoff and ham_cutoff to 0.5 would classify _every_
text string as either ham or spam.  You might want to allow for unsures,
using a symmetric interval around 0.5.  One example would be 0.4 for
ham_cutoff and 0.6 for spam_cutoff.  Experimentation would be necessary
to find values that suit your data.
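
To illustrate how the two cutoffs split the score range, here is a minimal
Python sketch.  It is not bogofilter code: the function name classify() is
made up for this example, and the handling of scores that land exactly on a
cutoff (spam checked first, boundaries inclusive) is an assumption.  It only
shows that a gap between ham_cutoff and spam_cutoff creates an "unsure"
band, and that setting both to 0.5 removes it.

```python
def classify(score, ham_cutoff=0.4, spam_cutoff=0.6):
    """Map a score in [0, 1] to 'ham', 'unsure', or 'spam'.

    Hypothetical helper for illustration; boundary handling is assumed.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if ham_cutoff > spam_cutoff:
        raise ValueError("ham_cutoff must not exceed spam_cutoff")
    # Assumption: the spam test comes first and both bounds are inclusive.
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    # Anything strictly between the two cutoffs is left undecided.
    return "unsure"


if __name__ == "__main__":
    for s in (0.10, 0.50, 0.90):
        # Symmetric interval (0.4, 0.6) versus a single cutoff at 0.5.
        print(s, classify(s), classify(s, ham_cutoff=0.5, spam_cutoff=0.5))
```

With the 0.4/0.6 pair a score of 0.5 comes back as "unsure"; with both
cutoffs at 0.5 the unsure band is empty, so every score is forced into one
of the two pools.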


