Alternative use for bogofilter

David Relson relson at osagesoftware.com
Mon Jun 6 23:53:57 CEST 2005


On Mon, 06 Jun 2005 16:44:22 +0200
Helge Preuss wrote:

> Hi,
> 
> I need to automatically categorize HTML pages based on their content. I 
> had the idea to use bogofilter for this.
> 
> This is how I go about it:
> - download examples of web pages of a category, and counterexamples
> - train bogofilter to use the pages belonging to the desired category as 
> ham, and the counterexamples as spam
> - move the generated database to a separate directory
> - repeat for every category I want to autodetect
> When I want to detect if an HTML page belongs to a specific category, I 
> give the path to the corresponding database with the -d switch.
> 
> My first tests showed encouraging results, but before I go further I'd 
> like to ask whether anyone has done this before, whether I am overlooking 
> principal limitations of bogofilter or Bayes filtering in general, or 
> whether you have any other thoughts or comments.
> 
> Thanks,
> 
> Helge

Hello Helge,

Sounds like a workable plan.  There have been periodic questions about
using bogofilter for other types of binary classification.  Your goals
sound more ambitious.  I think you'll be breaking new ground.  It
should be doable.

One small tip - use the '-H' switch.  It'll tell bogofilter to skip the
normal special processing for email header lines.
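
To make the plan concrete, here is a rough Python sketch of the
per-category workflow you describe, with '-H' added.  Treat it as an
illustration under a few assumptions, not a tested recipe: the function
names and directory layout are made up for the example, it assumes
bogofilter is on your PATH, and it trains straight into each category's
directory with '-d' instead of moving the database afterwards.  The
switches ('-n' register as ham, '-s' register as spam, '-d' database
directory, '-I' input file) and the exit statuses (0 spam, 1 ham,
2 unsure) are as I recall them from the man page, so please check them
against your installed version.

```python
import subprocess
from pathlib import Path

# Exit statuses bogofilter uses for classification runs (per its man page;
# verify against the installed version).
EXIT_SPAM, EXIT_HAM, EXIT_UNSURE = 0, 1, 2


def train_command(db_dir, page, in_category):
    """Build the command that registers one page in a category's database.

    Pages in the category are registered as ham (-n) and counterexamples
    as spam (-s). -H disables the email header processing.
    """
    label = "-n" if in_category else "-s"
    return ["bogofilter", "-H", "-d", str(db_dir), label, "-I", str(page)]


def classify_command(db_dir, page):
    """Build the command that scores one page against a category's database."""
    return ["bogofilter", "-H", "-d", str(db_dir), "-I", str(page)]


def interpret(exit_status):
    """Map bogofilter's exit status onto the category question."""
    if exit_status == EXIT_HAM:
        return "in category"
    if exit_status == EXIT_SPAM:
        return "not in category"
    if exit_status == EXIT_UNSURE:
        return "unsure"
    raise RuntimeError(f"bogofilter failed with exit status {exit_status}")


def train(db_dir, examples, counterexamples):
    """Register every example and counterexample page for one category."""
    # Hypothetical layout: one database directory per category.
    Path(db_dir).mkdir(parents=True, exist_ok=True)
    for page in examples:
        subprocess.run(train_command(db_dir, page, True), check=True)
    for page in counterexamples:
        subprocess.run(train_command(db_dir, page, False), check=True)


def classify(db_dir, page):
    """Run bogofilter on one page and report the verdict."""
    result = subprocess.run(classify_command(db_dir, page))
    return interpret(result.returncode)
```

You would call train() once per category with its own directory, then
classify() a new page against each directory in turn.  Since every
database answers only "this category or not", a page can come back as
belonging to several categories or to none.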

HTH,

David



