Alternative use for bogofilter

Helge Preuss helge.preuss at gmx.net
Mon Jun 6 16:44:22 CEST 2005


Hi,

I need to categorize HTML pages automatically based on their content,
and I had the idea of using bogofilter for this.

This is how I go about it:
- download example web pages for a category, plus counterexamples
- train bogofilter with the pages belonging to the desired category as
ham and the counterexamples as spam
- move the generated database to a separate directory
- repeat for every category I want to detect

To check whether an HTML page belongs to a specific category, I pass
the directory holding the corresponding database with the -d switch.
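
The steps above could be sketched roughly as follows. This is only an
illustration under stated assumptions, not a tested setup: the directory
layout (DB_ROOT, one subdirectory per category) and all function names
are hypothetical. It relies on bogofilter's -n (register as ham), -s
(register as spam), -d (database directory) and -I (read input from a
file) switches, and on its exit codes (0 = spam, 1 = ham, 2 = unsure,
3 = error).

```python
import subprocess
from pathlib import Path

# Hypothetical layout: one bogofilter database directory per category.
DB_ROOT = Path("bogofilter-dbs")


def db_dir(category):
    """Directory holding the word list for one category (assumed layout)."""
    return DB_ROOT / category


def train_cmd(category, page, belongs):
    """Build the command that registers one page for a category.

    Pages that belong to the category are registered as ham (-n),
    counterexamples as spam (-s).
    """
    flag = "-n" if belongs else "-s"
    return ["bogofilter", "-d", str(db_dir(category)), flag, "-I", str(page)]


def classify_cmd(category, page):
    """Build the command that scores one page against a category."""
    return ["bogofilter", "-d", str(db_dir(category)), "-I", str(page)]


def interpret(returncode):
    """Map bogofilter's exit code to a category verdict (ham = member)."""
    verdicts = {0: "not in category", 1: "in category", 2: "unsure"}
    return verdicts.get(returncode, "error")


def train(category, examples, counterexamples):
    """Train one category database from example and counterexample files."""
    db_dir(category).mkdir(parents=True, exist_ok=True)
    for page in examples:
        subprocess.run(train_cmd(category, page, True), check=True)
    for page in counterexamples:
        subprocess.run(train_cmd(category, page, False), check=True)


def classify(category, page):
    """Run bogofilter on one page and return the verdict for the category."""
    result = subprocess.run(classify_cmd(category, page))
    return interpret(result.returncode)
```

Training directly into a per-category directory with -d is a variation
on the "move the generated database" step; the effect should be the
same, since each category ends up with its own database.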

My first tests showed encouraging results, but before I go further I'd
like to ask whether anyone has done this before, whether I am
overlooking principal limitations of bogofilter or of Bayesian filtering
in general, and whether you have any other thoughts or comments.

Thanks,

Helge


