Alternative use for bogofilter
relson at osagesoftware.com
Mon Jun 6 17:53:57 EDT 2005
On Mon, 06 Jun 2005 16:44:22 +0200
Helge Preuss wrote:
> I need to automatically categorize HTML pages based on their content. I
> had the idea to use bogofilter for this.
> This is how I go about it:
> - download examples of web pages of a category, and counterexamples
> - train bogofilter to use the pages belonging to the desired category as
> ham, and the counterexamples as spam
> - move the generated database to a separate directory
> - repeat for every category I want to autodetect
> When I want to detect if an HTML page belongs to a specific category, I
> give the path to the corresponding database with the -d switch.
> My first tests showed encouraging results, but before I go further I'd
> like to ask you whether anyone has done this before, whether I am
> overlooking principal limitations of bogofilter or Bayes filtering in
> general, or if you have any other thoughts or comments.
Sounds like a workable plan. There have been periodic questions about
using bogofilter for other types of binary classification. Your goals
sound more ambitious. I think you'll be breaking new ground. It
should be doable.
One small tip - use the '-H' switch. It'll tell bogofilter to skip the
normal special processing for email header lines.
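
For what it's worth, the per-category workflow could be scripted roughly
as below. This is only a sketch, not a tested setup: the helper names are
made up for illustration, the one-directory-per-category layout is an
assumption, and the -n (register as ham), -s (register as spam) and -I
(read from file) switches and the exit codes are as I recall them from
the man page, so please check them against your installed version.

```python
import subprocess
from pathlib import Path

# Assumed exit codes (verify with `man bogofilter`):
# 0 = spam, 1 = ham, 2 = unsure, anything else = error.
# In this scheme ham means "belongs to the category".
EXIT_MEANING = {0: "not in category", 1: "in category", 2: "unsure"}


def train_cmd(db_dir, page, in_category):
    """Build the command that registers one page in a category database."""
    # Category examples are trained as ham (-n), counterexamples as spam (-s).
    flag = "-n" if in_category else "-s"
    # -H skips the special handling of email header lines.
    return ["bogofilter", "-H", flag, "-d", str(db_dir), "-I", str(page)]


def classify_cmd(db_dir, page):
    """Build the command that scores one page against one category database."""
    return ["bogofilter", "-H", "-d", str(db_dir), "-I", str(page)]


def interpret(returncode):
    """Translate a bogofilter exit code into a category verdict."""
    return EXIT_MEANING.get(returncode, "error")


def categorize(db_root, page):
    """Score a page against every category directory under db_root."""
    results = {}
    for db_dir in sorted(Path(db_root).iterdir()):
        if db_dir.is_dir():
            proc = subprocess.run(classify_cmd(db_dir, page))
            results[db_dir.name] = interpret(proc.returncode)
    return results
```

Keeping one database directory per category means a page can match
several categories, one, or none, since each test is an independent
binary decision.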