token degeneration

David Relson relson at osagesoftware.com
Tue Jul 29 17:05:59 CEST 2003


Greetings,

Back in May, Paul Graham published a second article, "Better Bayesian 
Filtering", http://www.paulgraham.com/better.html, in which he detailed 
several parsing changes that he found improved spam filtering.  The changes 
included case sensitivity, tagging words in key header lines (from, to, 
subject, etc.), and tagging words in key HTML tags ("a", "font", and 
"img").  The usefulness of these changes was verified and bogofilter's 
parsing was modified.  Since version 0.13.0, bogofilter has included these 
changes as its default behavior.

Also mentioned in the article was an idea, called "degeneration", for 
dealing with certain unmatched tokens.  For example, if the database has 
entries for "free" and "Free" and the token "FREE" is encountered, it's 
reasonable to look at the two known entries and use one of them.  If you're 
interested, the full details of his idea are in the article.

To get back to bogofilter, I've added degeneration capabilities.  I've also 
provided options to turn them on or off, since the effort to match an 
unknown token can be time-consuming.  Here are the new command-line flags 
for degeneration:

-PD - disable degeneration (default)
-Pd - enable degeneration
-Pf - enable first match (default)
-PF - enable best indicator

To explain a bit more...

If degeneration is disabled, bogofilter will operate as it does now.  When 
degeneration is enabled and bogofilter encounters a new, unknown token 
containing capital letters and/or exclamation points, the degeneration code 
will kick in and additional tokens will be generated and looked for (in the 
wordlist).  Which token is actually used for the match depends on the 
"first match" / "best indicator" setting.
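
To make the "additional tokens" idea concrete, here is a minimal Python 
sketch of how such alternatives might be generated.  It is an illustration 
only, not bogofilter's actual code (which is written in C).  The function 
name degenerate(), the set of variants, and their ordering are all 
assumptions: the sketch tries less specific capitalizations (as seen, 
Capitalized, lowercase) and fewer exclamation points (all, one, none).

```python
def degenerate(token):
    """Return less specific alternatives for an unknown token.

    Hypothetical helper: the variants and their order are assumptions
    made for illustration, not bogofilter's real algorithm.
    """
    word = token.rstrip("!")
    bangs = len(token) - len(word)

    # Case variants, from most to least specific.  A token that is
    # already all lowercase has no further case variants.
    forms = [word]
    if word != word.lower():
        forms += [word.capitalize(), word.lower()]
    cases = []
    for form in forms:
        if form not in cases:
            cases.append(form)

    # Exclamation variants: all trailing marks, a single mark, none.
    suffixes = []
    for n in (bangs, min(bangs, 1), 0):
        suffix = "!" * n
        if suffix not in suffixes:
            suffixes.append(suffix)

    candidates = [c + s for s in suffixes for c in cases]
    # Drop the unknown token itself and any empty candidate.
    return [c for c in candidates if c and c != token]


print(degenerate("FREE"))
print(degenerate("Free!!!"))
```

With this sketch, a plain lowercase token such as "free" yields no 
alternatives, which matches the rule above that only tokens with capital 
letters and/or exclamation points are degenerated.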

If "first match" is enabled, bogofilter will stop searching when a match is 
found.  The spam/ham counts for the found token will be used in 
scoring.  If "best indicator" is enabled, the wordlist is checked for all 
the alternatives and bogofilter will use the one with the most extreme 
score, i.e. the one whose score is furthest from EVEN_ODDS (0.5).
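
The difference between the two settings can be sketched in a few lines of 
Python.  Again this is only an illustration under stated assumptions: the 
wordlist is modelled as a dict of (spam count, ham count) pairs, the names 
score() and pick() are made up, and the score is a plain spam/(spam+ham) 
ratio rather than bogofilter's real scoring formula.

```python
EVEN_ODDS = 0.5


def score(counts):
    """Simplified spam score from (spam, ham) counts (an assumption)."""
    spam, ham = counts
    total = spam + ham
    return spam / total if total else EVEN_ODDS


def pick(candidates, wordlist, best_indicator=False):
    """Choose which known alternative to use for an unknown token."""
    if not best_indicator:
        # "First match": stop at the first alternative in the wordlist.
        for candidate in candidates:
            if candidate in wordlist:
                return candidate
        return None
    # "Best indicator": check every alternative and keep the one whose
    # score is furthest from EVEN_ODDS.
    known = [c for c in candidates if c in wordlist]
    if not known:
        return None
    return max(known, key=lambda c: abs(score(wordlist[c]) - EVEN_ODDS))


# Made-up counts for the alternatives to the unknown token "FREE".
wordlist = {"Free": (30, 10), "free": (5, 95)}
alternatives = ["Free", "free"]

print(pick(alternatives, wordlist))                       # first match
print(pick(alternatives, wordlist, best_indicator=True))  # best indicator
```

With these made-up counts, "first match" settles on "Free" because it is 
tried first, while "best indicator" prefers "free", whose score of 0.05 is 
further from 0.5 than the 0.75 of "Free".  "Best indicator" always has to 
look up every alternative, which is why it costs more time.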

I'm running some tests on the effectiveness of "first match" and "best 
indicator".  I'll report those results when I have them.

Anyhow, the code is in CVS for those of you who like to live on the cutting 
edge.  Tester feedback is wanted!

The code will be included in bogofilter 0.14.1 later this week.

David




