token degeneration
David Relson
relson at osagesoftware.com
Tue Jul 29 17:05:59 CEST 2003
Greetings,
Back in May, Paul Graham published a second article "Better Bayesian
Filtering", http://www.paulgraham.com/better.html, in which he detailed
several parsing changes that he found improved spam filtering. The changes
included case sensitivity, tagging words in key header lines (from, to,
subject, etc.), and tagging words in key HTML tags ("a", "font", and
"img"). The usefulness of these changes was verified and bogofilter's
parsing was modified. Since version 0.13.0, bogofilter has included these
changes (as its default behavior).
Also mentioned in the article was an idea, called "degeneration", for
dealing with certain unmatched tokens. For example, if the database has
entries for "free" and "Free" and the token "FREE" is encountered, it's
reasonable to look at the two known entries and use one of them. If you're
interested, the full details of his idea are in the article.
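To make the idea concrete, here is a rough sketch of how the fallback forms of an unknown token might be generated. It is not bogofilter's code: the function name degenerate() is made up for illustration, and the exact set and order of variants (case as written, capitalized, lowercase, combined with all, one, or no trailing exclamation points) is an assumption modeled loosely on the article.

```python
def degenerate(token):
    """Return less specific variants of token, most specific first.

    Hypothetical helper for illustration only; the variant set and
    ordering are assumptions, not bogofilter's actual behavior.
    """
    stem = token.rstrip("!")
    bangs = len(token) - len(stem)

    # Exclamation-point suffixes to try: all of them, then one, then none.
    suffixes = []
    for s in ("!" * bangs, "!" if bangs else "", ""):
        if s not in suffixes:
            suffixes.append(s)

    # Case variants: as written, then (only if the token has capitals)
    # first letter capitalized and all lowercase.
    cases = [stem]
    if stem != stem.lower():
        for c in (stem.capitalize(), stem.lower()):
            if c not in cases:
                cases.append(c)

    variants = []
    for s in suffixes:
        for c in cases:
            v = c + s
            if v != token and v not in variants:
                variants.append(v)
    return variants


print(degenerate("FREE!!!"))
# ['Free!!!', 'free!!!', 'FREE!', 'Free!', 'free!', 'FREE', 'Free', 'free']
```

A token that is already lowercase and has no exclamation points yields an empty list, so nothing extra is looked up for it.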
To get back to bogofilter, I've added degeneration capabilities. I've also
provided options to turn them on or off, since the effort to match an
unknown token can be time-consuming. Here are the new command-line flags
for degeneration:
-PD - disable degeneration (default)
-Pd - enable degeneration
-Pf - enable first match (default)
-PF - enable best indicator
To explain a bit more...
If degeneration is disabled, bogofilter will operate as it does now. When
degeneration is enabled and bogofilter encounters a new, unknown token
containing capital letters and/or exclamation points, the degeneration code
will kick in and additional tokens will be generated and looked for (in the
wordlist). Which token is actually used for the match depends on the
"first match" / "best indicator" setting.
If "first match" is enabled, bogofilter will stop searching when a match is
found. The spam/ham counts for the found token will be used in
scoring. If "best indicator" is enabled, the wordlist is checked for all
the alternatives and bogofilter will use the one with the most extreme
score, i.e. the one whose score is furthest from EVEN_ODDS (0.5).
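The difference between the two settings can be sketched roughly as follows. This is only an illustration under stated assumptions: the wordlist is modeled as a plain dict of (spam, ham) counts, the names first_match(), best_indicator(), and spamicity() are hypothetical, and the score is a simple spam fraction standing in for bogofilter's real calculation.

```python
EVEN_ODDS = 0.5


def spamicity(spam, ham):
    # Placeholder score (assumption): fraction of sightings that were spam.
    # Assumes the entry has at least one sighting, so spam + ham > 0.
    return spam / (spam + ham)


def first_match(candidates, wordlist):
    """Return the first candidate present in the wordlist, or None."""
    for tok in candidates:
        if tok in wordlist:
            return tok
    return None


def best_indicator(candidates, wordlist):
    """Return the known candidate whose score is furthest from EVEN_ODDS."""
    found = [t for t in candidates if t in wordlist]
    if not found:
        return None
    return max(found, key=lambda t: abs(spamicity(*wordlist[t]) - EVEN_ODDS))


# Made-up counts: token -> (spam count, ham count)
wordlist = {"Free": (6, 4), "free": (19, 1)}
candidates = ["Free", "free"]  # alternatives for the unknown token "FREE"

print(first_match(candidates, wordlist))     # Free
print(best_indicator(candidates, wordlist))  # free
```

With these made-up counts, "first match" stops at "Free" (score 0.6), while "best indicator" checks every alternative and picks "free" (score 0.95), at the cost of more lookups.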
I'm running some tests on the effectiveness of "first match" and "best
indicator". I'll report those results when I have them.
Anyhow, the code is in CVS for those of you who like to live on the cutting
edge. Tester feedback is wanted!
The code will be included in bogofilter 0.14.1 later this week.
David