[PATCH] Better tagging.
relson at osagesoftware.com
Sat Sep 13 08:25:22 EDT 2003
Bogofilter's development is an ongoing process. The set of headers
selected for tagging is based on Paul Graham's "Better Bayesian
Filtering" article and is not cast in stone.
As you know, there's been some recent work to exclude tokens likely to
be unique, in particular message IDs. That's also why Delivery-Date:,
Resent-Message-ID:, In-Reply-To:, and References: get special treatment.
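
A minimal sketch of the idea above, not bogofilter's actual code. It assumes "special treatment" means dropping the header's tokens entirely. The names header_rule(), header_rule_t, and same_header() are hypothetical, and the set of tagged headers is only illustrative:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical classification of a header line's tokens. */
typedef enum { RULE_PLAIN, RULE_TAG, RULE_SKIP } header_rule_t;

/* Case-insensitive comparison of two header names (no trailing colon). */
static int same_header(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/* Decide how tokens from the named header are handled. */
header_rule_t header_rule(const char *name)
{
    /* Headers whose contents are likely to be unique per message. */
    static const char *const skip[] = {
        "Message-ID", "Delivery-Date", "Resent-Message-ID",
        "In-Reply-To", "References"
    };
    /* Illustrative set of headers whose tokens get a tag prefix. */
    static const char *const tag[] = {
        "To", "From", "Subject", "Return-Path"
    };
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof skip / sizeof skip[0]; i++)
        if (same_header(name, skip[i]))
            return RULE_SKIP;
    for (i = 0; i < sizeof tag / sizeof tag[0]; i++)
        if (same_header(name, tag[i]))
            return RULE_TAG;
    return RULE_PLAIN;
}
```

With tables like these, comparing rule sets is mostly a matter of swapping the two arrays and rerunning the tests.
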
Seems like we now have three sets of rules:
1 - original
2 - specially treated (as described above)
3 - proposed changes.
Looks like I'll have to run some tests to measure the effectiveness of
the different rule sets.
Using "char *" instead of "word_t *" is pretty painless. It does make
the parser API less uniform, which is a bad thing. Given that set_tag()'s
parameter isn't used in other routines, it should be acceptable.
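
To illustrate the trade-off, here is a rough sketch assuming a counted-string word_t. The struct layout, set_tag_word(), current_tag(), and the buffer size are all hypothetical; the real definitions may well differ:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counted-string type standing in for word_t. */
typedef struct {
    size_t leng;
    const char *text;
} word_t;

static char tag_buf[32];   /* currently active tag, NUL-terminated */

/* Uniform variant: takes a word_t like the rest of the parser API. */
void set_tag_word(const word_t *w)
{
    size_t n;

    assert(w != NULL && w->text != NULL);
    /* Truncate to fit the buffer, leaving room for the terminator. */
    n = w->leng < sizeof tag_buf - 1 ? w->leng : sizeof tag_buf - 1;
    memcpy(tag_buf, w->text, n);
    tag_buf[n] = '\0';
}

/* Convenience variant: callers pass a plain C string. */
void set_tag(const char *s)
{
    word_t w;

    assert(s != NULL);
    w.leng = strlen(s);
    w.text = s;
    set_tag_word(&w);
}

/* Read back the active tag. */
const char *current_tag(void)
{
    return tag_buf;
}
```

A caller with a string literal can write set_tag("subj:") directly, while the word_t variant needs a struct built first. That convenience is what the loss of uniformity buys.
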
Explicitly returning NONE (or perhaps EOF or EOM or something) rather
than 0 is good. I recently made some similar changes, converting -1's
to EOF, which is accurate and more informative.
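
A small sketch of what named return values look like in practice. The token_t enum, its values, and both functions are made up for illustration and are not taken from the bogofilter sources:

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>   /* EOF */

/* Hypothetical token classes; NONE marks "no more tokens". */
typedef enum { NONE = 0, TOKEN = 1 } token_t;

/* Advance *cursor past the next whitespace-delimited word.
 * Returns TOKEN if a word was consumed, NONE when input is exhausted. */
token_t next_token(const char **cursor)
{
    const char *p;

    assert(cursor != NULL && *cursor != NULL);
    p = *cursor;
    while (*p != '\0' && isspace((unsigned char)*p))
        p++;
    if (*p == '\0') {
        *cursor = p;
        return NONE;        /* reads better than "return 0" */
    }
    while (*p != '\0' && !isspace((unsigned char)*p))
        p++;
    *cursor = p;
    return TOKEN;
}

/* Return the next byte of the string, or EOF at its end. */
int next_byte(const char **cursor)
{
    assert(cursor != NULL && *cursor != NULL);
    if (**cursor == '\0')
        return EOF;         /* instead of a bare -1 */
    return (unsigned char)*(*cursor)++;
}
```

The behavior is the same as with the bare numbers, but a reader no longer has to guess what 0 or -1 means at each return.
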
Thanks for the PATCH. You should see some of it in the next release.
Gotta go now - Saturday morning familial duties :-)
David Relson Osage Software Systems, Inc.
relson at osagesoftware.com Ann Arbor, MI 48103
www.osagesoftware.com tel: 734.821.8800