Test with different lexers

Tom Anderson tanderso at oac-design.com
Tue Dec 2 15:39:05 CET 2003


On Tue, 2003-12-02 at 08:19, Boris 'pi' Piwinger wrote:
> IIRC it was Tom who gave a strong opinion why we should
> really just let the statistics work and not intervene with
> special rules. If you want to try, just replace the lexer
> file and compile. It would be great if other people would
> repeat the test.

Pi, yes, I originally raised the objection to increasingly numerous and
mostly immaterial rules creeping into bogofilter.  Unfortunately, my job
leaves me zero time right now to actually perform any testing of my own
(just writing emails to the list stretches my free time!).  I still
hope to resume development of bfproxy sometime soon too.  However, I
greatly appreciate and applaud your efforts in tackling this issue,
which I believe is quite important.

It is our basic assumption in bogofilter that the Bayesian method
applies well to filtering spam, and that rules in general are inflexible
and require constant testing and modification and still only apply to
the 85th percentile, if lucky.  They are therefore unacceptable or
insufficient for spam-filtering for most users.  However, rules continue
to creep into bogofilter with good but misguided intentions (as you say,
to be more clever than the statistics).  We should demand very strong
evidence that a rule greatly improves scoring and/or efficiency for
everyone, no matter what, or is required due to an inherent property of
email structure or protocols before incorporating it into bogofilter.  I
don't think merely "useful" is reason enough, as a small incremental
improvement here and there can quickly reverse as corpora change.
Rule-based filters like SpamAssassin treat lots of rules as "useful",
yet those rules aren't (and shouldn't be!) included in bogofilter.
When in doubt, let Bayes figure it out.

If you want a bunch of rules, do a separate pre-filter.  In fact, I move
that all filtering rules currently in bogofilter be moved to just such a
pre-filter.  This way the core Bayesian functionality of bogofilter may
remain pure, and auxiliary "useful" processing can remain unentangled.
This modular approach, I believe, would lend itself to a much better
distinction between types of filtering and the intention of each.  And
this way, anyone unimpressed with rule-based filters (including the
tangled mess of procmail recipes many are familiar with) may simply not
execute that portion of the process and instead accept all tokens
unaltered for scoring with the Bayesian method.
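
A minimal sketch of the split proposed above, in Python, might look
like the following.  Everything in it is an assumption made for
illustration: the names tokenize, drop_short_tokens, score and classify
are hypothetical, the whitespace tokenizer stands in for a real lexer,
and the scorer uses the plain naive Bayes combining formula rather than
whatever bogofilter actually computes.  The point is only that the
rule-based stage is an optional list of token filters, so leaving the
list empty passes all tokens unaltered to the statistical stage.

```python
from typing import Callable, Dict, Iterable, List, Sequence

Token = str
# A pre-filter is any function that maps a token list to a token list.
PreFilter = Callable[[List[Token]], List[Token]]


def tokenize(text: str) -> List[Token]:
    """Lowercase and split on whitespace; a stand-in for a real lexer."""
    return text.lower().split()


def drop_short_tokens(tokens: List[Token]) -> List[Token]:
    """Hypothetical rule: discard tokens shorter than three characters."""
    return [t for t in tokens if len(t) >= 3]


def score(tokens: Iterable[Token], spam_prob: Dict[Token, float],
          unknown: float = 0.5) -> float:
    """Combine per-token spam probabilities with the naive Bayes formula.

    Assumes every probability lies strictly between 0 and 1.
    """
    p_spam = 1.0
    p_ham = 1.0
    for t in tokens:
        p = spam_prob.get(t, unknown)  # unseen tokens are treated as neutral
        p_spam *= p
        p_ham *= 1.0 - p
    return p_spam / (p_spam + p_ham)


def classify(text: str, spam_prob: Dict[Token, float],
             prefilters: Sequence[PreFilter] = ()) -> float:
    """Run the optional rule stage, then the statistical stage."""
    tokens = tokenize(text)
    for rule in prefilters:  # an empty sequence leaves the tokens unaltered
        tokens = rule(tokens)
    return score(tokens, spam_prob)


if __name__ == "__main__":
    # Made-up per-token probabilities, for illustration only.
    probs = {"viagra": 0.9, "meeting": 0.1, "to": 0.8}
    print(classify("Viagra to meeting", probs))
    print(classify("Viagra to meeting", probs, [drop_short_tokens]))
```

With this layout, anyone who distrusts the rules simply calls classify
without any prefilters, and the Bayesian part never needs to know
whether a rule stage ran.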

Tom
