Bogofilter accuracy plummets starting around March 10, 2010
Jonathan Kamens
jik at kamens.brookline.ma.us
Sun Apr 4 15:56:00 CEST 2010
On 04/04/2010 09:37 AM, Thomas Anderson wrote:
> That seems OK in regard to 0.5-min_dev < robx < 0.5+min_dev <
> spam_cutoff. However, your cutoffs seem way too high to me.
I am fairly certain that bogotune is picking the optimal cutoffs for the
spam and ham I receive; that is, after all, the whole point of it, is it
not?
> Spams vary far too much to be constrained to only 0.4% of scoring.
And yet, until March 10, tolerances that narrow were catching over 98%
of the spam being sent to me. To corroborate this, here is what my
.bogofilter.cf looked like on March 1 (restored from backup), more than
a week before I started having this problem, when bogofilter was still
working great:
db_cachesize=43
robs=0.0100
min_dev=0.394
robx=0.600000
sp_esf=0.154134
ns_esf=0.003662
spam_cutoff=0.900726 # for 0.05% fp (3); expect 0.42% fn (27).
ham_cutoff=0.450
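For anyone who wants to sanity-check these numbers, here is a minimal sketch in Python (not bogofilter code) that tests the ordering Thomas mentioned, 0.5-min_dev < robx < 0.5+min_dev < spam_cutoff, against the March 1 values. The classify() helper is hypothetical and assumes the usual three-way rule: a score at or above spam_cutoff is spam, at or below ham_cutoff is ham, and anything in between is unsure.

```python
# Values copied from the March 1 .bogofilter.cf shown above.
MIN_DEV = 0.394
ROBX = 0.600000
SPAM_CUTOFF = 0.900726
HAM_CUTOFF = 0.450


def ordering_holds(min_dev, robx, spam_cutoff):
    """Check 0.5-min_dev < robx < 0.5+min_dev < spam_cutoff."""
    return 0.5 - min_dev < robx < 0.5 + min_dev < spam_cutoff


def classify(score, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Hypothetical helper; assumes the three-way rule described above."""
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"


if __name__ == "__main__":
    print("ordering holds:", ordering_holds(MIN_DEV, ROBX, SPAM_CUTOFF))
    # Illustrative scores only, not taken from real mail.
    for score in (0.20, 0.70, 0.95):
        print(score, "->", classify(score))
```

With these values the band excluded by min_dev runs from 0.106 to 0.894, so robx sits inside it and spam_cutoff sits just above its upper edge.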
> It
> seems like you will naturally get lots of unsures and false negatives
> with those numbers.
And yet, I wasn't.
> If they're not in the RFC and you don't need them for your particular
> setup (you can include/exclude any headers you like), then why would you
> need them? In any event, that's configurable. Also, why would you want
> a received line which is clearly forged?
I don't believe in throwing away data. I've seen way too many cases of
people saying, "There's no harm in throwing that away, what could we
possibly need it for?" only to discover, too late, that it was, in fact,
needed for something.
As just one example, I see that spamitarium doesn't preserve X-Face
headers, nor does it preserve standard mailing-list fields. I don't
really want to spend weeks or months discovering by dribs and drabs the
other headers I wish it preserved that it doesn't. If it were easy for
me to use spamitarium in my milter setup, then I might consider spending
the time to play this game, but considering that I'd have to do this in
addition to figuring out how to revamp my whole setup to accommodate
spamitarium, I'm not too keen on the idea.
Not to mention the fact that I'm reluctant to leap to the conclusion
that this is the only way to solve my problem when, as I mentioned,
until less than a month ago bogofilter by itself was filtering my spam
with >98% accuracy.
As for Received lines, I have quite a bit of experience parsing them,
and I am frankly distrustful of spamitarium's ability to determine with
100% accuracy whether any particular Received line is valid. The result
of a mistake in this area is an inability to identify the real source of
the spam, e.g., so that it can be SpamCopped.
> I also use a milter, but with more lenient tolerances because I don't
> want to reject at SMTP any false positives at all. I run spamitarium
> and bogofilter again (via Procmail) on anything which gets through the
> milter. In the future, I'll build a milter which includes spamitarium,
> so that'll become easier.
I will be happy to reconsider my reluctance to use spamitarium-like
functionality when you have written such a milter. Let me know ;-).
jik