Bogofilter accuracy plummets starting around March 10, 2010

Sun Apr 4 15:56:00 CEST 2010

On 04/04/2010 09:37 AM, Thomas Anderson wrote:
> That seems OK in regard to 0.5-min_dev < robx < 0.5+min_dev <
> spam_cutoff.  However, your cutoffs seem way too high to me.
I am fairly certain that bogotune is picking the optimal cutoffs for the 
spam and ham I receive; that is, after all, the whole point of it, is it 
not?
> Spams vary far too much to be constrained to only 0.4% of scoring.
And yet, until March 10, tolerances that narrow were catching over 98% 
of the spam being sent to me.  To corroborate this, here is what my 
.bogofilter.cf looked like on March 1 (restored from backup), more than 
a week before I started having this problem, when bogofilter was still 
working great:

db_cachesize=43
robs=0.0100
min_dev=0.394
robx=0.600000
sp_esf=0.154134
ns_esf=0.003662
spam_cutoff=0.900726    # for 0.05% fp (3); expect 0.42% fn (27).
ham_cutoff=0.450
> It
> seems like you will naturally get lots of unsures and false negatives
> with those numbers.
And yet, I wasn't.
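
As a rough check of the ordering discussed above (0.5-min_dev < robx < 0.5+min_dev < spam_cutoff), here is a minimal Python sketch run against the March 1 settings.  The helper names parse_cf and ordering_holds are hypothetical and not part of bogofilter, and the parser assumes only simple key=value lines with optional # comments.

```python
# Sketch only: parse_cf and ordering_holds are illustrative helpers,
# not bogofilter or bogotune functionality.

MARCH_1_CF = """\
db_cachesize=43
robs=0.0100
min_dev=0.394
robx=0.600000
sp_esf=0.154134
ns_esf=0.003662
spam_cutoff=0.900726    # for 0.05% fp (3); expect 0.42% fn (27).
ham_cutoff=0.450
"""


def parse_cf(text):
    """Parse key=value lines into a dict of floats, dropping # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = float(value)
    return settings


def ordering_holds(s):
    """Check 0.5-min_dev < robx < 0.5+min_dev < spam_cutoff."""
    low = 0.5 - s["min_dev"]
    high = 0.5 + s["min_dev"]
    return low < s["robx"] < high < s["spam_cutoff"]


settings = parse_cf(MARCH_1_CF)
print(ordering_holds(settings))  # prints True
```

With these values the bounds work out to roughly 0.106 < 0.6 < 0.894 < 0.900726, so the ordering holds for the March 1 configuration.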
> If they're not in the RFC and you don't need them for your particular
> setup (you can include/exclude any headers you like), then why would you
> need them?  In any event, that's configurable.  Also, why would you want
> a received line which is clearly forged?
I don't believe in throwing away data.  I've seen way too many cases of 
people saying, "There's no harm in throwing that away, what could we 
possibly need it for?" only to discover, too late, that it was, in fact, 
needed for something.

As just one example, I see that spamitarium doesn't preserve X-Face 
headers, nor does it preserve standard mailing-list fields.  I don't 
really want to spend weeks or months discovering, by dribs and drabs, 
which other headers I want that it drops.  If it were easy for 
me to use spamitarium in my milter setup, then I might consider spending 
the time to play this game, but considering that I'd have to do this in 
addition to figuring out how to revamp my whole setup to accommodate 
spamitarium, I'm not too keen on the idea.

Not to mention the fact that I'm reluctant to leap to the conclusion 
that this is the only way to solve my problem when, as I mentioned, 
until less than a month ago bogofilter by itself was filtering my spam 
with >98% accuracy.

As for Received lines, I have quite a bit of experience parsing them, 
and I am frankly distrustful of spamitarium's ability to determine with 
100% accuracy whether any particular Received line is valid.  The result 
of a mistake in this area is an inability to identify the real source of 
the spam, e.g., so that it can be SpamCopped.
> I also use a milter, but with more lenient tolerances because I don't
> want to reject at SMTP any false positives at all.  I run spamitarium
> and bogofilter again (via Procmail) on anything which gets through the
> milter.  In the future, I'll build a milter which includes spamitarium,
> so that'll become easier.
I will be happy to reconsider my reluctance to use spamitarium-like 
functionality when you have written such a milter.  Let me know ;-).

   jik