Bogofilter accuracy plummets starting around March 10, 2010

Mon Apr 5 23:20:55 CEST 2010

On 4/4/2010 9:56 AM, Jonathan Kamens wrote:
> I am fairly certain that bogotune is picking the optimal cutoffs for the
> spam and ham I receive; that is, after all, the whole point of it, is it
> not?

That is the point AFAIK, but I don't know how reliable it is.  I don't 
use it.  I cannot imagine it is better than managing them through 
experience over time.

> And yet, until March 10, tolerances that narrow were catching over 98%
> of the spam being sent to me.  To corroborate this, here is what my
> .bogofilter.cf looked like on March 1 (restored from backup), more than
> a week before I started having this problem, when bogofilter was still
> working great:

That's why you have to manage the configuration actively.  Things change.

>> It seems like you will naturally get lots of unsures and false negatives
>> with those numbers.
> And yet, I wasn't.

And yet, you are.  If everything is so grand, then why are you 
complaining about inaccuracy now?  I'm only trying to help.  I've been 
through periods like this where a particular kind of message seems like 
it's defeated the filter.  But lo and behold, if you manage your 
configuration and do training, everything is fine.  And in my 
experience, you need to have a larger range of spam scoring in order to 
handle the kinds of emails you're presently having trouble with.
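
To illustrate what a "larger range of spam scoring" means, here is a minimal Python sketch of a three-way cutoff rule. It assumes a score at or above the spam cutoff is spam, a score at or below the ham cutoff is ham, and anything in between is unsure. The function name and the cutoff values are hypothetical, chosen only for illustration; this is not bogofilter's code, and the numbers are not anyone's real configuration.

```python
def classify(score, ham_cutoff, spam_cutoff):
    """Three-way classification by cutoffs (illustrative rule, not bogofilter's code)."""
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"


# Hypothetical message score and cutoffs, for illustration only.
score = 0.97
narrow = classify(score, ham_cutoff=0.10, spam_cutoff=0.99)
wider = classify(score, ham_cutoff=0.10, spam_cutoff=0.95)
print(narrow, wider)
```

With the narrow spam range the message lands in unsure; lowering the spam cutoff moves the same score into spam.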

> I don't believe in throwing away data.  I've seen way too many cases of
> people saying, "There's no harm in throwing that away, what could we
> possibly need it for?" only to discover, too late, that it was, in fact,
> needed for something.

So if someone hacked into your computer and planted a rootkit, you'd be 
happy to leave things as is?  That's equivalent to what you're saying. 
Spammers are purposely putting false data in the user-invisible header 
of the email.  And it is not merely untrue: it's meant to mimic 
data that you would normally trust.  The solution is to purge it, 
leaving only the true footprints of the spammer (and perhaps a note that 
the action was necessary).  If someone isn't actually coming from AOL or 
using Microsoft Outlook Express, then why foul your database with this 
false info?  It is superfluous at best, or malicious at worst.  The only 
information which is relevant to spam scoring is user-visible parts and 
verified routing info.  No sophisticated spammer is going to put in the 
header an X-SPAMMER: True line.  So if the user can't see it, it's 
probably forged to look hammy for Bayesian filters.

> As just one example, I see that spamitarium doesn't preserve X-Face
> headers, nor does it preserve standard mailing-list fields.  I don't
> really want to spend weeks or months discovering by dribs and drabs the
> other headers I wish it preserved that it doesn't.  If it were easy for
> me to use spamitarium in my milter setup, then I might consider spending
> the time to play this game, but considering that I'd have to do this in
> addition to figuring out how to revamp my whole setup to accommodate
> spamitarium, I'm not too keen on the idea.

Everything is clearly stated in the documentation and user variables 
section.  No weeks or months needed.  If you want an X-Face header, add 
it either in the user variables or on the command line.  The mailing 
list fields are actually already keyed in for you... you just have to 
uncomment the line.

> Not to mention the fact that I'm reluctant to leap to the conclusion
> that this is the only way to solve my problem when, as I mentioned,
> until less than a month ago bogofilter by itself was filtering my spam
> with >98% accuracy.

When spammers purposely put a lot of random garbage (or not random, but 
things that are tested to be generally hammy) into spams, this is one of 
the best ways to eradicate false hammy tokens and add real spammy ones. 
I developed this program specifically for the kinds of problems that 
you're presently experiencing.  The key phrase in your above paragraph 
is "until less than a month ago..."  You are now experiencing the kind 
of situation for which spamitarium was developed and is suited to help 
you resolve.

> As for Received lines, I have quite a bit of experience parsing them,
> and I am frankly distrustful of spamitarium's ability to determine with
> 100% accuracy whether any particular Received line is valid.  The result
> of a mistake in this area is an inability to identify the real source of
> the spam, e.g., so that it can be SpamCopped.

It doesn't know 100% when it's tenuously valid, but it knows 100% when 
it's not valid.  Sometimes mail servers are misconfigured and the 
reverse DNS does not resolve to what the server claims to be (e.g. HELO 
will say mail.server.com while DNS will show bob.server.com).  In that 
case, spamitarium deems the line untrusted, as it should.  When a 
Received line is valid, its "received by" host should be the same as the 
"received from" host of the line above it.  This is simple enough to 
test programmatically.  You could determine its validity no better by 
manually parsing it.  And the worse mistake would be to send an 
untrustworthy address to SpamCop, which would be an undeserved 
denial-of-service against the forged address.
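
The "received by" versus "received from" test can be sketched in a few lines of Python. This is a rough illustration, not spamitarium's actual code: the regular expressions, the function names, and the sample hostnames are all assumptions, it does no DNS lookups, and it assumes that once one link in the chain fails, every line below it is untrusted as well.

```python
import re

# Simplified patterns; real Received lines are messier than this.
FROM_RE = re.compile(r"\bfrom\s+([A-Za-z0-9.-]+)", re.IGNORECASE)
BY_RE = re.compile(r"\bby\s+([A-Za-z0-9.-]+)", re.IGNORECASE)


def parse_received(line):
    """Return (from_host, by_host) for one Received value, lowercased, or None."""
    frm = FROM_RE.search(line)
    by = BY_RE.search(line)
    return (frm.group(1).lower() if frm else None,
            by.group(1).lower() if by else None)


def chain_trust(received_lines):
    """Flag each Received line (newest first, as in the header) as trusted or not.

    The top line was written by our own server, so it is taken as trusted.
    Every later line is trusted only if its "by" host equals the "from" host
    of the line above it and the line above it was itself trusted.
    """
    hops = [parse_received(line) for line in received_lines]
    trusted = []
    for i, (_frm, by) in enumerate(hops):
        if i == 0:
            trusted.append(True)
        else:
            above_from = hops[i - 1][0]
            trusted.append(trusted[i - 1] and by is not None and by == above_from)
    return trusted


# Hypothetical header values, newest first.
lines = [
    "from relay.example.net (relay.example.net [192.0.2.10]) by mx.example.org with ESMTP",
    "from bob.example.com (bob.example.com [198.51.100.7]) by relay.example.net with SMTP",
    "from aol.com (unknown [203.0.113.5]) by mail.forged.example with SMTP",
]
flags = chain_trust(lines)
print(flags)
```

In the sample, the third line claims to have been received by a host that never appears as the sender of the line above it, so it is the one flagged as untrusted.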

Also, in the end, spamitarium doesn't throw away any of the forged 
received line info... it merely prepends "untrusted-" to the front of 
each token.  This way, if it is merely a DNS snafu, if you train 
untrusted-bob.server.com as ham, it will still usefully record that in 
your wordlist.  If, however, you want to train untrusted-bob.server.com 
as spam, while bob.server.com is trained as ham, now you have that 
ability because you're using spamitarium.  Without spamitarium, you're 
training bob.server.com as both ham and spam depending on the email, 
which causes the token to be generally more neutral and may throw off 
classifications of some emails.  So information was ADDED, not removed, 
by spamitarium.
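
A toy Python sketch of why the prefix adds information: the counters below stand in for a wordlist and are not bogofilter's actual storage format, and the function names and training messages are made up for illustration.

```python
from collections import Counter


def tag_token(host, trusted):
    """Leave trusted hosts alone; prefix untrusted ones, as described above."""
    return host if trusted else "untrusted-" + host


def train(counts, messages):
    """Count one token per message under its label ("ham" or "spam")."""
    for label, host, trusted in messages:
        counts[label][tag_token(host, trusted)] += 1


# Hypothetical training set: two genuine mails from bob.server.com and one
# spam whose Received line forged that same host.
messages = [
    ("ham", "bob.server.com", True),
    ("ham", "bob.server.com", True),
    ("spam", "bob.server.com", False),
]
counts = {"ham": Counter(), "spam": Counter()}
train(counts, messages)
print(counts["ham"]["bob.server.com"], counts["spam"]["untrusted-bob.server.com"])
```

The genuine host keeps a clean ham record, and the forged use of it is counted under a separate token instead of dragging the original toward neutral.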

> I will be happy to reconsider my reluctance to use spamitarium-like
> functionality when you have written such a milter.  Let me know ;-).

I will alert the list.

Tom

P.S. I'm not selling anything.  It's no skin off my back if you fix your 
spam problem or not.  I'm just trying to help.