Bogofilter accuracy plummets starting around March 10, 2010

Sun Apr 4 15:37:12 CEST 2010

On Thu, 2010-04-01 at 10:04 -0400, Jonathan Kamens wrote:
> min_dev=0.394
> robx=0.595174
> spam_cutoff=0.996938    # for 0.05% fp (2); expect 6.31% fn (398).
> ham_cutoff=0.442

That looks OK with regard to 0.5-min_dev < robx < 0.5+min_dev <
spam_cutoff.  However, your cutoffs seem way too high to me.  You should
have ham_cutoff < 0.5-min_dev; otherwise scores that fall in that unsure
range are nonetheless classified as ham.  I don't use bogotune and I'm
not familiar with how it calculates these figures; mine have evolved
over time as I observed bogofilter's behavior and adjusted them by hand.
Here are mine:

min_dev=0.25
robx=0.65
spam_cutoff=0.75
ham_cutoff=0.12
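
The ordering described above can be checked mechanically.  What follows
is only a sketch in Python, not anything bogofilter or bogotune ships:
the function name check_thresholds is made up for illustration, and the
spam_cutoff test uses <= rather than < on the assumption that a cutoff
of exactly 0.5+min_dev is acceptable.

```python
def check_thresholds(min_dev, robx, spam_cutoff, ham_cutoff):
    """Return a list of problems; an empty list means the ordering holds.

    Hypothetical helper, not part of bogofilter.
    """
    lower = 0.5 - min_dev
    upper = 0.5 + min_dev
    problems = []
    if not (lower < robx < upper):
        problems.append(f"robx={robx} is outside ({lower:.3f}, {upper:.3f})")
    # Assumption: equality is allowed, i.e. spam_cutoff may sit at 0.5+min_dev.
    if not (upper <= spam_cutoff):
        problems.append(f"spam_cutoff={spam_cutoff} is below 0.5+min_dev={upper:.3f}")
    if not (ham_cutoff < lower):
        problems.append(f"ham_cutoff={ham_cutoff} is not below 0.5-min_dev={lower:.3f}")
    return problems


# The two parameter sets quoted in this message.
bogotune = dict(min_dev=0.394, robx=0.595174, spam_cutoff=0.996938, ham_cutoff=0.442)
manual = dict(min_dev=0.25, robx=0.65, spam_cutoff=0.75, ham_cutoff=0.12)

for name, params in (("bogotune", bogotune), ("manual", manual)):
    issues = check_thresholds(**params)
    print(name, "OK" if not issues else issues)
```

With the bogotune figures only the ham_cutoff check should complain; the
manual figures pass all three.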

Spams vary far too much to be squeezed into the top 0.3% or so of the
score range (1 - 0.996938).  With those numbers you will naturally get
lots of unsures and false negatives.  Hams tend to score much more
tightly, so they don't need as big a range.
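
To make the size of each band concrete, here is a small Python sketch
that computes how much of the 0..1 score range each verdict gets.  The
helper name score_bands is made up, and the widths say nothing about how
many messages land in each band, since scores are not spread evenly.

```python
def score_bands(spam_cutoff, ham_cutoff):
    """Width of the ham, unsure and spam bands on the 0..1 score scale.

    Hypothetical helper for illustration only.
    """
    return {
        "ham": ham_cutoff,
        "unsure": spam_cutoff - ham_cutoff,
        "spam": 1.0 - spam_cutoff,
    }


for label, (spam_cutoff, ham_cutoff) in {
    "bogotune": (0.996938, 0.442),
    "manual": (0.75, 0.12),
}.items():
    bands = score_bands(spam_cutoff, ham_cutoff)
    summary = ", ".join(f"{k}={v:.2%}" for k, v in bands.items())
    print(f"{label}: {summary}")
```

The spam band for the bogotune cutoff comes out at roughly 0.3% of the
scale, against 25% for a cutoff of 0.75.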

> I'm still seeing well over 100 messages of this variety classified as 
> ham or unsure every day, despite the fact that I actively retrain every 
> message as it comes in.

That doesn't surprise me given your cutoffs.  If nothing else, lower
your spam cutoff to 0.5+min_dev.
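
As a rough illustration of what that change does, the Python sketch
below classifies one made-up score under the current cutoff and under
0.5+min_dev.  The classify function is a simplification written for this
message (spam at or above spam_cutoff, ham at or below ham_cutoff,
unsure in between), not bogofilter's own code, and the score 0.95 is
invented.

```python
def classify(score, spam_cutoff, ham_cutoff):
    """Simplified three-way verdict; an assumption, not bogofilter's code."""
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"


min_dev = 0.394
current_cutoff = 0.996938
lowered_cutoff = 0.5 + min_dev  # about 0.894
ham_cutoff = 0.442

score = 0.95  # invented example score for a fairly spammy message
print("current:", classify(score, current_cutoff, ham_cutoff))
print("lowered:", classify(score, lowered_cutoff, ham_cutoff))
```

A message scoring 0.95 is only "unsure" with the current cutoff but
counts as spam once the cutoff drops to about 0.894.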

> I've never looked at spamitarium before, but I just took a look at it 
> now, and I'm quite uncomfortable with the idea of throwing away 
> non-standard headers.  On the one hand, I understand the argument that 
> when included in spam, these headers are likely intended to throw off 
> Bayesian filters, but on the other hand, I really don't like the idea of 
> discarding data in messages that turn out to be spam -- I want the 
> message that ends up in my inbox to be exactly what the sender intended 
> it to be.  I think it would be preferable to add a configuration option 
> to bogofilter to tell it to ignore headers with certain prefixes, and 
> then to have spamitarium add those prefixes to the headers it believes 
> should be ignored.

If they're not in the RFC and you don't need them for your particular
setup (you can include/exclude any headers you like), then why would you
need them?  In any event, that's configurable.  Also, why would you want
a Received line which is clearly forged?  I understand the concept of
training on whatever you get and letting the statistics do their job,
but spammers deliberately include lines chosen for their likely
hamminess.  They'll use an AOL or Gmail Received line because most of
your friends and family probably have similar Received lines.  So you
can train the hell out of them until they're more neutral, but that
doesn't make the spams much less hammy and it makes your hams more
spammy.  So stripping them out seems to be the best solution when
they're forged.
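
For a feel of what excluding headers amounts to, here is a minimal
Python sketch using the standard email module.  It is not spamitarium's
implementation and does nothing about spotting forged Received lines;
the KEEP allow-list and the function name strip_nonstandard_headers are
assumptions made up for this example.

```python
from email import message_from_string

# Hypothetical allow-list; a real setup would choose its own headers.
KEEP = {
    "from", "to", "cc", "subject", "date", "message-id",
    "received", "return-path", "reply-to", "in-reply-to",
    "references", "mime-version", "content-type",
    "content-transfer-encoding",
}


def strip_nonstandard_headers(raw, keep=KEEP):
    """Return the message text with every header not in `keep` removed."""
    msg = message_from_string(raw)
    # Copy the names first: deleting while iterating over msg.keys() is unsafe.
    for name in set(msg.keys()):
        if name.lower() not in keep:
            del msg[name]  # removes every occurrence of that header
    return msg.as_string()


raw = (
    "From: someone@example.com\n"
    "Subject: hi\n"
    "X-Friendly-Token: family vacation photos\n"
    "\n"
    "body text\n"
)
print(strip_nonstandard_headers(raw))
```

The cleaned text would be what gets scored; whether the original is
also what gets delivered is a separate choice.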

> In addition, since I use bogofilter in a milter rather than in my 
> delivery agent, it would be difficult for me to integrate spamitarium's 
> functionality into my incoming mail flow.  I'd have to (a) switch from 
> the milter to procmail, (b) write a milter for spamitarium, or (c) 
> reimplement spamitarum's functionality inside the milter.  The time and 
> energy necessary to do any of these are, alas, in short supply for me 
> right now.

I also use a milter, but with more lenient cutoffs, because I don't want
to reject any false positives at SMTP time.  Anything which gets through
the milter is run through spamitarium and bogofilter again (via
procmail).  In the future, I'll build a milter which includes
spamitarium, so that'll become easier.

Tom