Bogofilter accuracy plummets starting around March 10, 2010

Jonathan Kamens jik at kamens.brookline.ma.us
Thu Apr 1 16:04:45 CEST 2010


Thomas Anderson wrote:
> It's important to have 
> your robx between your ham and spam cutoffs and preferably within your 
> min_dev... this way new tokens are graded neutrally and don't 
> immediately affect classifications.

I would assume that bogotune would take care of this, wouldn't it?  Here 
are my current settings, generated by bogotune with current ham and spam 
corpora on March 22, i.e., almost two weeks after this problem started:

db_cachesize=78
robs=0.0100
min_dev=0.394
robx=0.595174
sp_esf=0.065025
ns_esf=0.017818
spam_cutoff=0.996938    # for 0.05% fp (2); expect 6.31% fn (398).
ham_cutoff=0.442
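
As a quick sanity check of the advice quoted above against these values, 
here is a minimal Python sketch.  It assumes that min_dev is measured as 
a distance from the neutral score 0.5, so "within min_dev" means 
|robx - 0.5| < min_dev; the helper names parse_settings and 
robx_is_neutral are made up for illustration and are not part of 
bogofilter or bogotune.

```python
# Values copied from the bogotune output quoted above.
SETTINGS = """
robs=0.0100
min_dev=0.394
robx=0.595174
spam_cutoff=0.996938    # for 0.05% fp (2); expect 6.31% fn (398).
ham_cutoff=0.442
"""


def parse_settings(text):
    """Parse key=value lines, dropping trailing '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = float(value)
    return values


def robx_is_neutral(cfg):
    """Return (between_cutoffs, inside_min_dev) for the given settings."""
    between = cfg["ham_cutoff"] < cfg["robx"] < cfg["spam_cutoff"]
    # Assumption: min_dev is a distance from the neutral score 0.5.
    inside = abs(cfg["robx"] - 0.5) < cfg["min_dev"]
    return between, inside


cfg = parse_settings(SETTINGS)
print(robx_is_neutral(cfg))  # (True, True)
```

Under that assumption, robx sits between the two cutoffs and well inside 
min_dev (0.095 from 0.5, against a min_dev of 0.394), so these settings 
already satisfy the suggested condition.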

I'm still seeing well over 100 messages of this variety classified as 
ham or unsure every day, even though I actively retrain on every 
message as it comes in.

> I also run spamitarium on my emails 
> prior to bogofilter, and this often tags the header with an SPF failure 
> or missing rDNS, which helps bump it into a spam classification.

I've never looked at spamitarium before, but I just took a look at it 
now, and I'm quite uncomfortable with the idea of throwing away 
non-standard headers.  On the one hand, I understand the argument that 
when included in spam, these headers are likely intended to throw off 
Bayesian filters, but on the other hand, I really don't like the idea of 
discarding data in messages that turn out to be spam -- I want the 
message that ends up in my inbox to be exactly what the sender intended 
it to be.  I think it would be preferable to add a configuration option 
to bogofilter to tell it to ignore headers with certain prefixes, and 
then to have spamitarium add those prefixes to the headers it believes 
should be ignored.
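
To make the idea concrete, here is a rough Python sketch of the kind of 
prefix-based filtering I have in mind.  Everything in it is 
hypothetical: bogofilter has no such option today as far as I know, and 
the "X-Ignore-" prefix and the function name headers_for_scoring are 
invented for illustration.  The point is only that the scorer skips the 
marked headers while the message itself is left untouched.

```python
from email import message_from_string

# Hypothetical prefix that a preprocessor would put on headers it distrusts.
IGNORED_PREFIXES = ("x-ignore-",)


def headers_for_scoring(raw_message, ignored_prefixes=IGNORED_PREFIXES):
    """Return the (name, value) header pairs a classifier should tokenize.

    The raw message is never modified here, so what gets delivered is
    exactly what was passed in.
    """
    msg = message_from_string(raw_message)
    return [
        (name, value)
        for name, value in msg.items()
        # str.startswith accepts a tuple of candidate prefixes.
        if not name.lower().startswith(ignored_prefixes)
    ]


RAW = (
    "From: sender@example.com\n"
    "Subject: hello\n"
    "X-Ignore-X-Random-Words: lorem ipsum dolor\n"
    "\n"
    "Body text.\n"
)

kept = headers_for_scoring(RAW)
print([name for name, _ in kept])  # ['From', 'Subject']
```

The marked header is still present in RAW after the call; it is only 
left out of the list handed to the scorer.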

In addition, since I use bogofilter in a milter rather than in my 
delivery agent, it would be difficult for me to integrate spamitarium's 
functionality into my incoming mail flow.  I'd have to (a) switch from 
the milter to procmail, (b) write a milter for spamitarium, or (c) 
reimplement spamitarium's functionality inside the milter.  The time and 
energy necessary to do any of these are, alas, in short supply for me 
right now.

Thanks,

  jik



