Bogofilter accuracy plummets starting around March 10, 2010

Tue Apr 6 02:19:51 CEST 2010

On 04/05/2010 05:20 PM, Thomas Anderson wrote:
> That is the point AFAIK, but I don't know how reliable it is.  I don't
> use it.  I cannot imagine it is better than managing them through
> experience over time.
>    
In fact, I believe it is just as good as, if not better than, anything 
you could do by hand.

It reads your entire ham and spam corpus and tries many, many different 
combinations of parameters to find the optimal settings to correctly 
identify the highest possible percentage of spam with the lowest 
possible false positive rate.  The amount of trial and error it does 
automatically in an hour is far more than a person could do in days or 
weeks.
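
To give a feel for that kind of trial and error, here is a toy sketch in
Python.  It is not how bogotune is implemented; it tunes a single
made-up cutoff over scores that have already been computed, and the
function name, arguments, and scoring scheme are all assumptions of
mine.  It tries every candidate cutoff and keeps the one that catches
the most spam without exceeding a false positive limit.

```python
def tune_cutoff(ham_scores, spam_scores, max_fp_rate=0.0, steps=100):
    """Try steps+1 evenly spaced cutoffs in [0, 1].

    Returns (cutoff, spam_caught_rate, fp_rate) for the cutoff that
    catches the most spam while keeping the false positive rate at or
    below max_fp_rate, or None if no cutoff qualifies.
    Hypothetical helper for illustration only.
    """
    best = None
    for i in range(steps + 1):
        cutoff = i / steps
        # A message is classified as spam when its score >= cutoff.
        fp_rate = sum(s >= cutoff for s in ham_scores) / len(ham_scores)
        caught = sum(s >= cutoff for s in spam_scores) / len(spam_scores)
        if fp_rate > max_fp_rate:
            continue  # too many good messages misclassified
        if best is None or caught > best[1]:
            best = (cutoff, caught, fp_rate)
    return best
```

A real tuner searches several parameters at once over the whole corpus,
which is exactly why it is tedious to do by hand.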
> So if someone hacked into your computer and planted a rootkit, you'd be
> happy to leave things as is?  That's equivalent to what you're saying.
No, it's not, really, because once I've identified a rootkit and what 
variant it is, I can identify exactly what I need to change to remove 
it, and I can remove just the rootkit without losing any other valid 
bits on my machine.
> Everything is clearly stated in the documentation and user variables
> section.  No weeks or months needed.  If you want an X-Face header, add
> it either in the user variables or on the command line.  The mailing
> list fields are actually already keyed in for you... you just have to
> uncomment the line.
>    
The point is that I don't believe I can anticipate, in advance, 
up-front, every single header I'd want to preserve.

In fact, I'd argue that it's /impossible/ for me to anticipate, in 
advance, up-front, every single header I'd want to preserve, because 
anyone can add any arbitrary "X-" header they want at any given point in 
time.

Perhaps I'm a bit old-fashioned.  I've been using email on the Internet 
for 23 years, and I remember a time when putting amusing and informative 
"X-" headers in one's email was far more common than it is nowadays.  
But people do still do it, and I like to see them occasionally, so I 
don't want to throw them away.

Not to mention the fact that it is /permitted/ by the RFCs to put 
functional "X-" headers into email messages, and some software 
packages actually use them for meaningful, useful information, such as 
tokens to prevent mail loops, thread identifiers, ticket identifiers 
for ticket-tracking systems, etc.

By the time I realize I've been throwing away a header that I wanted to 
save, it's too late to get it back.  It's gone forever in all the 
messages that came before.  This is why I say that I don't like throwing 
away information.
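
To make the keep-list problem concrete, here is a small Python sketch.
It has nothing to do with spamitarium's actual code; the function name
and the keep-list are hypothetical.  It simply reports which headers in
a message would be discarded by a filter that only preserves the headers
named up front.

```python
from email import message_from_string


def dropped_headers(raw_message, keep):
    """Return the header names in raw_message that are NOT in keep.

    Hypothetical helper: models a filter that preserves only the
    headers listed in keep (compared case-insensitively).
    """
    keep_lower = {name.lower() for name in keep}
    msg = message_from_string(raw_message)
    return [name for name in msg.keys() if name.lower() not in keep_lower]
```

Any header I did not think to list, such as a loop-prevention token or
a ticket identifier, shows up in the result, and in a real filter it
would already be gone.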
> And the worse mistake would be to send an
> untrustworthy address to SpamCop, which would be an undeserved
> denial-of-service against the forged address.
>    
You give SpamCop the whole header, and it figures out what to trust.  In 
my experience, it is quite conservative and never uses an invalid 
Received line, even if that sometimes means ignoring a valid one.  
However, if I were to start letting spamitarium muck with the headers 
and add extra tokens, I think there's a good chance they would confuse 
SpamCop's logic, and I'd no longer be able to report spam with SpamCop 
(I report any spam that gets through bogofilter to SpamCop automatically 
as part of my retraining process).

> Also, in the end, spamitarium doesn't throw away any of the forged
> received line info... it merely prepends "untrusted-" to the front of
> each token.
Thank you for the clarification.  That does make things better in my 
eyes, but it has the SpamCop problem described above.  It seems like if 
I started to use spamitarium, I would no longer be able to report spam 
to SpamCop.

Hmm, actually, I suppose I could un-spamitarium-ize the messages before 
submitting them to SpamCop.  Since all that happens automatically for 
me, it might be possible to do the reverse conversion.  I will have to 
think on this.
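
A first, untested idea of what the reverse conversion might look like,
in Python.  I am assuming that the only change is a literal "untrusted-"
prefix on tokens in the header section, and that the message uses plain
LF line endings; I have not checked either assumption against what
spamitarium really emits, and the function name is my own.

```python
import re

# Matches "untrusted-" only at the start of a token, i.e. not when it is
# preceded by a word character or a hyphen.
PREFIX = re.compile(r"(?<![\w-])untrusted-")


def strip_untrusted(message):
    """Remove the assumed "untrusted-" prefix from header tokens.

    Only the header section (everything before the first blank line)
    is touched; the body is passed through unchanged.
    Assumes LF line endings.
    """
    head, sep, body = message.partition("\n\n")
    return PREFIX.sub("", head) + sep + body
```

One obvious weakness is that this would also strip a genuine
"untrusted-" token that happened to appear in the original headers, so
it cannot be a perfect inverse.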
> P.S. I'm not selling anything.  It's no skin off my back if you fix your
> spam problem or not.  I'm just trying to help.
>    
Yes, I understand that, and I appreciate it.

Thanks,

   jik