What is spam? (was: [bogofilter] ESF and redundancy)

Tom Anderson tanderso at oac-design.com
Tue May 11 16:23:08 CEST 2004


From: "Boris 'pi' Piwinger" <3.14 at piology.org>
> I don't subscribe to this point of view. I need the filter
> to block unwanted mail of every kind. I do use a virus

I concur.  Bogofilter needs to be able to filter any kind of unwanted mail.  If I register all emails from the bogofilter mailing list as spam, then it should filter this list, no questions asked.  Virus spam is still spam: it is sent en masse to unwilling recipients with a payload, but instead of a marketing message, the payload is a virus (read this in the voice of Agent Smith ;).  It should be filterable.

> luckily, it works, and my error rate is in the magnitude of
> one in a thousand.

Good for you.  I wish I could get that.  I still get a virus spam every day or two, sometimes several in a day, usually classified as unsure.  Since I started using ASNs, it's been getting better.  It's hard for that one token to push the score over my cutoff, though, so these messages often still end up as unsure.  Here's the breakdown of one of them (use a fixed-width font):

X-Bogosity: Yes, tests=bogofilter, spamicity=0.468896, version=0.17.5

token                                 n     pgood      pbad        fw U
"mime:application"                 332  0.020702  0.000771  0.036174 +
"mime:Content-Disposition"         486  0.026857  0.001232  0.044046 +
"mime:attachment"                  305  0.016786  0.000775  0.044432 +
"mime:octet-stream"                284  0.015387  0.000729  0.045550 +
"mime:base64"                      401  0.016646  0.001182  0.066503 +
"document"                         853  0.028955  0.002708  0.085612 +
"mime:bit"                         937  0.030774  0.003006  0.089056 +
"attached"                        1284  0.039586  0.004196  0.095897 +
"rcvd:Mar"                        5390  0.138341  0.018448  0.117676 +
"mime:plain"                      2452  0.050497  0.008765  0.147932 +
"rcvd:216.109.145.120"            7179  0.133445  0.026094  0.163569 +
"mime:charset"                    1776  0.032172  0.006481  0.167695 +
"head:mixed"                      2957  0.049937  0.010899  0.179171 +
"format"                         19217  0.213037  0.074171  0.258251 +
"mime:text"                       4484  0.049098  0.017325  0.260838 +
"mime:Content-Type"               4458  0.047419  0.017266  0.266936 +
"mime:Content-Transfer-Encoding"  4384  0.045461  0.017015  0.272351 +
"subj:website"                      45  0.000420  0.000176  0.296278 +
"MIME"                           23202  0.168275  0.092217  0.354011 -
"multi-part"                     17427  0.118198  0.069510  0.370308 -
"This"                           72541  0.474752  0.289855  0.379091 -
"Your"                           38280  0.235557  0.153406  0.394397 -
"rcvd:oac-design.com"           217650  0.939712  0.884200  0.484782 -
"mime:Windows-1252"                166  0.000699  0.000675  0.491046 -
"head:Date"                     244940  1.000560  0.996772  0.499052 -
"rcvd:from"                     244922  0.998461  0.996760  0.499574 -
"rcvd:for"                      222991  0.892433  0.908005  0.504325 -
"message"                       118798  0.470695  0.483880  0.506906 -
"head:Message-Id"               133397  0.487341  0.544578  0.527733 -
"rcvd:tanderso"                 203455  0.692684  0.832099  0.545716 -
"rcvd:Wed"                       45157  0.151909  0.184740  0.548760 -
"head:Content-Type"             235179  0.781508  0.962420  0.551869 -
"to:oac-design.com"             223169  0.506085  0.920329  0.645205 -
"head:multipart"                129453  0.290670  0.533939  0.647506 -
"head:MIME-Version"             212537  0.467898  0.876906  0.652070 -
"to:tanderso"                   205168  0.337110  0.849935  0.716009 +
"subj:Your"                      17893  0.026997  0.074196  0.733212 +
"from:lovebreeze.com"                1  0.000000  0.000004  0.910000 +
"rtrn:lovebreeze.com"                1  0.000000  0.000004  0.910000 +
"mime:your_website.pif"              2  0.000000  0.000008  0.950909 +
"rcvd:24.7.114.120"                 10  0.000000  0.000042  0.989412 +
"rcvd:helo-oac-design.com"         177  0.000000  0.000742  0.999391 +
"rcvd:as6478"                      519  0.000000  0.002176  0.999792 +
N_P_Q_S_s_x_md                      26  7.20e-02  9.74e-03  4.69e-01
                                        2.00e-01  4.60e-01  0.200
                                                                             
As you can see, some tokens such as "document" and "attached" are hammy.  However, I doubt I've ever received a ham that said "Your document is attached."  And yet some variation of this (e.g. "Your file is attached") is seen in these virus spams all the time.  With a Markovian filter, the 3-4 token phrase would be exponentially more relevant than the individual tokens.
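To make the phrase idea concrete, here is a minimal Python sketch of how a phrase-based tokenizer could turn a line into weighted phrase tokens.  This is not bogofilter code: the function name phrase_tokens is hypothetical, and the 4^(k-1) weight for a k-word phrase is only an illustrative assumption standing in for "exponentially more relevant".

```python
def phrase_tokens(words, max_len=4):
    """Return (phrase, weight) pairs for every contiguous run of 1..max_len words.

    The weight 4 ** (length - 1) is an assumed, illustrative choice: longer
    phrases count exponentially more than the single words they contain.
    """
    out = []
    for start in range(len(words)):
        for length in range(1, max_len + 1):
            if start + length > len(words):
                break
            phrase = " ".join(words[start:start + length])
            out.append((phrase, 4 ** (length - 1)))
    return out


if __name__ == "__main__":
    for phrase, weight in phrase_tokens("Your document is attached".split()):
        print(f"{weight:3d}  {phrase}")
```

Under this sketch, the full phrase "Your document is attached" becomes a single token with far more weight than "document" or "attached" alone, so one registration as spam would not be diluted by the hammy single words.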

Also of note: even though I've stripped out the non-standard headers with spamitarium, it's still largely the "administrative" tokens that make this email seem hammy.  Dates are especially frustrating; I wish bogofilter would ignore them.  I would strip them with spamitarium if they weren't a required part of the spec and used extensively by email clients for sorting and such.  Removing "X-Priority", "X-MSMail-Priority", "ESMTP", etc., has helped a bit.  Adding "helo-oac-design.com" and "as6478" helped a lot.  Without spamitarium, this email scored 0.067239.  Nonetheless, even at 0.468896, it still gets classified as "unsure".  I need something more to overcome the hamminess of the "mime:" tokens.  Perhaps simply registering these messages exhaustively until all of those tokens become neutral is the answer.  However, the Markovian method is also tempting.
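To see why a few very strong spam tokens cannot outvote the many mildly hammy ones, here is a minimal Python sketch of the Robinson-Fisher combination.  It rests on assumptions of mine rather than anything stated above: that P and Q in the summary line are the chi-square tail probabilities of -2*sum(ln(1 - fw)) and -2*sum(ln(fw)) over the 26 tokens marked "+", and that the spamicity is (1 + Q - P) / 2.  The function names are hypothetical; the fw values are copied from the breakdown.

```python
import math

# fw values of the 26 tokens marked "+" in the breakdown above.
FW = [
    0.036174, 0.044046, 0.044432, 0.045550, 0.066503, 0.085612, 0.089056,
    0.095897, 0.117676, 0.147932, 0.163569, 0.167695, 0.179171, 0.258251,
    0.260838, 0.266936, 0.272351, 0.296278, 0.716009, 0.733212, 0.910000,
    0.910000, 0.950909, 0.989412, 0.999391, 0.999792,
]


def chi2_sf_even(x, df):
    """Upper-tail probability of a chi-square variable with even df."""
    m = x / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def fisher_spamicity(fw):
    """Combine per-token scores into (P, Q, S) under the assumptions above."""
    df = 2 * len(fw)
    p = chi2_sf_even(-2.0 * sum(math.log(1.0 - f) for f in fw), df)
    q = chi2_sf_even(-2.0 * sum(math.log(f) for f in fw), df)
    return p, q, (1.0 + q - p) / 2.0


if __name__ == "__main__":
    p, q, s = fisher_spamicity(FW)
    print(f"P={p:.3e}  Q={q:.3e}  S={s:.6f}")
```

With these inputs the result should land close to the P, Q and spamicity figures in the summary line.  Both tails come out small, meaning the evidence is strong in both directions, so the score sits near the middle.  That is consistent with (1 + 0.00974 - 0.0720) / 2, which is roughly 0.4689.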

Tom