Incorrigible spam

Tom Allison tallison at tacocat.net
Wed Apr 14 13:40:10 CEST 2004


Boris 'pi' Piwinger wrote:
> David Relson wrote:
> 
> 
>>>Successively registering these hams and spams until they each score
>>>correctly will polarize the difference while neutralizing the
>>>intersection.  This is precisely what we would want to achieve.
>>
>>pi has mentioned effects like that.  After a train-on-error pass,
>>additional passes will show "errors" that weren't in the original pass.
>>Adding tokens to the wordlist does affect previous scores.  In most
>>cases the effect is very, very small.
> 
> <snip>
> 
> Anyhow, corrections can and do change values for other
> messages; this can move messages over cutoffs.
> 

This is easy to imagine if you run a business as an oral surgeon.
For me, "oral" is 99.99999% spam.
For the surgeon, it's probably best to ignore this word entirely and
look at other tokens.  With enough corrections one way and then the
other, "oral" will eventually settle at a value near 0.5 instead of my 0.9999.
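To make the drift concrete, here is a minimal sketch of how a token's spam ratio moves toward 0.5 as ham containing it gets registered. The function name, the counts, and the fixed totals of 1000 messages each are all hypothetical, and this is the plain unsmoothed ratio, not necessarily what bogofilter computes internally.

```python
def spam_ratio(ham_hits, spam_hits, ham_total, spam_total):
    """Unsmoothed share of a token's frequency that comes from spam."""
    g = ham_hits / ham_total if ham_total else 0.0
    b = spam_hits / spam_total if spam_total else 0.0
    return b / (g + b) if (g + b) else 0.5  # unseen token: neutral

# Hypothetical counts: "oral" seen in 50 of 1000 spams, and the surgeon
# keeps registering hams that contain it (totals held fixed for simplicity).
drift = [spam_ratio(ham_hits, 50, 1000, 1000) for ham_hits in (0, 10, 25, 50)]
print([round(r, 3) for r in drift])
```

With no ham hits the token is pure spam evidence; once it shows up as often in ham as in spam, it carries no information either way.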

For example (taken from a false positive):
                                     n     pgood      pbad        fw
"and"                             4244  0.746088  0.539604  0.419720 +
<snip>
"from:aol.com"                     121  0.004941  0.036421  0.879390 +
"to:tacocat.net"                  2474  0.100467  0.745403  0.881170 +
"rcvd:64.12.137.9"                  10  0.000274  0.003182  0.905329 +
"rcvd:imo-m28.mx.aol.com"           10  0.000274  0.003182  0.905329 +
"head:sub"                         148  0.004666  0.046322  0.907442 +
"head:HTML_30_40"                  145  0.003843  0.046322  0.922282 +
"mime:html"                       1518  0.035959  0.490453  0.931580 +
"head:alternative"                1713  0.040077  0.554102  0.932454 +
"head:HTML_MESSAGE"               1404  0.031567  0.455799  0.935109 +
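As a rough check on the listing above, the sketch below assumes the three decimal columns are the token's ham frequency, its spam frequency, and the resulting token score (that reading is an assumption, as are the names used here). Under that assumption the score should be close to spam / (ham + spam) for tokens with large counts.

```python
# (ham frequency, spam frequency, listed score), copied from the rows above
rows = {
    "and": (0.746088, 0.539604, 0.419720),
    "mime:html": (0.035959, 0.490453, 0.931580),
    "head:alternative": (0.040077, 0.554102, 0.932454),
}

def raw_ratio(pgood, pbad):
    """Unsmoothed ratio; an assumed approximation of the listed score."""
    return pbad / (pgood + pbad)

diffs = {tok: abs(raw_ratio(pg, pb) - fw) for tok, (pg, pb, fw) in rows.items()}
for tok, d in diffs.items():
    print(f"{tok:20s} difference from listed value: {d:.6f}")
```

The rows with a count of only 10 deviate further (the raw ratio is about 0.92 against the listed 0.905), which would be consistent with rare tokens being pulled toward a prior. The smoothing parameters are not given in the post, so that part is not reproduced here.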


I don't get much HTML mail.
But if I did, I expect "mime:html" would be closer to the values for 
"and".  Similarly, the oral surgeon would probably end up with "oral" 
around 0.419 and "surgery" around 0.001.  The combination would put him
"back in business".


I tried running everything through a correct-to-exhaustion process and
found it had mixed results because my cutoffs were rather extreme
(0.10, 0.95).  Changing those to less extreme values works much better.

Correct-to-exhaustion helps a lot, but I would use it sparingly.
The '-u' option seems to help me more.  I'm still under 10K spam/ham each.
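A toy sketch of the correct-to-exhaustion loop, to show why the cutoffs matter. Everything here is hypothetical: the scoring is a plain average of per-token spam ratios (not bogofilter's actual method), and the four messages are made up. The loop re-registers every misclassified message until a full pass produces no errors or a pass limit is reached.

```python
from collections import Counter

def train_to_exhaustion(messages, ham_cutoff, spam_cutoff, max_passes=20):
    """messages: list of (tokens, is_spam). Returns (passes_run, registrations)."""
    good, bad = Counter(), Counter()  # token -> registered messages containing it
    n_good = n_bad = 0

    def score(tokens):
        # Toy score: mean per-token spam ratio; unseen tokens count as 0.5.
        vals = []
        for t in set(tokens):
            g = good[t] / n_good if n_good else 0.0
            b = bad[t] / n_bad if n_bad else 0.0
            vals.append(b / (g + b) if (g + b) else 0.5)
        return sum(vals) / len(vals) if vals else 0.5

    registrations = 0
    for passes in range(1, max_passes + 1):
        errors = 0
        for tokens, is_spam in messages:
            s = score(tokens)
            wrong = s < spam_cutoff if is_spam else s > ham_cutoff
            if wrong:  # train on error: register with the correct label
                if is_spam:
                    bad.update(set(tokens)); n_bad += 1
                else:
                    good.update(set(tokens)); n_good += 1
                registrations += 1
                errors += 1
        if errors == 0:
            return passes, registrations
    return max_passes, registrations

msgs = [
    (["oral", "pills", "cheap"], True),
    (["oral", "surgery", "appointment"], False),
    (["cheap", "pills", "offer"], True),
    (["surgery", "schedule", "appointment"], False),
]
moderate = train_to_exhaustion(msgs, ham_cutoff=0.30, spam_cutoff=0.70)
extreme = train_to_exhaustion(msgs, ham_cutoff=0.10, spam_cutoff=0.95)
print(moderate, extreme)
```

In this toy setup the moderate cutoffs settle after two registrations, while the extreme cutoffs never settle: the shared token "oral" hovers near 0.5 and keeps both messages containing it inside the unsure band, so they get re-registered on every pass until the limit.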




