Fwd: Is whitelisting possible?

Thomas Anderson tanderson at orderamidchaos.com
Mon Dec 6 20:21:08 CET 2010


I have two suggestions for you.  They're suggestions I've made many 
times to this list because the same basic questions keep coming up over 
and over again.  Usually the question is about spam slipping through 
into the ham or unsure range.  But the same principles apply when 
your hams fit the mold of an "encyclopedia spam" too, i.e. they contain 
many never- or little-seen words.  Without much concrete data to go on, 
bogofilter has no choice but to label them "unsure".

As David suggested, simply training many times will often resolve the 
issue eventually, but this can take a lot of time and patience and may 
never fully succeed.  So I have these further methods.

First of all, I recommend employing "spamitarium" in your procmail chain 
just before bogofilter:

http://orderamidchaos.com/bogofilter/spamitarium

It will process your email headers to remove superfluous lines which may 
contribute spammy or unsure tokens and it will perform look-ups to add 
additional tokens which will train as hammy.  E.g., spammers might 
pretend to be "amazon.com", but Spamitarium will do a reverse look-up on 
their IP and add tokens to differentiate the real amazon.com from fakes. 
This benefits you twofold: the fakes are more likely to be classified 
as spam, and the real ones are more likely to be classified as ham.
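
To illustrate the reverse look-up idea, here is a minimal Python sketch. 
It is not Spamitarium's actual code: the function name rdns_token, the 
"rdns:" token format, and the two-label domain shortcut are all 
assumptions made for this example.  It only shows how a sending IP 
could be turned into an extra token for bogofilter to learn from.

```python
import socket


def rdns_token(ip, resolve=socket.gethostbyaddr):
    """Return a token naming the domain that `ip` resolves back to.

    Hypothetical helper: the "rdns:" prefix is an illustrative choice,
    not a format taken from Spamitarium.
    """
    try:
        # gethostbyaddr returns (hostname, aliases, addresses)
        hostname = resolve(ip)[0]
    except OSError:
        # No PTR record or look-up failure: still emit a token so that
        # "no reverse DNS" itself becomes something the filter can learn.
        return "rdns:none"
    labels = hostname.rstrip(".").lower().split(".")
    # Naive assumption: the last two labels form the registered domain.
    # This is wrong for suffixes such as co.uk; it is only a sketch.
    return "rdns:" + ".".join(labels[-2:])
```

A message that claims to be from amazon.com but arrives from an IP 
whose token is something else then carries a token the real sender 
never produces, which is what lets training pull the two apart.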

Secondly, I recommend training to exhaustion.  That is, when a false 
positive, false negative, or unsure shows up, first you train it, then 
you check it again as if the same exact email arrived another time, and 
if it still doesn't classify correctly, train it again -- repeat until 
it classifies correctly.  This is similar to David's "just keep 
training" suggestion, but this is the impatient method -- it does all of 
the repetitive training all at once.

I added the "x" command-line option to my helper application "bfproxy" 
in order to automate this process:

http://orderamidchaos.com/bogofilter/bfproxy
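
For anyone who wants to see the idea without reading bfproxy, here is a 
rough Python sketch of training to exhaustion.  It is not bfproxy's 
code.  It assumes bogofilter is on your PATH, reads the message on 
stdin, exits with 0 for spam, 1 for ham, and 2 for unsure when 
classifying, and registers a message with -s (spam) or -n (ham).  The 
function names and the max_rounds cap are my own choices for this 
example.

```python
import subprocess

# bogofilter exit codes when classifying a message
EXIT_TO_LABEL = {0: "spam", 1: "ham", 2: "unsure"}
# bogofilter registration flags
TRAIN_FLAG = {"spam": "-s", "ham": "-n"}


def classify(message: bytes) -> str:
    """Ask bogofilter how it would classify the message right now."""
    result = subprocess.run(["bogofilter"], input=message)
    if result.returncode not in EXIT_TO_LABEL:
        raise RuntimeError(f"bogofilter exited with {result.returncode}")
    return EXIT_TO_LABEL[result.returncode]


def train(message: bytes, target: str) -> None:
    """Register the message once as spam or ham."""
    result = subprocess.run(["bogofilter", TRAIN_FLAG[target]], input=message)
    if result.returncode == 3:  # 3 signals an I/O or other error
        raise RuntimeError("bogofilter reported an error while training")


def train_to_exhaustion(message, target, classify_fn=classify,
                        train_fn=train, max_rounds=25):
    """Train and re-check until the message classifies as `target`.

    Returns the number of training passes used.  The cap is a safety
    net (an assumption of this sketch) so a stubborn message cannot
    loop forever.
    """
    if target not in TRAIN_FLAG:
        raise ValueError("target must be 'spam' or 'ham'")
    rounds = 0
    while classify_fn(message) != target:
        if rounds >= max_rounds:
            raise RuntimeError(f"still misclassified after {rounds} passes")
        train_fn(message, target)
        rounds += 1
    return rounds
```

For a message that is already misclassified, the first check fails, so 
this is the same train, check, repeat cycle described above.  You would 
call it as train_to_exhaustion(open("msg.eml", "rb").read(), "ham").  
Keep in mind that registering the same message many times inflates its 
token counts, which is the trade-off for the speed of this method.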

Using this exhaustive training method ensures that you will deal with as 
few similar emails as possible.  Combined with spamitarium's header 
processing, you're much more likely to have to classify email from a 
particular sender only once for bogofilter to get it right.

Procmail usage instructions are included in those files.

Tom


On 12/4/2010 2:09 PM, Anne Wilson wrote:
> On Saturday 04 December 2010 18:20:19 you wrote:
>> ---------- Forwarded message ----------
>> From: David Relson<relson at osagesoftware.com>
>> Date: 4 December 2010 14:08
>> Subject: Re: Is whitelisting possible?
>> To: Anne Wilson<cannewilson at googlemail.com>
>> Cc: bogofilter at bogofilter.org
>>
>>
>> On Sat, 4 Dec 2010 10:13:29 +0000
>>
>> Anne Wilson wrote:
>>> I subscribe to Magnatune, so I get frequent emails from them
>>> describing new releases.  Every time the email ends up in the Unsure
>>> folder.  Every time, I copy it to the ham training folder, but the
>>> content is so variable, I think, that Bogofilter is never able to
>>> classify it properly.  Is there any way that I can add weighting that
>>> would make this into definitely ham?  Currently Bogofilter marks them
>>> with anything from 1% to 49.9999% probability of being spam, with the
>>> majority being in the 47-49% range.
>>>
>>> Anne
>>
>> Hello Anne,
>>
>> 'Tis good to hear from one of my friends from my Mandrake days :->
>>
> :-D We go back a long way.  /me waves to Charles as well.
>
>> Patience is the key. I've seen the training process take a long
>> time.
>>
> I've been training on these for 18 months, so I think they must just be too
> confusing - perhaps not enough common vocabulary, and too high a proportion of
> URLs.
>
>> When the same junk mail comes to several users _I_ think it's spam,
>> though bogofilter may not recognize it as such.  So I train.  As more
>> copies of the same junk come in, I keep training.  Over time the score
>> increases and the classification changes from ham to unsure and
>> (eventually) to spam.
>>
>> At present I have 2 messages that have progressed from ham to unsure.
>> I'm keeping up the training knowing that eventually they'll progress to
>> spam.
>>
> Yes, I train regularly on batches, and most times it works well.
>
>> I also have email from Border's books that is classified as spam.
>> Training with that will eventually move it to ham.
>>
> I guess that might have similar characteristics to the Magnatune ones.  My
> regular Amazon emails are recognised without a problem but the format is quite
> different.
>
>> Be patient :->
>>
> My attempts at implementing the whitelisting broke procmail.  It seems that I
> don't have formail installed.  I'll look at that again soon, but for the
> moment I've just moved the Magnatune recipe to run before Bogofilter sorts
> them.
>
> Anne
>
>
>
> _______________________________________________
> Bogofilter mailing list
> Bogofilter at bogofilter.org
> http://www.bogofilter.org/mailman/listinfo/bogofilter



