Fwd: Is whitelisting possible?
Thomas Anderson
tanderson at orderamidchaos.com
Mon Dec 6 20:21:08 CET 2010
I have two suggestions for you. They're suggestions I've made many
times to this list because the same basic questions keep coming up over
and over again. Usually they concern spam getting past the filter into
the ham or unsure range. But the same principles apply when your hams
fit the mold of an "encyclopedia spam" too, i.e. they contain many
never- or little-seen words. Without much concrete data to go on,
bogofilter has no choice but to label them "unsure".
As David suggested, simply training many times will often resolve the
issue eventually, but this can take lots of time and patience and may
never totally succeed. So I suggest two further methods.
First of all, I recommend employing "spamitarium" in your procmail chain
just before bogofilter:
http://orderamidchaos.com/bogofilter/spamitarium
It will process your email headers to remove superfluous lines which may
contribute spammy or unsure tokens and it will perform look-ups to add
additional tokens which will train as hammy. E.g., spammers might
pretend to be "amazon.com", but Spamitarium will do a reverse look-up on
their IP and add tokens to differentiate the real amazon.com from fakes.
This benefits you two-fold: the fakes are more likely to be classified
as spam, and the real ones are more likely to be classified as ham.
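To make the reverse look-up idea concrete, here is a minimal Python
sketch. It is not Spamitarium's code: the function name, the
"X-Sender-Rdns" header, and the injectable "resolve" parameter are all
invented for illustration. It only shows how a relay IP can be turned
into an extra header whose tokens differ between the real sender and a
fake.

```python
import socket
from email.message import EmailMessage


def add_rdns_header(msg, relay_ip, resolve=socket.gethostbyaddr):
    """Add a header naming the host that relay_ip resolves back to.

    Hypothetical helper: the X-Sender-Rdns header is an invented name,
    not something Spamitarium or bogofilter defines.
    """
    try:
        hostname = resolve(relay_ip)[0]
    except OSError:
        # No PTR record, or the look-up failed; record that instead.
        hostname = "unknown"
    msg["X-Sender-Rdns"] = hostname
    return msg


if __name__ == "__main__":
    msg = EmailMessage()
    msg["From"] = "orders@amazon.com"
    msg.set_content("Your order has shipped.")

    # A stub resolver keeps the example offline; omit it to use real DNS.
    def fake_dns(ip):
        return ("mail.example.net", [], [ip])

    add_rdns_header(msg, "192.0.2.10", resolve=fake_dns)
    print(msg["X-Sender-Rdns"])
```

Once such a header is in place, bogofilter sees the resolved host name
as ordinary tokens, so training on a few real and a few fake messages
gives it something stable to tell them apart by.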
Secondly, I recommend training to exhaustion. That is, when a false
positive, false negative, or unsure shows up, first you train it, then
you check it again as if the same exact email arrived another time, and
if it still doesn't classify correctly, train it again -- repeat until
it classifies correctly. This is similar to David's "just keep
training" suggestion, but this is the impatient method -- it does all of
the repetitive training all at once.
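The train-check-repeat loop described above can be sketched in Python
roughly as follows. This is an illustration, not bfproxy's actual
implementation. It assumes bogofilter is on the PATH, relies on its -s
(register as spam) and -n (register as ham) options and its exit codes
as given in the man page, and the "max_rounds" cap is an arbitrary
safety limit of my own.

```python
import subprocess

# bogofilter exit codes per its man page: 0 = spam, 1 = ham, 2 = unsure.
EXIT_TO_LABEL = {0: "spam", 1: "ham", 2: "unsure"}


def bogofilter_classify(raw):
    """Classify raw message bytes by piping them to bogofilter."""
    code = subprocess.run(["bogofilter"], input=raw).returncode
    return EXIT_TO_LABEL.get(code, "error")


def bogofilter_train(raw, label):
    """Register the message as spam (-s) or ham (-n)."""
    flag = "-s" if label == "spam" else "-n"
    result = subprocess.run(["bogofilter", flag], input=raw)
    if result.returncode == 3:  # 3 = I/O or other error
        raise RuntimeError("bogofilter reported an error while training")


def train_to_exhaustion(raw, label, classify=bogofilter_classify,
                        train=bogofilter_train, max_rounds=25):
    """Train, re-check, and repeat until raw classifies as label.

    Returns the number of training passes used, or raises RuntimeError
    if max_rounds is reached first.
    """
    for rounds in range(1, max_rounds + 1):
        train(raw, label)
        if classify(raw) == label:
            return rounds
    raise RuntimeError("still misclassified after %d rounds" % max_rounds)
```

A call such as train_to_exhaustion(open("msg.eml", "rb").read(), "ham")
would keep registering the message as ham until bogofilter scores it as
ham. The classify and train arguments are injectable so the loop can be
tried out without touching a real wordlist.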
I added the "x" command-line option to my helper application "bfproxy"
in order to automate this process:
http://orderamidchaos.com/bogofilter/bfproxy
Using this exhaustive training method ensures that you will have to deal
with as few similar emails as possible. Combined with spamitarium's
header processing, you're much more likely to have to classify email
from a particular sender only once before bogofilter gets it right.
Procmail usage instructions are included in those files.
Tom
On 12/4/2010 2:09 PM, Anne Wilson wrote:
> On Saturday 04 December 2010 18:20:19 you wrote:
>> ---------- Forwarded message ----------
>> From: David Relson<relson at osagesoftware.com>
>> Date: 4 December 2010 14:08
>> Subject: Re: Is whitelisting possible?
>> To: Anne Wilson<cannewilson at googlemail.com>
>> Cc: bogofilter at bogofilter.org
>>
>>
>> On Sat, 4 Dec 2010 10:13:29 +0000
>>
>> Anne Wilson wrote:
>>> I subscribe to Magnatune, so I get frequent emails from them
>>> describing new releases. Every time the email ends up in the Unsure
>>> folder. Every time, I copy it to the ham training folder, but the
>>> content is so variable, I think, that Bogofilter is never able to
>>> classify it properly. Is there any way that I can add weighting that
>>> would make this into definitely ham? Currently Bogofilter marks them
>>> with anything from 1% to 49.9999% probability of being spam, with the
>>> majority being in the 47-49% range.
>>>
>>> Anne
>>
>> Hello Anne,
>>
>> 'Tis good to hear from one of my friends from my Mandrake days :->
>>
> :-D We go back a long way. /me waves to Charles as well.
>
>> Patience is the key. I've seen the training process take a long
>> time.
>>
> I've been training on these for 18 months, so I think they must just be too
> confusing - perhaps not enough common vocabulary, and too high a proportion of
> URLs.
>
>> When the same junk mail comes to several users _I_ think it's spam,
>> though bogofilter may not recognize it as such. So I train. As more
>> copies of the same junk come in, I keep training. Over time the score
>> increases and the classification changes from ham to unsure and
>> (eventually) to spam.
>>
>> At present I have 2 messages that have progressed from ham to unsure.
>> I'm keeping up the training knowing that eventually they'll progress to
>> spam.
>>
> Yes, I train regularly on batches, and most times it works well.
>
>> I also have email from Border's books that is classified as spam.
>> Training with that will eventually move it to ham.
>>
> I guess that might have similar characteristics to the Magnatune ones. My
> regular Amazon emails are recognised without a problem but the format is quite
> different.
>
>> Be patient :->
>>
> My attempts at implementing the whitelisting broke procmail. It seems that I
> don't have formail installed. I'll look at that again soon, but for the
> moment I've just moved the Magnatune recipe to run before Bogofilter sorts
> them.
>
> Anne