The Risk of Spam Complaints

Mon Oct 21 12:33:10 CEST 2002

Rob Hill wrote:
> On Mon, Oct 21, 2002 at 10:18:02AM +0200, Boris 'pi' Piwinger spake thusly:
> 
>>Hi!
>>
>>I just got a false positive. It was a spam complaint I wrote, of
>>course, including the original spam (quoted). I bcc'ed the address the
>>spam was delivered to.
>>
>>Now clearly that mail of mine contained all the bad words. So I had to
>>-N it. But then this makes the bad word better again. I don't have a 
>>solution to this, though.
>>
>>pi
>>
> 
> 
> I've been wondering about this too - especially with regards to
> postmaster email (which I'm bypassing for now). Often postmaster mail
> (which is completely legit) has the body of a spam email in it - should
> this be marked as spam or non-spam?
> 
> Also, another question if you will - most users don't have a 'bounce' or
> 'resend' facility on their mailer - so if they _forward_ spam to a
> bogofilter alias to have it marked as spam, the forwarded mail contains
> many headers that are legitimate, including the from address of the
> sender etc... (and Subject: Fwd: etc...).
> 
> Will these mails not 'corrupt' the corpus?
> 
> Has anyone come across this before?
> 
> Thanks,
> 
> Rob
> 
> 

(This is probably my first post and may be way out of line, but here it goes...)

I have been thinking about this, not in terms of what bogofilter can do 
about it, but in terms of what an email admin might be able to do about it.

I would think that this first problem is best handled through something like 
procmail, by skipping bogofilter for email that originates in the same domain 
as the destination (FROM = TO).  At least for now.  I suppose that bogofilter 
could be modified to do the same thing if it makes sense.
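
A rough sketch of that FROM = TO check, in Python purely for illustration (a
procmail recipe would do the same job).  The function names are made up here,
and the check trusts the From header, which a spammer can forge, so treat it
as a stopgap and not a real defence:

```python
from email import message_from_string
from email.utils import getaddresses, parseaddr


def domain_of(address):
    """Return the lower-cased domain part of an address, or '' if none."""
    _, addr = parseaddr(address)
    return addr.rpartition("@")[2].lower() if "@" in addr else ""


def is_local_mail(raw_message):
    """True when the From domain matches the domain of a To/Cc recipient."""
    msg = message_from_string(raw_message)
    sender = domain_of(msg.get("From", ""))
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    rcpt_domains = {
        addr.rpartition("@")[2].lower() for _, addr in recipients if "@" in addr
    }
    # Hypothetical policy: mail for which this is True skips bogofilter.
    return bool(sender) and sender in rcpt_domains
```

Since this only looks at headers, a Bcc'ed copy (as in pi's case) would not
match unless the envelope recipient were checked as well.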

As for your comment about users lacking a bounce facility: I thought of this 
right away as a huge problem with making corrections (-N, -S) to the data 
when someone is on a POP server email system.  By the time I read a message, 
it's already off the server, and if I forward it back to any address, the 
headers (and body) are corrupted.

The only idea I have on this right now is to create a record of messages and 
to hold a short archive (XX days).  If a message then appears as sent to 
another address, say spam at ... or nospam at ..., it would be pulled from 
the archive and processed.

But I'm not sure how to TAG these emails unless I add something like a unique 
key:  X-bogofilter-ID: xxxx  where xxxx equals what?  There are probably 500 
things wrong with this approach.  But it's all I can come up with for the 
time being.
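
One possible answer to "xxxx equals what?" is a hash of the message as it
arrived.  The Python sketch below assumes exactly that, plus a few more
things that are guesses: the archive is a plain directory with one file per
message, the raw message starts directly with its headers (no mbox "From "
line), and the X-bogofilter-ID line survives somewhere in the forwarded
copy, which many mailers will not guarantee.  All function names are
hypothetical.

```python
import hashlib
import re
import time
from pathlib import Path

ID_HEADER = "X-bogofilter-ID"  # hypothetical header name from the idea above
# Match the header even when the mailer quoted it with "> " prefixes.
ID_PATTERN = re.compile(
    r"^[> ]*X-bogofilter-ID:\s*([0-9a-f]{40})\s*$", re.I | re.M
)


def archive_message(raw, archive_dir):
    """Store the original bytes under their SHA-1 and return a tagged copy."""
    msg_id = hashlib.sha1(raw).hexdigest()
    Path(archive_dir).mkdir(parents=True, exist_ok=True)
    (Path(archive_dir) / msg_id).write_bytes(raw)
    tagged = f"{ID_HEADER}: {msg_id}\n".encode("ascii") + raw
    return tagged, msg_id


def find_archived(forwarded, archive_dir):
    """Return the archived original named in a forwarded copy, or None."""
    match = ID_PATTERN.search(forwarded.decode("ascii", "replace"))
    if not match:
        return None
    path = Path(archive_dir) / match.group(1).lower()
    return path.read_bytes() if path.exists() else None


def expire(archive_dir, max_age_days):
    """Delete archived messages older than max_age_days (the XX days)."""
    cutoff = time.time() - max_age_days * 86400
    for path in Path(archive_dir).iterdir():
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
```

Whatever find_archived() returns is the untouched original, so that is what
would be handed to bogofilter for the correction instead of the forwarded
copy.
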
One problem is the disk space needed to keep all these emails around.
One solution is to save only the bogofilter-relevant data, sort of a hash of 
words and the adjustments to be made for that email.  A Reader's Digest 
version of the email, if you will.  That cuts down on space.  Then when a 
correction is needed, you would only dump those data elements into the 
'corpus'.

But you would have to come up with a utility to feed raw data into the 
bogofilter db, and I personally don't know the format/structure to do that.  
Oh yeah, and I do Perl, not C, so I'm kind of lost...
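
To make the "digest" idea a bit more concrete, here is a Python sketch.  It
does not use bogofilter's real tokenizer or db format (unknown here, as said
above): the regex is a crude stand-in, the record is plain JSON, and the two
word lists are ordinary dicts of word -> count.  The function names are
invented for the example.

```python
import json
import re
from collections import Counter

# Crude stand-in tokenizer: runs of 3+ letters, digits or apostrophes.
TOKEN = re.compile(r"[a-z0-9']{3,}")


def digest(text):
    """Reduce a message to a compact JSON record of per-word counts."""
    counts = Counter(TOKEN.findall(text.lower()))
    return json.dumps(counts, sort_keys=True)


def apply_correction(record, wrong_list, right_list):
    """Move a digested message's counts from one word list to the other."""
    for word, n in json.loads(record).items():
        remaining = wrong_list.get(word, 0) - n
        if remaining > 0:
            wrong_list[word] = remaining
        else:
            # Never let a count go negative; drop the word instead.
            wrong_list.pop(word, None)
        right_list[word] = right_list.get(word, 0) + n
```

Only the record from digest() would need to be kept for the XX days, and a
later correction would call apply_correction() with the list the message was
wrongly counted in and the list it belongs in.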

But that's what I was thinking about this weekend.

-- 
	A new chef from India was fired a week after starting the job.  He
kept favoring curry.

For summary digest subscription: bogofilter-digest-subscribe at aotto.com