Spam in images

Wed Sep 6 20:19:51 CEST 2006

Eric Wood wrote:
> Umm.  The way I read you is that your promotion for people to use inline 
> images will yield a high spammy result.  Having bogofilter train on image 
> spam will cause poisoning and eventual false positives.

Only if this one feature of the email is allowed to contribute too much 
to the score.  Five or six MIME or header tokens indicating spamminess 
will probably, and justifiably, push spams to the right bogosity.  A 
legitimate ham, however, is unlikely to be swayed by these few tokens 
unless it consists only of images and has few other hammy tokens.  Given 
sufficient hammy tokens, it should classify correctly.  Nonetheless, the 
first newsletter you subscribe to that uses inline images may be 
classified unsure the first time you get it.  After that, these tokens 
will become less and less spammy.
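
To illustrate the point about a handful of spammy tokens versus plenty 
of hammy ones, here is a rough Python sketch of Fisher-style chi-square 
combining, the general approach that Robinson-style filters such as 
bogofilter are built on.  It is not bogofilter's code: the token 
probabilities are made-up numbers, and chi2q and combined_score are 
illustrative names only.

```python
import math

def chi2q(x2, dof):
    """Upper-tail chi-square probability; valid for an even dof only."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combined_score(probs):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    n = len(probs)
    spam_ev = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    ham_ev = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (spam_ev - ham_ev + 1.0) / 2.0

# Assumed values: six spammy MIME/header tokens from the inline images.
image_tokens = [0.95] * 6
image_only = image_tokens + [0.2] * 2    # almost nothing but images
newsletter = image_tokens + [0.2] * 30   # same images plus plenty of hammy text

print("image-only message:", combined_score(image_only))
print("newsletter with text:", combined_score(newsletter))
```

With these assumed numbers the image-only message lands near the spam 
end of the scale, while the newsletter with enough hammy text lands near 
the ham end, even though both carry the same six image tokens.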

> For example, one current problem I have is that my users have recently asked 
> for their eBay passwords to be changed.  eBay sends them an email 
> confirmation which the user never gets.  This is because bogofilter saw 
> it as spam.  This is because all the "phishing" scam emails look exactly 
> like an official eBay message, including all the verbiage and all but one 
> legit url.  Bogofilter trained (thanks to -u) on all these slightly varying 
> phishing emails.   Now, when the legit email comes in - it can't help but 
> classify it as spam.

Strangely enough, I get lots of eBay spams (most of which are filtered 
correctly, although some come out unsure), but my eBay hams always come 
through fine (I review for false positives, and there haven't been any). 
This may be because my prefilter adds the [SPAM-ADDRESS] token or 
[SCAM-ADDRESS] token to the spams.  By now, most of the regular eBay 
verbiage must be fairly neutral to only slightly hammy.  The header info 
and the URL flags push the spams over the edge.

> This "token poisoning" isn't a big deal for me yet.  And I don't want it to 
> get bad.  But I fear that if I train bogofilter on image spams,  I'll be 
> caught in a never ending cycle of re-training.   I rarely do any training.

I train exhaustively, which tends to prevent the never-ending cycles. 
That is, with bfproxy, the script trains on a misclassified email over 
and over again until it finally classifies correctly.  It's as if you 
had received the email many, many times and trained on it each time, but 
it's much faster and easier this way.  And this way, no token ever 
becomes so overbearing that it takes out great swaths of emails all on 
its own.  The token scores tend to normalize.
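
As a rough illustration of the train-until-correct idea, here is a toy 
Python sketch.  It is not bfproxy or bogofilter code: ToyFilter, its 
averaging score, the 0.5 cutoff, and the sample tokens are all 
assumptions made up for the example.

```python
from collections import Counter

class ToyFilter:
    """Tiny token-count classifier, for illustration only."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, tokens, is_spam):
        # Count each distinct token once per training pass.
        (self.spam if is_spam else self.ham).update(set(tokens))

    def score(self, tokens):
        # Average of smoothed per-token spam ratios; unseen tokens sit at 0.5.
        probs = []
        for tok in set(tokens):
            s, h = self.spam[tok], self.ham[tok]
            probs.append((s + 0.5) / (s + h + 1.0))
        return sum(probs) / len(probs) if probs else 0.5

def train_until_correct(flt, tokens, is_spam, cutoff=0.5, max_rounds=50):
    """Retrain on one message until it classifies correctly (or give up)."""
    rounds = 0
    while (flt.score(tokens) >= cutoff) != is_spam and rounds < max_rounds:
        flt.train(tokens, is_spam)
        rounds += 1
    return rounds

flt = ToyFilter()
phish = ["ebay", "password", "click"]
legit = ["ebay", "password", "confirm"]
for _ in range(5):                 # five phishing mails already trained as spam
    flt.train(phish, is_spam=True)

rounds = train_until_correct(flt, legit, is_spam=False)
print("rounds needed:", rounds)
print("legit score:", flt.score(legit), "phish score:", flt.score(phish))
```

In this toy setup the shared tokens drift toward neutral only as far as 
needed for the legitimate mail to pass, so the phishing mail still 
scores as spam on the strength of its other token.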

> Hey, I'm with you and I'm not arguing or anything.  I honestly do challenge 
> my own thought processes and am open to other ideas.

Good to hear.  I didn't mean to insult your filtering method.  I just 
wanted to point out what I saw as a potential problem.

Tom