Spam in images
Tom Anderson
tanderso at oac-design.com
Wed Sep 6 20:19:51 CEST 2006
Eric Wood wrote:
> Umm. The way I read you is that your promotion for people to use inline
> images will yield a high spammy result. Having bogofilter train on image
> spam will cause poisoning and eventual false positives.
Only if you give this one feature of the email too much weight in the
score. With five or six MIME or header tokens indicating spamminess, it
is probably justified that they push spams to the right bogosity.
However, a legitimate ham is unlikely to be swayed by these few tokens
unless it consists only of images and has few other hammy tokens. Given
sufficient hammy tokens, it should classify correctly. Nonetheless, the
first newsletter you subscribe to that uses inline images may be
classified as unsure the first time you get it. After that, these tokens
will become less and less spammy.
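
To make the arithmetic concrete, here is a rough sketch of Fisher-style
chi-square combining, the general technique bogofilter's scoring is based
on. It is not bogofilter's actual code, and the per-token probabilities
(0.95 for the image-related tokens, 0.2 for hammy tokens) and token counts
are numbers I made up for illustration.

```python
import math

def chi2q(x2, df):
    """Upper-tail probability of a chi-square value; df must be even."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(probs):
    """Fisher-style combination of per-token spam probabilities (0 < p < 1)."""
    n = len(probs)
    if n == 0:
        return 0.5  # no evidence either way
    spam_evidence = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    ham_evidence = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (spam_evidence - ham_evidence + 1.0) / 2.0

# Assumed values, for illustration only.
image_tokens = [0.95] * 6   # six spammy MIME/header tokens
few_hammy = [0.2] * 3       # an image-only mail with little else
many_hammy = [0.2] * 30     # a normal newsletter body

print(combine(image_tokens))               # image tokens alone
print(combine(image_tokens + few_hammy))   # still leans spammy
print(combine(image_tokens + many_hammy))  # hammy tokens outweigh them
```

With these assumed numbers, the six image tokens dominate only when there
is little other evidence. Thirty mildly hammy tokens pull the combined
score well below the midpoint.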
> For example, one current problem I have is that my users have recently asked
> for their eBay passwords to be changed. eBay sends them a confirmation
> email, which the user never gets, because bogofilter saw it as spam. That
> happens because the phishing scam emails look exactly like an official eBay
> message, including all the verbiage and all but one legit URL. Bogofilter
> trained (thanks to -u) on all these slightly varying phishing emails. Now,
> when the legit email comes in, it can't help but classify it as spam.
Strangely enough, I get lots of eBay spams (most of which are filtered
fine, although some are unsure), but my eBay hams always come through
fine (I review for false positives, and there haven't been any). This
may be because my prefilter adds the [SPAM-ADDRESS] token or
[SCAM-ADDRESS] token to the spams. By now, most of the regular eBay
verbiage must be somewhere between neutral and slightly hammy. The
header info and the URL flags push the spams over the edge.
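
The prefilter itself is not shown in this message, so the following is
only a guess at the general idea: scan the body for URL hosts and prepend
a marker token when one is on a blocklist. The blocklist contents, the
function name tag_message, and the choice to prepend the token to the
body are all hypothetical.

```python
import re

# Hypothetical blocklist; where the real prefilter gets its data is not stated.
KNOWN_SCAM_HOSTS = {"ebay-secure-login.example"}

# Capture the host part of http/https URLs.
URL_HOST = re.compile(r"https?://([^/\s:]+)", re.IGNORECASE)

def tag_message(body, scam_hosts=KNOWN_SCAM_HOSTS):
    """Prepend a [SCAM-ADDRESS] token if any URL host is blocklisted."""
    hosts = {h.lower() for h in URL_HOST.findall(body)}
    if hosts & set(scam_hosts):
        return "[SCAM-ADDRESS]\n" + body
    return body

print(tag_message("Update here: http://ebay-secure-login.example/update"))
print(tag_message("Your listing: http://www.ebay.com/itm/123"))
```

Because the marker only ever appears in mail that matched the blocklist,
the filter can learn it as a strongly spammy token while the shared eBay
wording stays close to neutral.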
> This "token poisoning" isn't a big deal for me yet. And I don't want it to
> get bad. But I fear that if I train bogofilter on image spams, I'll be
> caught in a never ending cycle of re-training. I rarely do any training.
I train exhaustively, which tends to prevent the never-ending cycles.
That is, with bfproxy, the script trains on an email over and over again
until it finally classifies correctly. It's as if you received the
email many, many times and trained on it each time, but it's much faster
and easier this way. And this way, no token ever becomes so overbearing
that it takes out great swaths of emails all by itself. They tend to
normalize.
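
The train-until-correct loop can be sketched as below. This is not
bfproxy's code. ToyFilter is a made-up stand-in for bogofilter that just
averages smoothed per-token ratios, and the cutoffs and max_rounds limit
are assumed values.

```python
from collections import Counter

class ToyFilter:
    """Made-up stand-in for a real filter: averages per-token spam ratios."""
    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, tokens, is_spam):
        (self.spam if is_spam else self.ham).update(set(tokens))

    def score(self, tokens):
        probs = []
        for t in set(tokens):
            s, h = self.spam[t], self.ham[t]
            probs.append((s + 0.5) / (s + h + 1.0))  # unseen token -> 0.5
        return sum(probs) / len(probs) if probs else 0.5

def train_to_exhaustion(filt, tokens, is_spam,
                        spam_cutoff=0.9, ham_cutoff=0.1, max_rounds=50):
    """Retrain on one message until it classifies correctly; return rounds used."""
    rounds = 0
    while rounds < max_rounds:
        score = filt.score(tokens)
        correct = score >= spam_cutoff if is_spam else score <= ham_cutoff
        if correct:
            break
        filt.train(tokens, is_spam)
        rounds += 1
    return rounds

filt = ToyFilter()
message = ["ebay", "password", "confirm"]
for _ in range(3):                      # three phishing mails trained as spam
    filt.train(message, is_spam=True)

rounds = train_to_exhaustion(filt, message, is_spam=False)
print(rounds, filt.score(message))
```

The max_rounds limit is there so that a message the filter can never get
right does not loop forever.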
> Hey, I'm with you and I'm not arguing or anything. I honestly do challenge
> my own thought processes and am open to other ideas.
Good to hear. I didn't mean to insult your filtering method. I just
wanted to point out what I saw as a potential problem.
Tom