ideas, what to do? [was: image-only spam ... ]

David Relson relson at osagesoftware.com
Thu Dec 14 13:34:11 CET 2006


On Wed, 13 Dec 2006 19:53:14 -0800
John Villalovos wrote:

> On 12/12/06, Tom Anderson <tanderso at oac-design.com> wrote:
> > John Villalovos wrote:
> > > I add this to my /etc/procmailrc.  It adds a header if there is
> > > an inline image.
> > >
> > > :0 HB
> > > # If it has an inline image, put in a header to indicate so.
> > > * src=(3D)?\"cid:.*@.*\"
> > > {
> > >     :0 fwh
> > >     # Make sure space at end of header.
> > >     | formail -I"X-Inline-Image: "
> > > }
> >
> > I'm not sure that's even necessary.  Bogofilter already scores
> > "src", "mime:image", "mime:Content-ID", "baseline", "mime:gif",
> > "head:related", and other inline-image-related tokens.  It
> > certainly can't hurt to add one more though.
> 
> You are probably correct, looking at a message that ended up in my
> unsure folder:
>   "src"                   6133  0.240113  0.480200  0.666647 -
>   "head:related"          2309  0.078569  0.194878  0.712650 -
>   "mime:image"            1640  0.019196  0.182011  0.904542 +
>   "mime:Content-ID"       1606  0.016469  0.181012  0.916550 +
>   "mime:gif"              1505  0.015315  0.169769  0.917195 +
>   "baseline"               608  0.001259  0.074453  0.983212 +
>   "head:X-Inline-Image"    476  0.000839  0.058463  0.985641 +
> 
> So I'm not sure that my additional header is really making a big
> difference or not.

Hi John,

Most likely, your additional header doesn't make a big difference :-<
Big differences are hard to come by.  While taking my morning shower I
thought a bit about this subject and some big differences that have
occurred in the fight against spam.

Rule-based filtering, e.g. SpamAssassin, made a big difference,
especially since there was little effective filtering before it.

Bayesian filtering, e.g. bogofilter, SpamBayes, etc., made a big
difference as rule-based filtering became less effective.

During bogofilter's early development several capabilities provided big
improvements.  Prefixes for header tokens let the same word score
differently depending on whether it appeared in the body or in a
header, with further qualification for the From line, Subject line,
etc.  HTML processing, which scores certain tags and ignores HTML
comments, was a biggie.  Multi-part MIME processing was another biggie
since the MIME headers are significant.  Charset handling, particularly
Unicode (UTF-8), was another helper.

These biggies all have a common characteristic -- each one adds a
significant group of tokens to the wordlist and this provides a large
amount of information for scoring a message as ham or spam.  Since
bogofilter now has a rich wordlist to use for scoring, small
additions are unlikely to have a big effect.

There are several things that one can do that might be significant:

Possibility #1:  A message is composed of header and body.  They could
be scored separately, and the score furthest from 0.5 could be the
message score.  For example, if headers were 0.95 and body was 0.30 the
message would be spam, while headers 0.75 and body 0.20 would be ham.
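
A minimal sketch of Possibility #1 in Python, just to make the
arithmetic concrete.  The function names and the 0.5 cutoff are
assumptions for illustration; this is not how bogofilter works today.

```python
def most_extreme(scores):
    """Return the score lying furthest from the neutral value 0.5."""
    return max(scores, key=lambda s: abs(s - 0.5))


def classify(header_score, body_score, spam_cutoff=0.5):
    # spam_cutoff is an illustrative assumption, not a bogofilter default.
    score = most_extreme([header_score, body_score])
    return score, ("spam" if score >= spam_cutoff else "ham")


print(classify(0.95, 0.30))  # header is further from 0.5, so it decides
print(classify(0.75, 0.20))  # body is further from 0.5, so it decides
```

In the first call the header score (0.45 away from 0.5) beats the body
score (0.20 away); in the second the body score (0.30 away) beats the
header score (0.25 away).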

Possibility #2:  A message with multi-part MIME content could score
each MIME section separately.  The most extreme section, i.e.
the one scoring furthest from 0.5, could be combined with the header
to produce the message score.
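
A rough sketch of Possibility #2 using Python's standard email package.
The scorer passed in (toy_scorer here) is a hypothetical stand-in for a
real classifier, and since the way the extreme section gets "combined
with the header" is left open above, the sketch simply assumes the same
furthest-from-0.5 rule as in Possibility #1.

```python
from email import message_from_string


def furthest_from_neutral(scores):
    """Return the score lying furthest from 0.5."""
    return max(scores, key=lambda s: abs(s - 0.5))


def score_message(raw, score_text):
    """score_text: hypothetical callable mapping text to a 0..1 score."""
    msg = message_from_string(raw)
    header_text = "\n".join(f"{k}: {v}" for k, v in msg.items())
    header_score = score_text(header_text)
    # Score each leaf MIME part on its own (part headers plus payload).
    section_scores = [
        score_text(part.as_string())
        for part in msg.walk()
        if not part.is_multipart()
    ]
    extreme_section = furthest_from_neutral(section_scores)
    # Assumption: combine header and section with the same rule.
    return furthest_from_neutral([header_score, extreme_section])


def toy_scorer(text):
    # Toy rules for the demo only; a real scorer would use the wordlist.
    if "cid:" in text:
        return 0.9
    if "lunch" in text:
        return 0.3
    return 0.6


raw = (
    "From: a@example.com\n"
    "Subject: hello\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: text/plain\n"
    "\n"
    "lunch tomorrow?\n"
    "--XX\n"
    "Content-Type: text/html\n"
    "\n"
    '<img src="cid:1@x">\n'
    "--XX--\n"
)

print(score_message(raw, toy_scorer))
```

Here the HTML section with the inline image is the most extreme one, so
it decides the message score even though the plain-text section looks
like ham.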

Bogofilter's newest capability, multi-word tokens, was initially
implemented by an ISP and found effective.  For example, using double
word tokens the phrase "big difference" becomes 3 tokens, i.e. "big",
"difference", and "big*difference".  Word combinations provide a
measure of meaning and context within the message that you don't
have with single word tokens.  Using double word tokens roughly doubles
the number of tokens in a message and has a comparable effect on the
wordlist and processing time.  If you want to go wild with this
capability, it supports "n" word tokens, so you can set the multiple
as high as you want, e.g. 2, 3, 5, 10, ...
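
To illustrate the idea, here is a small Python sketch that builds
multi-word tokens joined with "*".  It is not bogofilter's code; the
function name is made up, and emitting every width from 2 up to n is an
assumption about how "n" word tokens might be formed.

```python
def multi_word_tokens(words, n=2, sep="*"):
    """Return single words plus joined runs of 2..n adjacent words."""
    tokens = []
    for i, word in enumerate(words):
        tokens.append(word)
        for width in range(2, n + 1):
            if i + width <= len(words):
                tokens.append(sep.join(words[i:i + width]))
    return tokens


print(multi_word_tokens(["big", "difference"]))
print(len(multi_word_tokens(["a", "b", "c", "d"])))  # 4 words + 3 pairs
```

With n=2, a run of k words gives k single tokens plus k-1 pairs, which
is where the "roughly doubles" figure comes from.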

Personally I'm still using single word tokens.  My incoming spam load
has increased by 185% since this spring and bogofilter is still doing
well, though image spam has increased the number of unsures.  This
month, I'm seeing 7 unsures per 1000 spam with most of them being
offers of "Office 2007 for $79".  Last month it was a different
subject causing trouble.

To summarize, big changes have big effects and small changes have small
ones.  Such is life ...

Regards,

David


