Incorrigible spam

Tue Apr 13 07:14:10 CEST 2004

On Sun, 2004-04-11 at 09:46, Tom Allison wrote:

> Training on one bad spam repeatedly might induce higher scores on 
> similar email that's ham...  leading to false readings as well.

This does not happen in my experience.  My wordlist has more than enough
inertia to resist any major swing in the opposite direction.  Each
successive registration is only a nudge.  Each established token changes
by only a fraction of a percent (usually slightly toward neutral), while
new tokens change more, which is the desired effect.
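
To put rough numbers on the "nudge", here is a minimal sketch.  It
assumes a Robinson-style smoothed token probability,
f(w) = (s*x + n*p) / (s + n), which as far as I know is the general
form bogofilter uses.  The function names, the values of s and x, and
all of the counts are illustrative assumptions, not figures from my
wordlist.

```python
def token_score(spam_hits, ham_hits, spam_msgs, ham_msgs, s=0.0178, x=0.52):
    """Robinson-style smoothed spam probability for a single token.

    s (strength of the prior) and x (score assumed for an unseen token)
    are illustrative values, not settings taken from any real config.
    """
    bad = spam_hits / spam_msgs      # fraction of spam containing the token
    good = ham_hits / ham_msgs       # fraction of ham containing the token
    n = spam_hits + ham_hits         # total evidence for this token
    p = bad / (bad + good) if (bad + good) > 0 else x
    return (s * x + n * p) / (s + n)


def shift_after_one_spam(spam_hits, ham_hits, spam_msgs, ham_msgs):
    """How far a token's score moves when one more spam containing it is registered."""
    before = token_score(spam_hits, ham_hits, spam_msgs, ham_msgs)
    after = token_score(spam_hits + 1, ham_hits, spam_msgs + 1, ham_msgs)
    return after - before


# Hypothetical wordlist: 50,000 spam and 50,000 ham already registered.
established = shift_after_one_spam(77, 1800, 50_000, 50_000)  # token seen 1877 times
brand_new = shift_after_one_spam(0, 0, 50_000, 50_000)        # token never seen before

print(f"established token moves by {established:.6f}")
print(f"new token moves by         {brand_new:.6f}")
```

With these made-up counts the established token moves by well under a
hundredth, while the never-before-seen token jumps most of the way from
neutral to spammy.  That is the behaviour I am describing, although the
exact figures depend on the wordlist and the parameters in use.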

> I have a very long and ugly script that tests all my archives and 
> reports only the inconsistencies back to me.  If I get enough of them, I 
> run a set of teaching scripts that retrain every inconsistency a second 
> time.

Over-analysis IMHO.  If I were to get an unsure due to recent
registrations (which hasn't happened yet), I'd simply correct the new
email, not worry about old ones.  This correction will set right any
"inconsistency".

> I frequently correct three spam and then get 3-7 ham that start 
> reporting unsure.  By iterating over the known body repeatedly I negate 
> the effects of certain key phrases.

This appears nearly impossible with my wordlist.  I've never had a ham
score higher than 0.15.

> I wouldn't strip anything out since it's all of value.  For me, 90% of 
> my spam has Outlook in it somewhere.  99% of my spam is yahoo, aol, 
> hotmail.  But I know you have a different angle on this.

Well, the testing bears out my hypothesis.  Stripping out non-standard
headers improves scores for both hams and spams.  Bogofilter already
strips out HTML comments... non-standard headers should be stripped for
exactly the same reason.

> I've even run into emails where I have to read it over a second time to 
> make sure it's really spam.  Hard to imagine, but some of it is getting 
> very tricky.

I have never had this happen.  A spam is clearly identifiable unless it
is extremely well targeted (an advertisement from a store you are highly
interested in buying from, but never signed up for), and doesn't play
any of the usual spammer tricks.  But if everyone on their list is that
well targeted, then is it really spam?

> What is ASN?

Autonomous System Number.  http://www.apnic.net/info/faq/as_faq.html  I
only recently learned about this, after researching it following a
discussion on this list.  ASNs identify the major networks that make up
the internet, so all traffic from a particular provider (and often a
particular region) falls under a common identity.  The benefit is that
if you don't know anybody from anywhere but a select few places on this
planet, then you may be able to identify spammers by this number alone.
I don't go quite that far though... my "spamitarium" program simply
looks up the ASN and adds it to the received line so that it is one
additional token for bogofilter to filter on.  Even if a spammer uses a
hammy ISP and a hammy webhost (e.g. AOL & Yahoo, whose ASNs might be
classified as hammy for many Americans), at least you know they are
subject to the same legal system.  Biasing my email against China,
Africa, etc., at least blocks more spam which I would have zero legal
recourse against.  It is similar in concept to bogofilter's
block_on_subnets, but doesn't require quite so many tokens.  It
therefore gives usable results more quickly, with fewer registrations.

For example, in a recent spam I received, the received line looked like
this:

Received: from 216.109.145.120 ([218.191.29.135]) by oac-design.com
        (8.9.3/8.9.3) with SMTP id AAA12832 for
        <tanderso at oac-design.com>; Sat, 3 Apr 2004 00:26:32 -0500

This would normally translate to a fairly hammy received line since
216.109.145.120 is my own server, and thus ranked very hammy. 

"rcvd:216.109.145.120"            1877  0.031123  0.007172  0.187304 +

Spamitarium translates this line as follows:

Received: from helo-216.109.145.120 218.191.29.135 as9304 
          by oac-design.com 216.109.145.120 
          for <tanderso at oac-design.com>; Sat, 3 Apr 2004 00:26:32 -0500 

Now, look at these new tokens:

"rcvd:as9304"                       10  0.000000  0.000043  0.989412 +
"rcvd:helo-216.109.145.120"        160  0.000000  0.000684  0.999326 +

The helo- prepended version is extremely spammy since I never get email
from someone who uses my server as the helo string... my users' emails
always originate from their home or office machines with my server as
the relay (pop before smtp).  And the ASN is also very spammy...

AS9304 belongs to:
HUTCHISON-AS-AP / Hutchison Telecom (HK) (website: www.apnic.net)
Control of approx 689,636 IP addresses (0.05%) in 28 groups
Issuer of approx 1,275 IP addresses (0.18%)
25 peers total (0.08%) (3 leaves)

The "HK" represents Hong Kong.  I don't know anybody in Hong Kong, and
I doubt I will ever do business with or converse with anyone there.
Therefore a spammy ASN biases this and similar emails toward spam.  I
might not have received a 218.191 or 218 token before though, so the ASN
is better since it represents (in this case) 689,636 IP addresses from
the same geographical location rather than just a few hundred or even a
few thousand.
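
For anyone who wants to experiment, the Received rewrite shown earlier
can be sketched roughly as follows.  This is not the actual spamitarium
code.  The regular expression only handles headers shaped like the
example above, and ASN_TABLE, HOST_IPS, lookup_asn and rewrite_received
are hypothetical stand-ins for a real ASN query and a real DNS lookup.

```python
import re

# Hypothetical lookup tables standing in for real ASN and DNS queries.
ASN_TABLE = {"218.191.": 9304}
HOST_IPS = {"oac-design.com": "216.109.145.120"}

# Matches only the header shape shown in the example above.
RECEIVED_RE = re.compile(
    r"Received:\s+from\s+(?P<helo>\S+)\s+\(\[(?P<ip>[\d.]+)\]\)\s+"
    r"by\s+(?P<host>\S+)\s+.*?for\s+(?P<rcpt><[^>]+>);\s*(?P<date>.+)$"
)


def lookup_asn(ip):
    """Return the ASN for an IP from the stand-in table, or None if unknown."""
    for prefix, asn in ASN_TABLE.items():
        if ip.startswith(prefix):
            return asn
    return None


def rewrite_received(header):
    """Rewrite one Received header so the HELO string and ASN become tokens."""
    unfolded = " ".join(header.split())   # undo header folding
    m = RECEIVED_RE.match(unfolded)
    if m is None:
        return header                     # leave unrecognised headers alone
    parts = ["Received: from", "helo-" + m["helo"], m["ip"]]
    asn = lookup_asn(m["ip"])
    if asn is not None:
        parts.append(f"as{asn}")          # the extra token for the filter
    parts += ["by", m["host"]]
    host_ip = HOST_IPS.get(m["host"])
    if host_ip is not None:
        parts.append(host_ip)
    parts += ["for", m["rcpt"] + ";", m["date"]]
    return " ".join(parts)


example = (
    "Received: from 216.109.145.120 ([218.191.29.135]) by oac-design.com\n"
    "        (8.9.3/8.9.3) with SMTP id AAA12832 for\n"
    "        <tanderso at oac-design.com>; Sat, 3 Apr 2004 00:26:32 -0500"
)
print(rewrite_received(example))
```

The point of the helo- prefix is that the claimed HELO string and the
real connecting address end up as different tokens, so a forged HELO
cannot borrow the hammy score of my own server's address.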

Tom
