Stripsearch

Tom Anderson tanderso at oac-design.com
Tue Jun 21 18:19:26 CEST 2005


----- Original Message ----- 
From: "Chris Fortune" <cfortune at telus.net>
> Using the Stripsearch utility for a week now.  It's eating a tolerable
> amount of resources (mostly regexes and waiting for network responses
> from the RBLs), but I found a tremendous increase in accuracy deleting
> spam with hammy text and a single <IMG> or <A HREF> tag.  Thanks.
> False positives?  The SPAM-ADDRESS token is so potent it is wiping out
> borderline "newsletters" sent from spammy neighbourhoods.  No complaints
> yet.  But what do we do about out-and-out errors from the RBLs?

I've also seen a noticeable improvement.  The last false negative I received was 
on Sunday at 11:32am.  The last unsure message I received was Monday morning 
at 7am.  Since then, 620+ spams were removed with DNSBLs at SMTP time, 37 
spams were caught by bogofilter, and of those, 22 had the SPAM-ADDRESS token 
inserted at least once.  141 hams were received.  No false positives. 
That's an accuracy of 99.87% and growing.  I used to get several unsures per 
day.
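
The message does not spell out how the 99.87% figure was computed. One reading that reproduces it, assuming the total is the sum of the three counts above and that exactly one message in the window is counted as a miss (both assumptions, not stated in the message), is sketched here:

```python
# Counts quoted in the message above.
dnsbl_rejected = 620   # "620+" spams removed at SMTP time (a lower bound)
bogofilter_spam = 37   # spams caught by bogofilter
hams = 141             # hams received

total = dnsbl_rejected + bogofilter_spam + hams

# ASSUMPTION: exactly one message counts as a miss (for example the last
# unsure); the message does not say which one, if any, was counted.
misses = 1

accuracy = 100.0 * (total - misses) / total
print(f"{total} messages, accuracy {accuracy:.2f}%")
# prints "798 messages, accuracy 99.87%"
```

With zero misses the result would be 100%, so a single miss over 798 messages is the simplest count consistent with the quoted number.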

If newsletters are identified in RBLs, you should probably notify the owner 
so that they are removed.  Newsletters sometimes end up in RBLs when people 
report newsletters they forgot they signed up for as spam, but usually only 
if unsubscribe requests are routinely not honored.

> SCAM-ADDRESS: http://paypal.com/u.d?JFC0XA9JcUHbvox=111  (from one of
> PayPal's unhelpful yet hammy "annual notices as required by federal
> law")
>
> Registering this email depotentizes the token, I guess.  Just some idle
> thoughts.

The SCAM-ADDRESS token is not generated from RBLs; it is generated when the 
href and the visible text of a link do not have the same domain.  If you can 
send me both of these URLs (the href and the visible text), I'll check to see 
if this is a bug.  BTW, make sure you have the latest version (1.0.5), as 
I've made a few bug fixes already.
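
To make the href-versus-visible-text check concrete, here is a minimal Python sketch of the idea. It is not Stripsearch's actual code: the names `host_of`, `base_domain`, `LinkCollector` and `mismatched_links` are made up for illustration, and comparing only the last two labels of each hostname is a simplifying assumption that breaks for suffixes such as co.uk.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


def host_of(text):
    """Return the lowercased hostname found in text, or None."""
    text = (text or "").strip()
    # Visible text with spaces ("click here") is not treated as a URL.
    if not text or any(c.isspace() for c in text):
        return None
    if "://" not in text:
        text = "http://" + text
    try:
        host = urlparse(text).hostname
    except ValueError:
        return None
    # Require a dot so single words are not mistaken for hosts.
    return host if host and "." in host else None


def base_domain(host):
    # ASSUMPTION: the last two labels approximate the registered domain.
    return ".".join(host.split(".")[-2:])


class LinkCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def mismatched_links(html):
    """Return links whose visible text names a different domain than the href."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    flagged = []
    for href, text in parser.links:
        href_host, text_host = host_of(href), host_of(text)
        if href_host and text_host and base_domain(href_host) != base_domain(text_host):
            flagged.append((href, text))
    return flagged
```

A link such as `<a href="http://example.net/login">www.example.com</a>` would be flagged, while a link whose visible text is just "click here" would not, since that text contains no hostname to compare against.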

I have had several hams get tagged with SPAM-ADDRESS or SCAM-ADDRESS tokens, 
but it has not affected scoring.  For example, I sent my resume to a 
potential employer (and cc'd myself), and one of my previous employers is on 
a few block lists since they do lots of bulk mailing.  Even with the token 
inserted, the email was still scored as 0.000000.  These tokens are most 
potent when there is little other information in the email, as is usually 
the case with the various pharmaceutical and software emails which only 
contain an image and maybe a few dictionary words which are often neutral. 
A genuine ham message usually has lots of hammy tokens unless all you say is 
"hey, check out this site" and it happens to be in a block list -- but even 
a human would be hard-pressed to classify that one correctly.
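
The point about potency can be illustrated with a toy chi-square combination of per-token spam probabilities, in the style of the Robinson-Fisher method. This is only a sketch: the token probabilities (0.1 for hammy tokens, 0.5 for neutral ones, 0.99 for SPAM-ADDRESS) are invented, and bogofilter's real scoring has tunable parameters that are not modeled here.

```python
import math


def chi2_sf_even(x, dof):
    """Chi-square survival function for an even number of degrees of freedom."""
    m = x / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combined_score(spam_probs):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    n = len(spam_probs)
    # Close to 1 when the tokens are collectively spammy.
    spam_evidence = chi2_sf_even(-2.0 * sum(math.log(p) for p in spam_probs), 2 * n)
    # Close to 1 when the tokens are collectively hammy.
    ham_evidence = chi2_sf_even(-2.0 * sum(math.log(1.0 - p) for p in spam_probs), 2 * n)
    return (1.0 + spam_evidence - ham_evidence) / 2.0


# Hypothetical probabilities, chosen only for illustration.
long_ham = [0.1] * 30 + [0.99]    # many hammy tokens plus one SPAM-ADDRESS token
sparse_mail = [0.5, 0.5, 0.99]    # two neutral words plus one SPAM-ADDRESS token

print(f"long ham:    {combined_score(long_ham):.6f}")
print(f"sparse mail: {combined_score(sparse_mail):.6f}")
```

Under these made-up numbers the long message stays very close to 0 despite the inserted token, while the nearly empty message is pushed well toward the spam end, which matches the behaviour described above.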

All in all, it's working out great so far.  I just hope I can find the time 
sometime soon to integrate it with spamitarium.

Tom



