A strange thought

Matthias Andree matthias.andree at gmx.de
Fri Jan 31 16:39:56 CET 2003


On Fri, 31 Jan 2003, Tom Allison wrote:

> I get a lot of mail from addresses like:
> 
> qwer at specialoffers4you.com
> ewrt at specialoffers4you.com
> rtyu at specialoffers4you.com
> 
> and so on...  where the username is randomly generated and 
> modified, but the domain portion of the email is consistent.

The easy approach is to use your mail server's access list to reject the
whole domain. However, bogofilter already splits such an address into
separate tokens:

$ echo 'qwer at specialoffers4you.com' | bogolexer
normal mode.
get_token: 1 'qwer'
get_token: 1 'specialoffers4you.com'
2 tokens read.

So you see, the domain is scored as a token of its own, and it will
quickly become a strong spam indicator once you register these three
sender addresses as spam.
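
To illustrate why the domain token gains weight so fast, here is a minimal
Python sketch. It is not bogofilter's lexer: the tokens() helper is
hypothetical, and it assumes an address splits at "@" (or at the " at "
used in this archive). The random user parts are each seen once, while
the shared domain is counted for every message.

```python
import re
from collections import Counter

def tokens(address):
    """Split an address into user part and domain.

    Hypothetical helper for illustration only, not bogofilter's lexer.
    Accepts both "user@domain" and the archive form "user at domain".
    """
    parts = re.split(r"\s+at\s+|@", address)
    return [part.strip().lower() for part in parts if part.strip()]

spam_senders = [
    "qwer at specialoffers4you.com",
    "ewrt at specialoffers4you.com",
    "rtyu at specialoffers4you.com",
]

# Count how often each token was seen in registered spam.
spam_counts = Counter()
for sender in spam_senders:
    spam_counts.update(tokens(sender))

for token, count in spam_counts.most_common():
    print(f"{token:25} {count}")
```

The domain ends up with a count of 3 and each random user part with a
count of 1, so a fresh user part adds nothing new while the domain
keeps accumulating evidence.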

I registered these three sender addresses and then tried another made-up
address. Bogofilter knows nothing about the newly invented user part, but
the specialoffers4you.com part is scored as spam:

$ echo 'yuxc at specialoffers4you.com' | bogofilter -v
X-Bogosity: Yes, tests=bogofilter, spamicity=0.999805, version=0.10.1.4.cvs.20030131
                                     n     pgood      pbad        fw invfwlog     fwlog U
"yuxc"                               0  0.000000  0.000000  0.415000 -0.53614  -0.87948 -
"specialoffers4you.com"              3  0.000000  0.000886  0.999805 -8.54284  -0.00019 +
P_Q_S_invs_logs_md                   1   0.99981   0.00019  0.999805 -8.543    -0.000 0.10
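
The numbers in this table can be reproduced with a short Python sketch of
Robinson's f(w) and geometric-mean combination. The constants are
assumptions read off the output rather than confirmed settings: ROBX is
the fw shown for the unseen token, MIN_DEV is the 0.10 in the last row,
and ROBS = 0.001 is simply a value that yields 0.999805 for n = 3.

```python
import math

ROBX = 0.415     # assumed score for unseen tokens (fw of "yuxc" above)
ROBS = 0.001     # assumed weight of that prior; reproduces 0.999805
MIN_DEV = 0.10   # assumed minimum deviation from 0.5 (last table row)

def fw(n, pgood, pbad):
    """Robinson's f(w): blend the observed spam ratio with the prior."""
    if n == 0:
        return ROBX
    p = pbad / (pgood + pbad)
    return (ROBS * ROBX + n * p) / (ROBS + n)

def combine(scores):
    """Return (P, Q, S) from the tokens that deviate enough from 0.5."""
    used = [f for f in scores if abs(f - 0.5) >= MIN_DEV]
    if not used:
        raise ValueError("no token deviates enough from 0.5")
    n = len(used)
    P = 1 - math.exp(sum(math.log(1 - f) for f in used) / n)
    Q = 1 - math.exp(sum(math.log(f) for f in used) / n)
    S = (1 + (P - Q) / (P + Q)) / 2
    return P, Q, S

f_user = fw(0, 0.0, 0.0)          # "yuxc", never seen before
f_domain = fw(3, 0.0, 0.000886)   # "specialoffers4you.com"
P, Q, S = combine([f_user, f_domain])
print(f"fw(user)={f_user:.6f} fw(domain)={f_domain:.6f} spamicity={S:.6f}")
```

Under these assumptions the unseen user part (0.415) falls inside the
0.10 band around 0.5 and is left out, which would explain the "-" in
the U column and the count of 1 in the last row. The domain token alone
then determines the result.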

> And I'm wondering if this domain pattern matching is something 
> that could be done well with a bayesian statistical approach.

Sort of. The domain will be one token among many.

> Besides the facts that I'm ignoring a ton of additional 
> functionality and a ton of addition information (BODY), does it 
> seem reasonable that domain matching might be a useful approach to 
> identifying entire domains with greater certainty, regardless of 
> any additional efforts the spammers use.

Yes, and we are already there :-)



