[bogofilter] subnets

Fri Apr 30 13:02:14 CEST 2004

I'm not sure where the philosophical lines are drawn for using (or not) the 
subnets option, but it seems that it might have some very real potential.

I've always held onto the belief that someone could successfully 
implement something like pyzor/razor to provide some real added benefit 
to spam filtering.

Unfortunately these tools have two flaws:

They use only the BODY for the signature, but today spam can so easily be 
modified with things like Bayesian blocks (that's what I call those 
blocks of 100+ dictionary words meant to spoof Bayes filters) that 
these signatures are not as effective as they could be.

I have no numbers for pyzor, but currently razor can't exceed 20% 
consistently.  IIRC, Vipul admitted that 24% is the best you'll get.  He 
also says the commercial product SpamNet has different algorithms and 
can get much higher, but I haven't personally seen it.  Besides, SpamNet 
is a proprietary not-free-as-in-beer product.

I'm sure people have considered the value of having a distributed 
bogofilter wordlist at some point in their lives (perhaps only under 
heavy influences...).  This could be either a mail-server-wide list, or 
perhaps even go as far as an internet-wide list (but that's got 1,000's 
of problems with it so we won't go there).

It seems possible, or possible enough to me to spend this time posting 
it here, that the IP information (not DNS, but aaa.bbb.ccc.ddd) is 
invariant enough to provide credible information to a distributed group 
of people to indicate spam/ham.  To be specific, I mean all the routable 
IP addresses found in the Received headers as points of consideration.
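
To make "routable IP addresses found in the Received headers" concrete, here is a minimal Python sketch. It is not how bogofilter tokenizes anything; the function name routable_received_ips and the regular expression are made up for illustration, and it only looks at dotted-quad IPv4 addresses. Anything private, loopback, or otherwise reserved is dropped.

```python
import ipaddress
import re
from email import message_from_string

# Dotted quad that is not part of a longer run of digits and dots.
IPV4_RE = re.compile(r"(?<![\d.])(\d{1,3}(?:\.\d{1,3}){3})(?![\d.])")


def routable_received_ips(raw_message):
    """Return globally routable IPv4 addresses seen in Received headers.

    Hypothetical helper: order of first appearance is kept, duplicates
    are dropped.
    """
    msg = message_from_string(raw_message)
    found = []
    for header in msg.get_all("Received", []):
        for candidate in IPV4_RE.findall(header):
            try:
                addr = ipaddress.ip_address(candidate)
            except ValueError:
                # Not a valid address (e.g. an octet above 255).
                continue
            # is_global is False for private, loopback and reserved ranges.
            if addr.is_global and candidate not in found:
                found.append(candidate)
    return found
```

A naive pattern like this will also pick up things that merely look like addresses (some software version strings, for example), so a real study would need a more careful Received parser.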

I can't ask bogofilter development team to do anything with this because 
I'm rather certain that most of them actually have lives and with the 
coming of Spring in the northern half of the planet, life will probably 
get in the way of coding more than usual.

But I am asking whether this even makes sense.

If it seems reasonable enough to start looking into feasibility 
studies, then it might make sense to collect bogofilter wordlist 
information from other people's ^url: listings to see if there is 
sufficient and consistent overlap to provide a reliable means of detection.
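
Such an overlap check could start out as something like the Python sketch below. It assumes each contributor supplies a text dump of their wordlist with one entry per line in the form "token spamcount hamcount [date]"; that layout, and the helper names url_entries, leans_spam and overlap_report, are assumptions for illustration only.

```python
def url_entries(dump_lines, prefix="url:"):
    """Keep only the entries whose token starts with the given prefix.

    Assumed line layout: token, spam count, ham count, optional date.
    """
    entries = {}
    for line in dump_lines:
        fields = line.split()
        if len(fields) < 3 or not fields[0].startswith(prefix):
            continue
        try:
            entries[fields[0]] = (int(fields[1]), int(fields[2]))
        except ValueError:
            continue  # skip lines whose counts are not integers
    return entries


def leans_spam(counts):
    """Crude label: seen more often in spam than in ham."""
    spam, ham = counts
    return spam > ham


def overlap_report(a, b):
    """Compare two users' url: entries."""
    shared = set(a) & set(b)
    union = set(a) | set(b)
    agree = sum(1 for tok in shared if leans_spam(a[tok]) == leans_spam(b[tok]))
    return {
        "shared": len(shared),
        # Fraction of all tokens that both users have seen.
        "jaccard": len(shared) / len(union) if union else 0.0,
        # Fraction of shared tokens that both users label the same way.
        "agreement": agree / len(shared) if shared else 0.0,
    }
```

Comparing raw counts like this ignores that users train on very different amounts of mail, so it is only a first look at whether the overlap is there at all.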

My theory goes something like this:
If we can identify probabilities of spam/ham IP addresses with 
sufficient accuracy and precision, then it seems reasonable that this 
information could serve as the basis for a variation of an automated RBL 
list.  But that depends on how the data is shared.
I suspect that through consistent scoring feedback, a lot of the high 
volume spam-bots out there would be quickly identified.
I also suspect that this IP address information would be less 
variable between users than the rest of their wordlist values.
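
As a rough illustration of that theory, the Python sketch below pools per-IP counts from several users and turns them into a smoothed spam probability. The smoothing follows the general shape of Robinson's f(w) formula; the parameter values robx=0.5 and robs=1.0 are placeholders, not recommended settings, and the function names pool_counts and spam_probability are hypothetical.

```python
from collections import defaultdict


def pool_counts(user_tables):
    """Sum (spam, ham) counts per IP across several users' tables."""
    pooled = defaultdict(lambda: [0, 0])
    for table in user_tables:
        for ip, (spam, ham) in table.items():
            pooled[ip][0] += spam
            pooled[ip][1] += ham
    return {ip: tuple(counts) for ip, counts in pooled.items()}


def spam_probability(spam, ham, spam_total, ham_total, robx=0.5, robs=1.0):
    """Smoothed probability that mail relayed through an IP is spam.

    spam_total and ham_total are the pooled message counts; robx is the
    score assumed for an unseen IP and robs is the weight given to that
    assumption (both values here are illustrative).
    """
    bad = spam / spam_total if spam_total else 0.0
    good = ham / ham_total if ham_total else 0.0
    n = spam + ham
    raw = bad / (bad + good) if (bad + good) > 0 else robx
    # With few sightings the score stays near robx; it approaches raw as n grows.
    return (robs * robx + n * raw) / (robs + n)
```

An IP whose pooled score stays above some agreed threshold would then be a candidate for the automated list; where to set that threshold, and how to share the counts, is exactly the open question.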