article on blocking by subnets - Justification

Barry Gould BarryGould at PennySaverUSA.net
Thu Dec 5 20:09:59 CET 2002


Hi,

I've been trying to make sure that storing subnet addresses is worthwhile...

I've taken all the IPs and counts in my good and spam databases, grouped 
them by class A, B, & C subnets (I use the term loosely), and made some 
plots (attached) for the different subnets.
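
A minimal sketch of the grouping step, assuming "class A, B, & C" here 
means the first one, two, and three dotted octets of each address (my 
reading of "loosely", not something the message states). The function 
names and sample counts are hypothetical.

```python
from collections import Counter

def prefix(ip, octets):
    """Return the first `octets` dotted octets of an IPv4 address string."""
    return ".".join(ip.split(".")[:octets])

def group_counts(ip_counts, octets):
    """Sum per-IP counts into per-prefix counts (1=A, 2=B, 3=C, loosely)."""
    grouped = Counter()
    for ip, n in ip_counts.items():
        grouped[prefix(ip, octets)] += n
    return grouped

# Hypothetical sample data: IP -> number of times seen in spam.
spam_counts = {"180.1.2.3": 5, "180.1.9.9": 2, "12.4.5.6": 1}

class_a = group_counts(spam_counts, 1)  # keys "180" and "12"
class_b = group_counts(spam_counts, 2)  # keys "180.1" and "12.4"
class_c = group_counts(spam_counts, 3)  # keys "180.1.2", "180.1.9", "12.4.5"
```

Running the same grouping over the good database gives the second count 
needed for each prefix.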

If anyone wants a copy of the data or my Excel file, let me know.

Observations:

On my data, for Class B & C nets, there are definitely strong correlations 
between nets within the same class A or B net.
For example, almost everything in 12.0.0.0 is ham, whereas almost everything 
from 180.x ... 189.x is spam.
(Note the very large numbers of 0 and 1 probabilities on the B & C graphs.)
Therefore, I think it IS a good idea to store the Class B and C subnets as 
tokens.
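
As a rough illustration of storing subnets as tokens, the sketch below 
emits the full address plus its three-octet and two-octet prefixes for each 
IP found in a message. The token format is an assumption made for this 
example; it is not necessarily what bogofilter stores.

```python
def ip_tokens(ip, levels=(4, 3, 2)):
    """Return the address and its shorter prefixes as separate tokens.

    levels lists how many leading octets each token keeps:
    4 = full IP, 3 = "class C" prefix, 2 = "class B" prefix.
    """
    parts = ip.split(".")
    return [".".join(parts[:n]) for n in levels]

tokens = ip_tokens("12.34.56.78")
# tokens == ["12.34.56.78", "12.34.56", "12.34"]
```

Adding `1` to `levels` would also emit the one-octet prefix, which is the 
case where the data is less clear.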

Class A is not quite as clear, but there are still a lot of points at 0 and 
1. Perhaps someone more knowledgeable in statistics can advise?

In instances where a unique IP (not a subnet) appears in both ham and spam, 
the p(spam) still seems to be quite high. I suspect that most of this ham 
data consists of false negatives anyway.
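
For a token that appears in both databases, one simple way to estimate 
p(spam) is the plain ratio of its two counts. This is only a sketch under 
that assumption: the message does not say how the plotted probabilities 
were computed, and a real filter may scale or smooth the counts. The 
function name and the sample counts are hypothetical.

```python
def p_spam(spam_count, ham_count):
    """Estimate p(spam) for a token as spam / (spam + ham).

    Returns None when the token was never seen, since the ratio is undefined.
    """
    total = spam_count + ham_count
    if total == 0:
        return None
    return spam_count / total

# Hypothetical counts for one IP seen in both databases.
estimate = p_spam(spam_count=9, ham_count=1)  # 0.9
```

Tokens seen in only one database come out at exactly 0 or 1 under this 
estimate.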

Private networks:  10.0.0.0 - mostly ham ( p(spam)=0.29 ) ... 192.168 - 
mostly spam (0.64)


Concerns:

I suppose lots of forged headers in spam _could_ throw off the probabilities.



More info on my data:
spamlist.db: 2.4MB      (trained with several thousand spams from my own 
corpora, as well as some I downloaded from sources mentioned on the 
mailing lists)
goodlist.db:  8.8MB     (trained with many thousand non-spams)

unique IPs in spamlist:         4134
unique IPs in goodlist:         18060

Class A nets:   254
Class B nets:   5356
Class C nets:   15399

A few of the IPs aren't really IPs, but instead look like phone numbers, 
version numbers, etc. There are so few of them that they are insignificant 
for this demonstration. Better filtering could be applied in the code.
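
One possible form of that filtering, sketched with Python's standard 
`ipaddress` module: keep only strings that parse as a dotted-quad IPv4 
address. The helper name is hypothetical. Note that a four-part version 
number with small components still parses as an address, so this check 
removes only some of the false matches.

```python
import ipaddress

def is_ipv4(text):
    """Return True if text parses as a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(text)
    except ValueError:
        # AddressValueError is a subclass of ValueError.
        return False
    return True

candidates = ["12.34.56.78", "555.1212", "300.1.2.3", "1.2.3"]
valid = [c for c in candidates if is_ipv4(c)]
# valid == ["12.34.56.78"]
```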

Thanks,
Barry
-------------- next part --------------
A non-text attachment was scrubbed...
Name: collisions.png
Type: image/png
Size: 21662 bytes
Desc: not available
URL: <http://www.bogofilter.org/pipermail/bogofilter/attachments/20021205/9d33cb1e/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ClassC.png
Type: image/png
Size: 19922 bytes
Desc: not available
URL: <http://www.bogofilter.org/pipermail/bogofilter/attachments/20021205/9d33cb1e/attachment-0001.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ClassB.png
Type: image/png
Size: 26370 bytes
Desc: not available
URL: <http://www.bogofilter.org/pipermail/bogofilter/attachments/20021205/9d33cb1e/attachment-0002.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ClassA.png
Type: image/png
Size: 15724 bytes
Desc: not available
URL: <http://www.bogofilter.org/pipermail/bogofilter/attachments/20021205/9d33cb1e/attachment-0003.png>

