article on blocking by subnets - Justification
Barry Gould
BarryGould at PennySaverUSA.net
Thu Dec 5 20:09:59 CET 2002
Hi,
I've been trying to make sure that storing subnet addresses is worthwhile...
I've taken all the IPs and counts in my good and spam databases, grouped
them by class A, B, & C subnets (I use the term loosely), and made some
plots (attached) for the different subnets.
If anyone wants a copy of the data or my Excel file, let me know.
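For anyone who wants to repeat the grouping on their own data, here is a minimal Python sketch of the idea, not the code or spreadsheet used for the attached plots. It assumes per-IP counts are already available as plain dictionaries; the names subnet_prefixes, group_counts and p_spam are made up for illustration, the sample counts are invented, and p(spam) is taken as the naive fraction spam / (spam + good), which may differ from what bogofilter computes.

```python
from collections import defaultdict

def subnet_prefixes(ip):
    """Return the /8, /16 and /24 prefixes of a dotted-quad IP.

    These are "class A/B/C" only in the loose sense used in this post.
    """
    octets = ip.split(".")
    return {
        "A": octets[0],
        "B": ".".join(octets[:2]),
        "C": ".".join(octets[:3]),
    }

def group_counts(ip_counts):
    """Sum per-IP counts into per-subnet counts keyed by (level, prefix)."""
    totals = defaultdict(int)
    for ip, count in ip_counts.items():
        for level, prefix in subnet_prefixes(ip).items():
            totals[(level, prefix)] += count
    return totals

def p_spam(spam_count, good_count):
    """Naive spam fraction (an assumption); None when there is no data."""
    total = spam_count + good_count
    return spam_count / total if total else None

# Invented sample counts, for illustration only.
spam = {"180.1.2.3": 5, "180.1.9.9": 3, "12.4.5.6": 1}
good = {"12.4.5.7": 9, "12.8.0.1": 10}

spam_nets = group_counts(spam)
good_nets = group_counts(good)
for key in sorted(set(spam_nets) | set(good_nets)):
    print(key, p_spam(spam_nets.get(key, 0), good_nets.get(key, 0)))
```

Plotting the resulting values per level gives the kind of histogram discussed below, with subnets that are purely ham or purely spam landing at 0 and 1.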
Observations:
On my data, for Class B & C nets, there are definitely strong correlations
between nets within the same class A or B net.
e.g. almost everything in 12.0.0.0 is ham, whereas almost everything from
180.x ... 189.x is spam.
(note the very large numbers of 0 and 1 probabilities on the B & C graphs)
Therefore, I think it IS a good idea to store the Class B, & C subnets as
tokens.
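To make the proposal concrete, a rough sketch of what "storing the subnets as tokens" could look like at the lexer level: each IP found in a message yields the full address plus its two-octet and three-octet prefixes. The function name ip_tokens and the bare-prefix token format are assumptions for illustration, not bogofilter's actual tokenizer.

```python
def ip_tokens(ip):
    """Return the full address plus its three- and two-octet prefixes.

    Anything that is not four dot-separated parts is passed through as-is.
    """
    octets = ip.split(".")
    if len(octets) != 4:
        return [ip]
    return [ip, ".".join(octets[:3]), ".".join(octets[:2])]

print(ip_tokens("12.34.56.78"))
```

With this scheme an address never seen before can still pick up evidence from its neighbours, because its prefix tokens may already be in the databases.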
Class A is not quite as clear, but there are still a lot of points at 0 and
1. Perhaps someone more knowledgeable in statistics can advise?
In instances where a unique IP (not a subnet) appears in both ham & spam,
it seems the p(spam) is still quite high. I suspect that most of this ham
data is probably false negatives anyway.
Private networks: 10.0.0.0 - mostly ham ( p(spam)=0.29 ) ... 192.168 -
mostly spam (0.64)
Concerns:
I suppose lots of forged headers in spam _could_ throw off the probabilities.
More info on my data:
spamlist.db: 2.4MB (trained with several thousand spams from my own
corpora, as well as some I downloaded from sources mentioned on the
mailinglists)
goodlist.db: 8.8MB (trained with many thousand non-spams)
unique IPs in spamlist: 4134
unique IPs in goodlist: 18060
Class A nets: 254
Class B nets: 5356
Class C nets: 15399
A few of the IPs aren't really IPs, but instead look like phone numbers,
version numbers, etc., but there are so few that they are insignificant for
this demonstration. Better filtering could be applied in the code.
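One possible form of that filtering, sketched with Python's standard ipaddress module (the helper name is_plausible_ipv4 is made up): reject any candidate that does not parse as a dotted quad with every octet in 0-255. This would catch most phone-number lookalikes, but a four-part version number such as 2.0.1.5 is still a syntactically valid address and would get through without looking at the surrounding header context.

```python
import ipaddress

def is_plausible_ipv4(text):
    """True if text parses as a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(text)
    except ValueError:
        # Wrong number of parts, non-numeric parts, or an octet above 255.
        return False
    return True

for candidate in ["12.34.56.78", "555.123.4567", "800.555.12.12", "1.2.3"]:
    print(candidate, is_plausible_ipv4(candidate))
```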
Thanks,
Barry
Attachments (scrubbed from the archive, all image/png):
collisions.png (21662 bytes): <http://www.bogofilter.org/pipermail/bogofilter/attachments/20021205/9d33cb1e/attachment.png>
ClassC.png (19922 bytes): <http://www.bogofilter.org/pipermail/bogofilter/attachments/20021205/9d33cb1e/attachment-0001.png>
ClassB.png (26370 bytes): <http://www.bogofilter.org/pipermail/bogofilter/attachments/20021205/9d33cb1e/attachment-0002.png>
ClassA.png (15724 bytes): <http://www.bogofilter.org/pipermail/bogofilter/attachments/20021205/9d33cb1e/attachment-0003.png>