[bogofilter] subnets
Tom Allison
tallison at tacocat.net
Fri Apr 30 13:02:14 CEST 2004
I'm not sure where the philosophical lines are for using (or not using)
the subnets option, but it seems to have some very real potential.
I've always held onto the belief that someone could successfully
implement something like pyzor/razor to provide some real added benefit
to spam filtering.
Unfortunately these tools have two flaws:
They use only the BODY for the signature, but today spam can so easily
be modified with things like Bayesian blocks (that's what I call those
blocks of 100+ words from the dictionary meant to spoof Bayes filters)
that these signatures are not as effective as they could be.
I have no numbers for pyzor, but currently razor can't exceed 20%
consistently. IIRC, Vipul admitted that 24% is the best you'll get. He
also says the commercial product SpamNet has different algorithms and
can get much higher, but I haven't personally seen it. Besides, SpamNet
is a proprietary not-free-as-in-beer product.
I'm sure people have considered the value of having a distributed
bogofilter wordlist at some point in their lives (perhaps only under
heavy influences...). This could be either a mail-server-wide list, or
perhaps, getting really weird, an Internet-wide list (but that's got
1,000s of problems with it so we won't go there).
It seems possible, or possible enough to me to spend this time posting
it here, that the IP information (not DNS names, but aaa.bbb.ccc.ddd) is
invariant enough to provide credible information to a distributed group
of people to indicate spam/ham. To clarify, I mean all the routable IP
addresses found in the Received headers.
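
To make "routable IP addresses found in the Received headers" concrete,
here is a rough Python sketch. It is only an illustration, not anything
bogofilter does: the function name and the regular expression are made up
for this example, it handles IPv4 only, and "routable" is approximated by
the standard library's is_global check.

```python
import email
import ipaddress
import re

# Dotted-quad candidates; anything out of range is rejected by ipaddress below.
IPV4_RE = re.compile(r"(?<![\d.])(\d{1,3}(?:\.\d{1,3}){3})(?![\d.])")


def routable_received_ips(raw_message):
    """Return globally routable IPv4 addresses from Received headers, in order.

    Hypothetical helper for illustration; private, loopback and other
    non-global addresses are skipped, and duplicates are dropped.
    """
    msg = email.message_from_string(raw_message)
    found = []
    for header in msg.get_all("Received", []):
        for candidate in IPV4_RE.findall(header):
            try:
                addr = ipaddress.ip_address(candidate)
            except ValueError:
                continue  # e.g. 999.1.1.1 or octets with leading zeros
            if addr.is_global and candidate not in found:
                found.append(candidate)
    return found
```

Each address returned this way could then be treated as one more token to
score, alongside the usual words.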
I can't ask the bogofilter development team to do anything with this
because I'm rather certain that most of them actually have lives, and
with the coming of Spring in the northern half of the planet, life will
probably get in the way of coding more than usual.
But I am asking whether this even makes sense.
If it seems reasonable enough to start looking into feasibility
studies, then it might make sense to collect bogofilter wordlist
information from other people's ^url: listings to see if there is
sufficient and consistent overlap to provide a reliable means of detection.
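
A feasibility check along those lines might look like the sketch below.
It assumes each person supplies a plain-text dump of their wordlist with
one token per line followed by a spam count and a ham count (the layout
I'd expect from something like `bogoutil -d`, but treat that as an
assumption). The function names and the two measures are my own choices:
the share of url: tokens two users have in common, and how often they
agree on which way a shared token leans.

```python
def parse_dump(text, prefix="url:"):
    """Map token -> (spam_count, ham_count) for tokens starting with prefix.

    Assumed line layout: "<token> <spam_count> <ham_count> [date]".
    Lines that do not fit are ignored.
    """
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3 or not parts[0].startswith(prefix):
            continue
        try:
            counts[parts[0]] = (int(parts[1]), int(parts[2]))
        except ValueError:
            continue
    return counts


def token_overlap(a, b):
    """Return (jaccard, agreement) for two parsed wordlists.

    jaccard:   shared tokens / all tokens seen by either user
    agreement: fraction of shared tokens where both users' counts lean
               the same way (more spam than ham, or not)
    """
    shared = set(a) & set(b)
    union = set(a) | set(b)
    jaccard = len(shared) / len(union) if union else 0.0

    def leans_spam(c):
        return c[0] > c[1]

    agree = sum(1 for t in shared if leans_spam(a[t]) == leans_spam(b[t]))
    agreement = agree / len(shared) if shared else 0.0
    return jaccard, agreement
```

Low overlap would suggest the lists are too personal to share; high
overlap with low agreement would suggest the same addresses mean
different things to different people.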
My theory goes something like this:
If we can identify probabilities of spam/ham IP addresses with
sufficient accuracy and precision, then it seems reasonable that this
information could serve as the basis for a variation of an automated RBL
list. But that depends on how the data is shared.
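
As a sketch of what a per-IP probability could look like once counts are
shared, the Python below pools (spam, ham) counts from several users and
applies a Robinson-style smoothed estimate, f = (s*x + n*p) / (s + n).
The pooling step, the function names, and the values s=1.0 and x=0.5 are
assumptions for illustration, not bogofilter's actual settings or code.

```python
from collections import defaultdict


def pool_counts(user_reports):
    """Sum per-IP (spam, ham) counts over several users' reports.

    user_reports: iterable of dicts mapping ip -> (spam_count, ham_count).
    """
    pooled = defaultdict(lambda: [0, 0])
    for report in user_reports:
        for ip, (spam, ham) in report.items():
            pooled[ip][0] += spam
            pooled[ip][1] += ham
    return {ip: tuple(c) for ip, c in pooled.items()}


def ip_spam_probability(spam_count, ham_count, total_spam, total_ham,
                        s=1.0, x=0.5):
    """Smoothed probability that a message carrying this IP is spam.

    p is the raw estimate from relative frequencies; s (strength of the
    prior) and x (value assumed for unseen tokens) are placeholder values.
    """
    bad = spam_count / total_spam if total_spam else 0.0
    good = ham_count / total_ham if total_ham else 0.0
    n = spam_count + ham_count
    p = bad / (bad + good) if (bad + good) > 0 else x
    return (s * x + n * p) / (s + n)
```

An address nobody has seen stays at x, and one reported many times by
many users moves quickly toward 0 or 1, which is roughly the behaviour an
automated blocklist would need.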
I suspect that through consistent scoring feedback, a lot of the high
volume spam-bots out there would be quickly identified.
I also suspect that the IP address information would vary less between
users than the rest of their wordlist values.