Yet another spam filter

Tom Anderson tanderso at oac-design.com
Wed May 19 15:45:28 CEST 2004


From: "Tom Allison" <tallison at tacocat.net>
> http://www.macdevcenter.com/pub/a/mac/2004/05/18/spam_pt2.html
>
> I guess the iMac has a pretty decent filtering application, but it's
> based on a different model of word recognition.

The author didn't handle his description of Bayesian filters very well.  In
reality, humans (and all animals) use a Bayesian model in their own decision
making, inference, and inductive logic, so summarizing it as a weighted
keyword system is a bit simplistic.  Also, when he says "most" email filters
this or that, he clearly isn't including statistical filters, only
rule-based filters.

Nonetheless, the idea of word vectors and clustering is appealing.  Not just
for spam filtering, but for content sorting and searching as well.  From
what I understand, the clustering aspect will allow your search for "Aunt
Emma's recipes" to receive results for "family cooking secrets" from Uncle
Jack, even though none of the search terms may be present.  I'm just not
sure it's the "ideal" method of making a binary decision for spam vs ham.
How many clusters might intersect an email from Aunt Emma discussing her
conversation with a doctor about the benefits of Viagra?  Using the
statistical approach, certain tokens such as Aunt Emma's email address, her
header tokens, and even common tokens in her style of speech, will
contribute disproportionately to the hamminess of the score.  But using
vectors, it would likely be clustered with Viagra spam most strongly.  Thus
the claim of only 98%+ accuracy, whereas statistical filters can usually get
99.5%+ (or as Dr. Yerazunis writes, 99.9%).  One error in 50 messages versus
one in 200, or one in 1000, is a big difference.  Maybe the vector approach
can be improved for even better results, but I think it is more useful in
other respects.
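
To make the Aunt Emma case concrete, here is a rough sketch of how per-token
probabilities can be combined in the plain naive-Bayes style.  The token
names and probability values are invented for illustration, and real filters
such as bogofilter or CRM114 use more refined combining methods, so treat
this as a toy rather than as what any of them actually do.

```python
import math

def combine(token_probs):
    """Combine per-token spam probabilities naive-Bayes style, in log space."""
    probs = list(token_probs)
    log_spam = sum(math.log(p) for p in probs)
    log_ham = sum(math.log(1.0 - p) for p in probs)
    # Same value as prod(p) / (prod(p) + prod(1 - p)), computed stably.
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))

# Hypothetical per-token spam probabilities (made-up numbers).
aunt_emma_mail = {
    "viagra": 0.99,                 # strongly spammy on its own
    "from:emma@example.com": 0.01,  # her address has only ever been ham
    "doctor": 0.40,
    "dear": 0.20,
    "love,": 0.05,                  # typical of her sign-off
}

score = combine(aunt_emma_mail.values())
viagra_only = combine([0.99])

print(f"whole message: {score:.4f}")
print(f"'viagra' alone: {viagra_only:.4f}")
```

With these made-up numbers the one spammy token is outvoted by the sender
and style tokens, and the message lands well on the ham side.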

The pairing of clustering with Bayesian-style decision making would make for
a very, very intelligent mail system: one that could easily tell a genuine,
solicited mortgage offer or real-estate mailing list from a mortgage spam,
and sort them into appropriate folders based on topic.  It would know not
only whether a message is spam or ham, but whether it is mortgage-related
spam or cooking-related spam, which would be very useful when scanning for
false positives.
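
A minimal sketch of what that pairing might look like, assuming the spam
score comes from a separate statistical filter and the topic comes from the
nearest word-vector centroid.  The centroids, folder names, and threshold
here are all hypothetical; a real system would learn its clusters from mail
rather than from hand-picked word lists.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical topic centroids; stand-ins for learned clusters.
TOPIC_CENTROIDS = {
    "mortgage": Counter("mortgage rate loan refinance lender".split()),
    "cooking": Counter("recipe oven butter flour cooking".split()),
}

def nearest_topic(text):
    """Pick the topic whose centroid is most similar to the message."""
    words = Counter(text.lower().split())
    return max(TOPIC_CENTROIDS, key=lambda t: cosine(words, TOPIC_CENTROIDS[t]))

def route(text, spam_score, threshold=0.5):
    """Folder name from an externally supplied spam score plus the topic."""
    verdict = "spam" if spam_score >= threshold else "ham"
    return f"{verdict}/{nearest_topic(text)}"

# The spam scores below are assumed outputs of a statistical filter.
folder_a = route("refinance your mortgage at a low rate today", 0.98)
folder_b = route("your mortgage statement and loan rate for may", 0.02)
print(folder_a)
print(folder_b)
```

Both messages cluster under the same topic, but the statistical score still
makes the spam/ham call, so a false positive would show up in a folder small
enough to scan by eye.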

Tom




More information about the Bogofilter mailing list