spam conference report
Gyepi SAM
gyepi at praxis-sw.com
Sat Jan 18 23:49:46 CET 2003
Hi,
I know that those of you on the bogofilter-dev list have already received a
copy of this message. I think it's of sufficient general interest that I'm
redirecting a copy of it to the bogofilter list.
David
Notes from the spam conference held at MIT in Cambridge, MA, USA on January 17th.
There are webcasts online at www.spamconference.org so this will be limited
to my impressions and how it relates to bogofilter.
It was much larger than I had expected: according to the organizers, 580
people
attended. I got the impression that many of the attendees were not
technical folks and many were from companies that do or want to provide
spam prevention services. I suspect we'll be seeing more companies entering
the market.
Things we should consider doing.
Bill Yerazunis talked about the CRM114 Discriminator (crm114.sf.net) which,
among other things, counts multi-word sequences in addition to single words.
He claims
that this increases accuracy and Paul Graham later remarked that he might
add that feature to his filter. I think this could even help with the
HTML-comment-separating-word-components problem, though there is a better way.
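For illustration, here is a minimal Python sketch of that idea: emit adjacent word pairs alongside single words. (This is only the spirit of multi-word counting; CRM114's actual scheme is considerably more elaborate, and the tokenizer here is a crude assumption.)

```python
import re

def tokenize(text):
    # crude word splitter; real filters tokenize much more carefully
    return re.findall(r"[A-Za-z0-9'$!.-]+", text)

def features(text, n=2):
    """Emit every single word plus every run of n adjacent words,
    so 'buy cheap pills' also yields 'buy cheap' and 'cheap pills'."""
    words = tokenize(text)
    feats = list(words)
    for i in range(len(words) - n + 1):
        feats.append(" ".join(words[i:i + n]))
    return feats
```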
John Graham-Cumming, creator of POPFile (popfile.sf.net) talked about some
of the tricks that spammers use to hide from filters. They include:
1. The message consists of images and urls and no text. I have received
spam of this sort myself, and IIRC, bogofilter did not catch it.
2. Using decoy keywords in an 'X-*' email header or html keywords section.
The intent here is to throw off our spamicity counts.
3. Using html comments to break up words. This we have seen.
4. Place the message in multiple horizontally arranged html tables. John
characterized this as 'dastardly' and I have to agree.
5. The message is in MIME multipart/alternative format with text and html
sections. The text contains a very large, innocuous document while the html
contains the spam. Obviously the spammer is counting on the MUA displaying
the html rather than the text. My initial solution for this is to score the
parts separately and compare the scores. More on that later.
6. Encoding urls in decimal, hex or octal format and, optionally, adding
some multiple of 2^32 to the result, since popular clients perform base-256
arithmetic (modulo 2^32) on ip addresses.
7. The message is in html format in which the body part is empty, but the
body tag contains an 'OnLoad' attribute which references a piece of
JavaScript that writes into the body. Worse yet, the message is encoded
within the javascript, and decoded when written.
8. Replace some letters with numbers
9. Use accented characters.
10. Add large random words, apparently designed to throw off
message checksums for techniques like Vipul's Razor
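Trick #6 can be undone by canonicalizing numeric hosts before scoring. A rough sketch, assuming the wrap-at-2^32 behavior described above (the function name is mine):

```python
import re

def decode_numeric_ip(host):
    """Decode a dword, hex or octal host back to dotted-quad form.
    The modulo undoes the 'add a multiple of 2^32' trick."""
    if re.fullmatch(r"0x[0-9a-fA-F]+", host):
        n = int(host, 16)
    elif re.fullmatch(r"0[0-7]+", host):
        n = int(host, 8)
    elif re.fullmatch(r"\d+", host):
        n = int(host)
    else:
        return host          # already a name or dotted quad
    n %= 2 ** 32             # clients wrap at 2^32
    return ".".join(str((n >> s) & 0xFF) for s in (24, 16, 8, 0))
```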
We can deal with some of these by:
1. parsing the html into plain text, just as 'lynx -dump' would do
2. converting all text into a canonical format, which will foil charset trickery
3. comparing alternative formats, if present, for similarity, for some
definition of similar
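As a sketch of #1 and #3: crude regex flattening (nothing like a full lynx-style renderer) and Jaccard word-set overlap as one possible definition of similar. Both choices are my assumptions for illustration:

```python
import re

def html_to_text(html):
    """Crude flattening: deleting comments first re-joins words that
    spammers split with <!-- ... -->, then tags become whitespace."""
    html = re.sub(r"<!--.*?-->", "", html, flags=re.S)
    return re.sub(r"<[^>]+>", " ", html)

def similarity(a, b):
    """Jaccard overlap of word sets: 1.0 means identical vocabulary."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)
```

Scoring `html_to_text(html_part)` against the text part would flag trick #5's innocuous-text/spam-html split, since the two vocabularies barely overlap.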
Paul Graham spoke about what he has learned since writing his paper. His points include:
1. use header tags to increase accuracy. "Subject:spam Word" are stored as
"Subject:spam" and "Subject:Word".
2. Maintain the original case (see above).
3. Parse html elements for attributes.
4. Don't exclude [^[:alnum:]] characters. This has the interesting result
that "spam!!" and "spam!!!" are different words. A suggestion for unknown
words is to look for the closest match, so "spam!!!!" will return the count
for "spam!!!"
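A sketch of #1 and the closest-match idea from #4. The degradation order here, strip trailing punctuation, then lowercase, is my own assumption, not Graham's exact rule:

```python
def header_tokens(header, value):
    """Prefix each token with its header name, so the same word
    counts differently in 'Subject:' than in the body."""
    return [f"{header}:{w}" for w in value.split()]

def lookup(db, word):
    """Exact hit first; otherwise degrade the unknown word step by
    step and return the first count found, else None."""
    candidates = [word]
    w = word
    while w and not w[-1].isalnum():   # spam!!!! -> spam!!! -> ... -> spam
        w = w[:-1]
        candidates.append(w)
    candidates.append(word.lower())
    for c in candidates:
        if c in db:
            return db[c]
    return None
```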
We could extend #4 by storing a word twice: once in original case, with all
characters and once in canonical format (lower case, alphanumeric) and if a
word is not found, look for its canonical representation. A further
extension, solving Graham-Cumming's #5, is to store word phrases as well as
words. Essentially, every run of n consecutive words is stored as a single
database entry. Yes, this increases database size.
As an optimization, we could stem the canonical copies too, though this may
be a problem for non-English messages.
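A sketch of that dual storage (the names and the n=2 default are mine; bogofilter's actual database layout differs):

```python
import re

def canonical(word):
    """Lowercase alphanumerics only: 'Spam!!!' -> 'spam'."""
    return re.sub(r"[^a-z0-9]", "", word.lower())

def store(db, words, n=2):
    """Count each word under its original form and, when different,
    its canonical form; also count every n-word phrase as one entry."""
    for w in words:
        db[w] = db.get(w, 0) + 1
        c = canonical(w)
        if c and c != w:
            db[c] = db.get(c, 0) + 1
    for i in range(len(words) - n + 1):
        p = " ".join(words[i:i + n])
        db[p] = db.get(p, 0) + 1
```

At lookup time, a miss on the original form would fall back to `canonical(word)`, mirroring the fallback described above.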
Conclusions:
1. There are too many people working on different versions of statistical
filters and IMNSHO, too much overlap and wasted effort.
2. Statistical filtering will catch a lot of spam, but not all. To be truly
successful, one needs a layered approach where statistics comprise only one
component. As a framework, SpamAssassin comes closest to this.
3. It has become clear that as the number and variety of users increase,
the spamicity counts decrease since different people have different levels
of tolerance for spam. Our approach, therefore, will be less effective for
large sites unless we create segmented databases of some sort.
-Gyepi