spam conference report

Gyepi SAM gyepi at praxis-sw.com
Sat Jan 18 22:59:55 CET 2003


Notes from the spam conference held at MIT, Cambridge, MA, USA on January 17th.

There are webcasts online at www.spamconference.org, so this report will be
limited to my impressions and how they relate to bogofilter.

It was much larger than I had expected: according to the organizers, 580 people
attended. I got the impression that many of the attendees were not technical
folks, and that many were from companies that provide, or want to provide, spam
prevention services. I suspect we'll be seeing more companies entering the market.

Things we should consider doing.

Bill Yerazunis talked about the CRM114 Discriminator (crm114.sf.net) which,
among other things, counts multi-word phrases in addition to single words. He
claims that this increases accuracy, and Paul Graham later remarked that he
might add that feature to his filter. I think this would even help with the
HTML-comment-separating-word-components problem, though there is a better way.
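
For illustration, here is a minimal Python sketch of the idea (my own
simplification; CRM114's actual algorithm uses a more elaborate hashing
scheme):

    import re

    def features(text, n=2):
        """Yield single words plus each run of n adjacent words."""
        words = re.findall(r"\S+", text)
        for word in words:
            yield word
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

    print(list(features("buy cheap pills now")))
    # ['buy', 'cheap', 'pills', 'now', 'buy cheap', 'cheap pills', 'pills now']

Each phrase token would then be counted and scored just like a single word.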

John Graham-Cumming, creator of POPFile (popfile.sf.net) talked about some of the tricks that spammers use to hide from filters. They include:

1. The message consists of images and URLs and no text. I have received spam of this sort myself and, IIRC, bogofilter did not catch it.

2. Using decoy keywords in an 'X-*' email header or an HTML keywords section.
The intent here is to throw off our spamicity counts.

3. Using HTML comments to break up words. This we have seen.

4. Placing the message in multiple horizontally arranged HTML tables. John characterized this as 'dastardly' and I have to agree.

5. The message is in MIME multipart/alternative format with text and HTML
sections. The text contains a very large, innocuous document while the HTML
contains the spam. Obviously the spammer is counting on the MUA displaying the
HTML rather than the text. My initial solution for this is to score the parts
separately and compare the scores. More on that later.

6. Encoding URLs in decimal, hex, or octal format and, optionally, adding some
multiple of 256^4 to the result, since popular clients perform base-256
arithmetic on IP addresses (a decoding sketch follows this list).

7. The message is in HTML format with an empty body, but the body tag contains
an 'OnLoad' attribute which references a piece of JavaScript that writes into
the body. Worse yet, the message is encoded within the JavaScript and decoded
when written.

8. Replacing some letters with numbers.

9. Using accented characters.

10. Adding large random words, apparently designed to throw off
    message checksums for techniques like Vipul's Razor.
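
To make #6 concrete, here is a small Python sketch of the decoding (the
function name is mine); the modulo step is why adding multiples of 256^4
changes nothing:

    def dword_to_dotted(n):
        """Decode a numeric ('dword') IP address the way permissive clients do."""
        n %= 256 ** 4                       # clients reduce modulo 2^32
        return ".".join(str((n >> s) & 0xFF) for s in (24, 16, 8, 0))

    print(dword_to_dotted(3232235777))             # 192.168.1.1
    print(dword_to_dotted(3232235777 + 256 ** 4))  # same address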


We can deal with some of these by
1. parsing the HTML into plain text, just as 'lynx -dump' would do
2. converting all text into a canonical format, which will foil charset trickery
3. comparing alternative formats, when present, for similarity, for some
   definition of similar (see the sketch below)
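
For #3, one simple definition of similar is the overlap between the token sets
of the two parts. A rough Python sketch (the names and the threshold are mine,
and the threshold is a guess):

    import re

    def word_set(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def parts_agree(text_part, html_part, threshold=0.2):
        """Jaccard similarity between the token sets of the two parts."""
        stripped = re.sub(r"<[^>]+>", " ", html_part)  # crude tag removal
        a, b = word_set(text_part), word_set(stripped)
        union = a | b
        similarity = len(a & b) / len(union) if union else 1.0
        return similarity >= threshold

A message whose parts disagree could be penalized outright, or scored on the
HTML part alone.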

Paul Graham spoke about what he has learned since writing his paper. His points include:

1. Use header tags to increase accuracy: the words in 'Subject: spam Word' are
   stored as 'Subject:spam' and 'Subject:Word' (see the sketch after this list).
2. Maintain the original case (see above).
3. Parse HTML elements for attributes.
4. Don't exclude [^[:alnum:]] characters. This has the interesting result that
   'spam!!' and 'spam!!!' are different words. A suggestion for unknown words is
   to look for the closest match, so 'spam!!!!' will return the count for
   'spam!!!'.
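
A minimal sketch of #1 and #2 in Python (the function name is mine):

    def header_tokens(name, value):
        """Prefix each word in a header with the header's name, preserving case."""
        return ["%s:%s" % (name, word) for word in value.split()]

    print(header_tokens("Subject", "spam Word"))
    # ['Subject:spam', 'Subject:Word']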

We could extend #4 by storing each word twice: once in its original case, with
all characters, and once in canonical format (lower case, alphanumeric only);
if a word is not found, look up its canonical representation instead. A further
extension, helping with Graham-Cumming's #3 (the comment-split words), is to
store word phrases as well as words: essentially, every run of n consecutive
words is stored as a single database entry. Yes, this increases database size.
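
A sketch of that two-level lookup, assuming a simple dict-like wordlist
(bogofilter's real database layer is Berkeley DB, so this shows only the idea):

    import re

    def canonical(word):
        """Lower-case, alphanumeric-only form of a token."""
        return re.sub(r"[^a-z0-9]+", "", word.lower())

    def lookup(counts, word):
        """Try the exact token first, then fall back to the canonical form."""
        if word in counts:
            return counts[word]
        return counts.get(canonical(word), 0)

    counts = {"spam!!!": 7, "viagra": 42}
    print(lookup(counts, "spam!!!"))  # 7, an exact hit
    print(lookup(counts, "Viagra,"))  # 42, via the canonical fallback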

As an optimization, we could stem the canonical copies too, though this may be a problem for non-English messages.
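
If we did go that far, an off-the-shelf Porter stemmer would do; a sketch
assuming NLTK is available (purely illustrative, since bogofilter itself is
written in C):

    from nltk.stem.porter import PorterStemmer

    stemmer = PorterStemmer()
    print(stemmer.stem("running"))  # 'run'
    print(stemmer.stem("offers"))   # 'offer'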

Conclusions:

1. There are too many people working on different versions of statistical filters and, IMNSHO, too much overlap and wasted effort.

2. Statistical filtering will catch a lot of spam, but not all. To be truly
successful, one needs a layered approach in which statistics are only one
component. As a framework, SpamAssassin comes closest to this.

3. It has become clear that as the number and variety of users increase, the
spamicity counts decrease, since different people have different levels of
tolerance for spam. Our approach, therefore, will be less effective for large
sites unless we create segmented databases of some sort.

-Gyepi




