spam conference report
Gyepi SAM
gyepi at praxis-sw.com
Sat Jan 18 23:49:46 CET 2003
Hi,
I know that those of you on the bogofilter-dev list have already received a
copy of this message. I think it's of sufficient general interest that I'm
redirecting a copy of it to the bogofilter list.
David
Notes from the spam conference held at MIT in Cambridge, MA, USA on January 17th.
There are webcasts online at www.spamconference.org so this will be limited
to my impressions and how it relates to bogofilter.
It was much larger than I had expected: according to the organizers, 580
people
attended. I got the impression that many of the attendees were not
technical folks and many were from companies that do or want to provide
spam prevention services. I suspect we'll be seeing more companies entering
the market.
Things we should consider doing.
Bill Yerazunis talked about the CRM114 Discriminator (crm114.sf.net) which,
among other things, counts multi-word sequences in addition to single words.
He claims
that this increases accuracy and Paul Graham later remarked that he might
add that feature to his filter. I think this could even help with the
HTML-comment-separating-word-components problem, though there is a better way.
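For illustration, here is a minimal Python sketch of that idea: emit adjacent word pairs alongside single words. (This is only the spirit of multi-word counting; CRM114's actual scheme is considerably more elaborate, and the tokenizer here is a crude assumption.)

```python
import re

def tokenize(text):
    # crude word splitter; real filters tokenize much more carefully
    return re.findall(r"[A-Za-z0-9'$!.-]+", text)

def features(text, n=2):
    """Emit every single word plus every run of n adjacent words,
    so 'buy cheap pills' also yields 'buy cheap' and 'cheap pills'."""
    words = tokenize(text)
    feats = list(words)
    for i in range(len(words) - n + 1):
        feats.append(" ".join(words[i:i + n]))
    return feats
```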
John Graham-Cumming, creator of POPFile (popfile.sf.net) talked about some
of the tricks that spammers use to hide from filters. They include:
1. The message consists of images and urls and no text. I have received
spam of this sort myself, and IIRC, bogofilter did not catch it.
2. Using decoy keywords in an 'X-*' email header or html keywords section.
The intent here is to throw off our spamicity counts.
3. Using html comments to break up words. This we have seen.
4. Place the message in multiple horizontally arranged html tables. John
characterized this as 'dastardly' and I have to agree.
5. The message is in MIME multipart/alternative format with text and html
sections. The text contains a very large, innocuous document while the html
contains the spam. Obviously the spammer is counting on the MUA displaying
the html rather than the text. My initial solution for this is to score the
parts separately and compare the scores. More on that later.
6. Encoding urls in decimal, hex or octal format and, optionally, adding
some multiple of 2^32 to the result, since popular clients perform base-256
arithmetic (modulo 2^32) on ip addresses.
7. The message is in html format in which the body part is empty, but the
body tag contains an 'OnLoad' attribute which references a piece of
JavaScript that writes into the body. Worse yet, the message is encoded
within the javascript, and decoded when written.
8. Replace some letters with numbers
9. Use accented characters.
10. Add large random words, apparently designed to throw off
message checksums for techniques like Vipul's Razor
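Trick #6 can be undone by canonicalizing numeric hosts before scoring. A rough sketch, assuming the wrap-at-2^32 behavior described above (the function name is mine):

```python
import re

def decode_numeric_ip(host):
    """Decode a dword, hex or octal host back to dotted-quad form.
    The modulo undoes the 'add a multiple of 2^32' trick."""
    if re.fullmatch(r"0x[0-9a-fA-F]+", host):
        n = int(host, 16)
    elif re.fullmatch(r"0[0-7]+", host):
        n = int(host, 8)
    elif re.fullmatch(r"\d+", host):
        n = int(host)
    else:
        return host          # already a name or dotted quad
    n %= 2 ** 32             # clients wrap at 2^32
    return ".".join(str((n >> s) & 0xFF) for s in (24, 16, 8, 0))
```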
We can deal with some of these by:
1. parsing the html into plain text, just as 'lynx -dump' would do
2. converting all text into a canonical format, which will foil charset trickery
3. comparing alternative formats, if present, for similarity, for some
definition of similar
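As a sketch of #1 and #3: crude regex flattening (nothing like a full lynx-style renderer) and Jaccard word-set overlap as one possible definition of similar. Both choices are my assumptions for illustration:

```python
import re

def html_to_text(html):
    """Crude flattening: deleting comments first re-joins words that
    spammers split with <!-- ... -->, then tags become whitespace."""
    html = re.sub(r"<!--.*?-->", "", html, flags=re.S)
    return re.sub(r"<[^>]+>", " ", html)

def similarity(a, b):
    """Jaccard overlap of word sets: 1.0 means identical vocabulary."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)
```

Scoring `html_to_text(html_part)` against the text part would flag trick #5's innocuous-text/spam-html split, since the two vocabularies barely overlap.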
Paul Graham spoke about what he has learned since writing his paper. His points include:
1. use header tags to increase accuracy. "Subject:spam Word" are stored as
"Subject:spam" and "Subject:Word".
2. Maintain the original case (see above).
3. Parse html elements for attributes.
4. Don't exclude [^[:alnum:]] characters. This has the interesting result
that "spam!!" and "spam!!!" are different words. A suggestion for unknown
words is to look for the closest match, so "spam!!!!" will return the count
for "spam!!!"
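A sketch of #1 and the closest-match idea from #4. The degradation order here, strip trailing punctuation, then lowercase, is my own assumption, not Graham's exact rule:

```python
def header_tokens(header, value):
    """Prefix each token with its header name, so the same word
    counts differently in 'Subject:' than in the body."""
    return [f"{header}:{w}" for w in value.split()]

def lookup(db, word):
    """Exact hit first; otherwise degrade the unknown word step by
    step and return the first count found, else None."""
    candidates = [word]
    w = word
    while w and not w[-1].isalnum():   # spam!!!! -> spam!!! -> ... -> spam
        w = w[:-1]
        candidates.append(w)
    candidates.append(word.lower())
    for c in candidates:
        if c in db:
            return db[c]
    return None
```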
We could extend #4 by storing a word twice: once in original case, with all
characters and once in canonical format (lower case, alphanumeric) and if a
word is not found, look for its canonical representation. A further
extension, solving Graham-Cumming's #5, is to store word phrases as well as
words. Essentially, every run of n consecutive words is stored as a single
database entry. Yes, this increases database size.
As an optimization, we could stem the canonical copies too, though this may
be a problem for non-English messages.
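A sketch of that dual storage (the names and the n=2 default are mine; bogofilter's actual database layout differs):

```python
import re

def canonical(word):
    """Lowercase alphanumerics only: 'Spam!!!' -> 'spam'."""
    return re.sub(r"[^a-z0-9]", "", word.lower())

def store(db, words, n=2):
    """Count each word under its original form and, when different,
    its canonical form; also count every n-word phrase as one entry."""
    for w in words:
        db[w] = db.get(w, 0) + 1
        c = canonical(w)
        if c and c != w:
            db[c] = db.get(c, 0) + 1
    for i in range(len(words) - n + 1):
        p = " ".join(words[i:i + n])
        db[p] = db.get(p, 0) + 1
```

At lookup time, a miss on the original form would fall back to `canonical(word)`, mirroring the fallback described above.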
Conclusions:
1. There are too many people working on different versions of statistical
filters and IMNSHO, too much overlap and wasted effort.
2. Statistical filtering will catch a lot of spam, but not all. To be truly
successful, one needs a layered approach where statistics comprise only one
component. As a framework, SpamAssassin comes closest to this.
3. It has become clear that as the number and variety of users increase,
the spamicity counts decrease since different people have different levels
of tolerance for spam. Our approach, therefore, will be less effective for
large sites unless we create segmented databases of some sort.
-Gyepi