spam conference report

David Relson relson at osagesoftware.com
Sun Jan 19 00:22:46 CET 2003


Gyepi,

A great report on the spam conference.  I'll have to commandeer my kids' 
computer to listen.  (Their machine has speakers; mine doesn't.)  Here are my 
initial thoughts ...

David

At 04:59 PM 1/18/03, Gyepi SAM wrote:

>1. The message consists of images and urls and no text. I have 
>received spam of this sort myself, and IIRC, bogofilter did not catch it.

Given bogofilter's algorithms, any really new, innovative kind of spam will 
get by it at first.  As it's trained, it will start to recognize the new 
spam's style and classify it correctly.  Possibly the tristate 
classification of the Robinson-Fisher algorithm will be helpful in dealing 
with spammers' innovations.
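For reference, the tristate idea can be sketched like this: Fisher's method combines the per-token probabilities into an indicator near 1 for spam, near 0 for ham, and near 0.5 when the evidence is ambiguous.  This is a minimal sketch, not bogofilter's actual code, and the cutoff values are illustrative:

```python
import math

def chi2_p(chi, df):
    """Survival function of chi-square for even df, via the usual series."""
    m = chi / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_indicator(probs):
    """Combine per-token spam probabilities (assumed clamped away from
    0 and 1) with Fisher's method.  Near 1 = spam, near 0 = ham."""
    n = len(probs)
    S = chi2_p(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    H = chi2_p(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    return (1.0 + S - H) / 2.0

def classify(probs, spam_cutoff=0.95, ham_cutoff=0.10):
    """Tristate verdict: spam / ham / unsure (cutoffs are illustrative)."""
    I = fisher_indicator(probs)
    if I >= spam_cutoff:
        return "spam"
    if I <= ham_cutoff:
        return "ham"
    return "unsure"
```

The "unsure" band in the middle is where a really new kind of spam would land until training pushes its tokens away from 0.5.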

>2. Using decoy keywords in an 'X-*' email header or html keywords section.
>The intent here is to throw off our spamicity counts.

Creating compound symbols, e.g. subject:hello, is likely to help.  Running 
with default settings (Robinson algorithm, ROBX of 0.415, and MIN_DEV of 
0.1), bogofilter won't include such new tokens in the spamicity calculation, 
so they'll have no effect.  Training on such tokens will teach bogofilter 
that they are spam indicators.  Lots of decoy words in headers will 
generate lots of spam-specific tokens and help bogofilter classify the 
message as spam.
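To make the "no effect" point concrete, here is a rough sketch of the token selection, with f(w) in the form from Robinson's writeup.  The strength value s=1.0 is an assumed choice, and the raw ratio assumes equally sized spam and ham corpora:

```python
ROBX = 0.415     # score assumed for a token never seen in training
MIN_DEV = 0.1    # ignore tokens whose score is within 0.1 of 0.5

def token_prob(spam_count, ham_count, s=1.0):
    """Robinson's f(w): spamicity smoothed toward ROBX for rare tokens."""
    n = spam_count + ham_count
    if n == 0:
        return ROBX
    p = spam_count / n
    return (s * ROBX + n * p) / (s + n)

def significant_tokens(counts):
    """Drop tokens too close to 0.5 to carry evidence (the MIN_DEV test).
    A brand-new decoy token scores ROBX = 0.415, and |0.415 - 0.5| < 0.1,
    so it is excluded from the spamicity calculation entirely."""
    return {t: p for t, p in
            ((t, token_prob(sc, hc)) for t, (sc, hc) in counts.items())
            if abs(p - 0.5) >= MIN_DEV}
```

Once training has counted a decoy token in enough spam, its score moves out of the dead zone and it starts contributing evidence.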

>3. Using html comments to break up words. This we have seen.

I've got experimental code to trim out the comments.  I expect it'll be in 
cvs later this weekend.
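The trick and its antidote are easy to illustrate.  This is just a sketch of the idea, not the actual bogofilter lexer:

```python
import re

# Spammers split trigger words with comments: "via<!-- x -->gra".
# Deleting the comments before tokenizing rejoins the fragments.
_COMMENT = re.compile(r'<!--.*?-->', re.DOTALL)

def strip_html_comments(html):
    return _COMMENT.sub('', html)
```

So `strip_html_comments("via<!-- foo -->gra")` yields `"viagra"`, which then tokenizes as the single word the spammer tried to hide.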

>4. Place the message in multiple horizontally arranged html tables. John 
>characterized this as 'dastardly' and I have to agree.

Are we talking one letter per column?  If not, why describe it as 
"dastardly"?

>5. The message is in MIME multipart/alternative format with text and html 
>sections. The text contains a very large, innocuous document while the 
>html contains the spam.  Obviously the spammer is counting on the MUA 
>displaying the html rather than the text. My initial solution for this is 
>to score the parts separately and compare the scores. More on that later.

I'm all ears!  We could add user options for combining them, e.g. "use the 
max score", "if both html and plain text scores exist, use the html score", ...
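Those options might look like this.  The policy names are hypothetical, purely to make the idea concrete:

```python
def combine_part_scores(scores, policy="max"):
    """Combine per-MIME-part spamicity scores into one message score.

    scores maps content type to score, e.g.
    {"text/plain": 0.02, "text/html": 0.97}.

    "max":         the spammiest part wins.
    "prefer_html": trust the part the MUA will actually display.
    """
    if policy == "prefer_html" and "text/html" in scores:
        return scores["text/html"]
    return max(scores.values())
```

With the max policy, a huge innocuous text part can't drag down the score of the spammy html part it travels with.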


>6. Encode urls in decimal, hex or octal format and, optionally, multiply 
>the result by some multiple of 256, since popular clients perform base-256 
>arithmetic on ip addresses.

URLs in the headers can't be changed too much, can they?  We already have 
the "block_on_subnets" option for creating special tokens, like 
url:123.456.78.90.  We can extend that as needed.
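Decoding the obfuscated forms back to dotted decimal before generating url: tokens would neutralize the trick.  A sketch, with function names that are mine rather than bogofilter's:

```python
def _octet_value(s):
    """Parse one numeric component the way lenient clients do:
    0x prefix = hex, leading zero = octal, otherwise decimal."""
    if s.lower().startswith('0x'):
        return int(s, 16)
    if len(s) > 1 and s.startswith('0'):
        return int(s, 8)
    return int(s, 10)

def canonical_ip(host):
    """Reduce an obfuscated numeric host to dotted-decimal form.
    Handles a single 32-bit value ("3232235777", "0xC0A80101") and
    per-octet hex/octal ("0300.0250.0x01.1").  Returns None for
    ordinary hostnames."""
    parts = host.split('.')
    try:
        if len(parts) == 1:
            n = _octet_value(parts[0]) % (1 << 32)  # clients reduce mod 2^32
            octets = [(n >> shift) & 0xFF for shift in (24, 16, 8, 0)]
        elif len(parts) == 4:
            octets = [_octet_value(p) & 0xFF for p in parts]
        else:
            return None
    except ValueError:
        return None
    return '.'.join(str(o) for o in octets)
```

All the disguises of one address then map to a single token, e.g. url:192.168.1.1, so training on one form catches them all.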

>7. The message is in html format in which the body part is empty, but the 
>body tag contains an 'OnLoad' attribute which references a piece of 
>JavaScript that writes into the body. Worse yet, the message is encoded 
>within the JavaScript, and decoded when written.

Sounds like the appearance of javascript will be a tip-off, though we may 
need to look inside html comments to spot it.


>8. Replace some letters with numbers
>
>9. Use accented characters.

These just add tokens to the wordlists.  Doesn't seem very exciting.


>10. Add large random words, apparently designed to throw off
>     message checksums for techniques like Vipul's Razor

Previous comments on decoy words apply here.

>We can deal with some of these by
>1. parsing the html into plain text just like 'lynx -dump' would do
>2. converting all text into a canonical format. This will foil charset 
>trickery
>3. If there are alternative formats, compare them for similarity, for some 
>definition of similar

I can see that handling html is going to be an ongoing effort.  Html 
makes so many things possible that whatever solution exists now will soon 
be obsolete.

>Paul Graham spoke about what he's learned since the paper. His points include:
>
>1. use header tags to increase accuracy. "Subject:spam Word" are stored as 
>"Subject:spam" and "Subject:Word".
>2. Maintain the original case (see above).
>3. Parse html elements for attributes.
>4. Don't exclude [^[:alnum:]] characters. This has the interesting result 
>that "spam!!" and "spam!!!" are different words. A suggestion for unknown 
>words is to look for the closest match, so "spam!!!!" will return the count 
>for "spam!!!"
>
>We could extend #4 by storing a word twice: once in its original case, with 
>all characters, and once in canonical format (lower case, alphanumeric); 
>if a word is not found, look for its canonical representation. A further 
>extension, solving Graham-Cumming's #5, is to store word phrases as well 
>as words.  Essentially, every n-word sequence is stored as a single database 
>entry. Yes, this increases database size.

Header tags were suggested months ago.  I'm in favor...

I need to think about maintaining case.  I'm not sure if it's useful.

Talking of canonical word forms reminds me of soundex encoding.  An idea 
to think about.
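Gyepi's double-storage idea can be sketched like this.  The names are mine; a real implementation would sit on bogofilter's wordlist database rather than a dict:

```python
import re

def canonical(word):
    """Canonical form: lower case, alphanumerics only ("Spam!!!" -> "spam")."""
    return re.sub(r'[^a-z0-9]', '', word.lower())

class WordList:
    """Toy wordlist storing each token under its exact and canonical forms."""
    def __init__(self):
        self.counts = {}

    def train(self, word, n=1):
        for key in {word, canonical(word)}:   # set avoids double-counting
            self.counts[key] = self.counts.get(key, 0) + n

    def lookup(self, word):
        # exact form first; fall back to the canonical representation
        exact = self.counts.get(word)
        return exact if exact is not None else self.counts.get(canonical(word), 0)
```

The fallback lookup is what gives a never-seen variant like "SPAM!" the count accumulated under "spam", at the price of roughly doubling the entries stored.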


>As an optimization, we could stem the canonical copies too. This may be a 
>problem for non English messages though.
>
>Conclusions:
>
>1. There are too many people working on different versions of statistical 
>filters and IMNSHO, too much overlap and wasted effort.
>
>2. Statistical filtering will catch a lot of spam, but not all. To be 
>truly successful, one needs a layered approach where statistics comprise 
>only one component. As a framework, SpamAssassin comes closest to this.
>
>3. It has become clear that as the number and variety of users increase, 
>the spamicity counts decrease, since different people have different levels 
>of tolerance for spam. Our approach, therefore, will be less effective for 
>large sites unless we create segmented databases of some sort.

The code for using multiple wordlists, i.e. combining scores from system-
level wordlists and user wordlists, works and is in cvs.
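For illustration, the simplest combining policy just adds a token's counts across lists.  That additive rule is an assumption made for this sketch, not necessarily what the cvs code does:

```python
def combined_counts(token, wordlists):
    """Sum a token's (spam_count, ham_count) over system- and user-level
    wordlists, each represented here as a dict: token -> (spam, ham)."""
    spam = ham = 0
    for wl in wordlists:
        s, h = wl.get(token, (0, 0))
        spam += s
        ham += h
    return spam, ham
```

A per-user list layered on a shared system list lets each user's training shift scores toward their own tolerance, which addresses the segmentation worry in conclusion #3.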

As I write all these "it's in cvs" comments, I realize that it has been 
quite a while since 0.9.1.2 was released and that bogofilter has lots of 
new features in cvs.  Adding these two facts together, the sum looks like 
"Release time!" with the answer being "next week".

David

More information about the Bogofilter mailing list