spam conference report

Nick Simicich njs at scifi.squawk.com
Sun Jan 19 03:06:29 CET 2003


At 05:49 PM 2003-01-18 -0500, Gyepi SAM wrote:
[...]
>Things we should consider doing.
>
>Bill Yerazunis talked about the CRM114 Discriminator (crm114.sf.net) which,
>among other things, counts multiple words in addition to single words. He 
>claims
>that this increases accuracy and Paul Graham later remarked that he might 
>add that feature to his filter. I think this might even help with the 
>HTML-comment-separating-word-components problem, though there is a better way.

http://web.mit.edu/webcast/spamconf03/mit-spamconf-s1-26100-17jan03-0900-80k.ram 
- his talk starts at 12 minutes in, and the whole session is worth listening 
to. His general design is to look not only at individual words but also at 
phrases, and at phrases with missing words.  The phrases are all hashed: the 
words themselves are not saved, only the hashes, which keeps each entry 
small, but the database ends up larger overall because there are far more 
items to save (words, phrases, and phrase variants).
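
To make the phrase idea concrete, here is a rough sketch in Python (my own 
illustration, not CRM114's actual code or hash function) of generating hashed 
features from a sliding window of words, including variants with interior 
words skipped:

    import zlib
    from itertools import combinations

    def phrase_features(text, window=5):
        # Hash single words and in-order word combinations drawn from a
        # sliding window, so phrases with "missing" words are represented
        # too.  Only the hash is kept, never the text.
        words = text.split()
        features = set()
        for start in range(len(words)):
            chunk = words[start:start + window]
            # every in-order subset of the window that keeps the leading word
            for size in range(1, len(chunk) + 1):
                for rest in combinations(chunk[1:], size - 1):
                    phrase = " ".join((chunk[0],) + rest)
                    features.add(zlib.crc32(phrase.encode("utf-8")))
        return features

    print(len(phrase_features("click here to buy cheap pills now")))

A real filter would use a better hash and a fixed-size table; the point is 
how quickly the feature set grows.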

He will end up with significantly more "characteristics" than he has spam 
messages to start with.  In general he trains only on errors.  He believes 
you have to be up in the 99%+ accuracy range to make spam uneconomical, and 
that is his goal.

He handles both base64 and HTML comments in spam by decoding the base64 and 
stripping the comments out of the HTML, then appending the decoded text or 
the stripped HTML back to the main section.
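
A minimal sketch of that normalization step (my own rough approximation, not 
his code): decode base64 bodies, strip HTML comments, and append whatever was 
recovered so the tokenizer sees it too:

    import base64
    import re

    def normalize(body, is_base64=False):
        # Recover text hidden behind base64 or HTML comments and append it.
        extra = []
        if is_base64:
            try:
                extra.append(base64.b64decode(body).decode("utf-8", "replace"))
            except ValueError:
                pass                      # not valid base64 after all
        # "via<!-- duh -->gra" becomes "viagra" once comments are dropped
        stripped = re.sub(r"<!--.*?-->", "", body, flags=re.DOTALL)
        if stripped != body:
            extra.append(stripped)
        return "\n".join([body] + extra)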

He claims he caught *everything* except the time traveler spam and one 
other with his latest code --- and also that the phrase training eliminates 
the "middle of the road message".  Most of his messages classify way into 
the spam section or way into the non-spam section.

Frankly, I want to try installing CRM114 and seeing if it is that much more 
accurate, and if it will run on my small iron.

>John Graham-Cumming, creator of POPFile (popfile.sf.net) talked about some 
>of the tricks that spammers use to hide from filters. They include:
>
>1. The message consists of images and urls and no text. I have received 
>spam of this sort myself, and IIRC, bogofilter did not catch it.
>
>2. Using decoy keywords in an 'X-*' email header or html keywords section.
>The intent here is to throw off our spamicity counts.
>
>3. Using html comments to break up words. This we have seen.
>
>4. Place the message in multiple horizontally arranged html tables. John 
>characterized this as 'dastardly' and I have to agree.
>
>5. The message is in MIME multipart/alternative format with text and html 
>sections. The text contains a very large, innocuous document while the 
>html contains the spam.  Obviously the spammer is counting on the MUA 
>displaying the html rather than the text. My initial solution for this is 
>to score the parts separately and compare the scores. More on that later.
>
>6. Encode URLs in decimal, hex or octal format and, optionally, add some 
>multiple of 256 to the result, since popular clients perform base-256 
>arithmetic on IP addresses.
>
>7. the message is in html format in which the body part is empty, but the 
>body tag contains an 'OnLoad' attribute which references a piece of 
>JavaScript that writes into the body. Worse yet, the message is encoded 
>within the javascript,  and decoded when written.
>
>8. Replace some letters with numbers
>
>9. Use accented characters.
>
>10. Add large random words, apparently designed to throw off
>     message checksums for techniques like Vipul's Razor
>
>
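
Trick 6 above is concrete enough to sketch in code (my own illustration, not 
anything shown at the conference): canonicalize a numerically encoded IP host 
back to dotted-quad form, discarding the padding that clients silently 
reduce away.

    def canonical_ip(token):
        # Turn a decimal/hex/octal "dword" host such as 3232235777 or
        # 0xC0A80101 back into dotted-quad form.  Clients reduce the value
        # mod 2**32, so spammers can pad it with extra multiples of 2**32.
        # Sketch only; per-octet forms like 0xC0.0xA8.1.1 need more work.
        try:
            if token.startswith("0") and token.isdigit() and token != "0":
                value = int(token, 8)          # legacy leading-zero octal
            else:
                value = int(token, 0)          # decimal, 0x... hex, 0o... octal
        except ValueError:
            return token                       # not a numeric host; leave it alone
        value %= 2 ** 32                       # discard the padding
        return ".".join(str((value >> s) & 0xFF) for s in (24, 16, 8, 0))

    # canonical_ip("3232235777") == canonical_ip("0xC0A80101") == "192.168.1.1"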
>We can deal with some of these by
>1. parsing the html into plain text just like 'lynx -dump' would do

Which was basically what he came up with... To deal with HTML, you have to 
work in what they called "eye-space" as opposed to "ascii-space".  Spammers 
are exploiting the differences, and HTML makes those differences rampant.  
For example, "via<!-- duh -->gra", "via<B></b>gra" and "via<font 
color=red></font>gra" all render identically.
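
A rough sketch of that eye-space reduction (mine, not code from any of the 
talks): render the HTML down to the text a reader actually sees before 
tokenizing, so all three spellings above collapse to the same token.

    from html.parser import HTMLParser

    class EyeSpace(HTMLParser):
        # Collect only the visible text; tags and comments disappear.
        def __init__(self):
            super().__init__()
            self.pieces = []
        def handle_data(self, data):
            self.pieces.append(data)

    def eye_space(html):
        parser = EyeSpace()
        parser.feed(html)
        parser.close()
        return "".join(parser.pieces)

    for trick in ('via<!-- duh -->gra', 'via<B></b>gra',
                  'via<font color=red></font>gra'):
        assert eye_space(trick) == "viagra"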

>2. converting all text into a canonical format. This will foil charset 
>trickery
>3. If there are alternative formats, compare them for similarity for some 
>definition of similar
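
Applied to the multipart/alternative trick (#5 above), that comparison might 
look something like this rough sketch (the threshold and details are my own 
arbitrary choices, not anything proposed at the conference):

    import difflib
    import re

    def parts_agree(text_part, html_part, threshold=0.5):
        # Compare the text/plain part with the visible text of the HTML
        # part; wildly different parts deserve suspicion.
        visible = re.sub(r"<!--.*?-->|<[^>]*>", " ", html_part, flags=re.DOTALL)
        ratio = difflib.SequenceMatcher(None,
                                        text_part.lower().split(),
                                        visible.lower().split()).ratio()
        return ratio >= threshold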
>
>Paul Graham spoke about what he's learned since the paper.

I do not recall this being in the video.

>They include:
>
>1. Use header tags to increase accuracy: a subject of "spam Word" is stored 
>as "Subject:spam" and "Subject:Word".
>2. Maintain the original case (see above).
>3. Parse html elements for attributes.
>4. Don't exclude [^[:alnum:]] characters. This has the interesting result 
>that "spam!!" and "spam!!!" are different words. A suggestion for unknown 
>words is to look for the closest match so "spam!!!!" will return the count 
>for "spam!!!"
>
>We could extend #4 by storing a word twice: once in original case, with 
>all characters and once in canonical format (lower case, alphanumeric) and 
>if a word is not found, look for its canonical representation. A further 
>extension, solving Graham-Cumming's #5, is to store word phrases as well 
>as words.  Essentially, every n words are stored as a single database 
>entry. Yes, this increases database size.

The CRM114 discriminator solved this by never storing words or phrases as 
text.  Every word or phrase, no matter how long, is stored as a checksum.  
It does make the database larger, simply because there are more elements.
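
For what it's worth, here is a small sketch of the tokenization suggestions 
quoted above (my own reading of them, not Graham's code and not CRM114's): 
keep the header prefix, case, and punctuation, and fall back to the closest 
known variant for an unknown token by trimming trailing punctuation.

    import re

    def tokens(header_name, text):
        # Emit tokens prefixed with their header, e.g. "Subject:FREE!!!",
        # preserving case and punctuation.
        for word in re.findall(r"\S+", text):
            yield f"{header_name}:{word}"

    def lookup(token, counts):
        # Exact match first; otherwise trim trailing punctuation one
        # character at a time, so "spam!!!!" can reuse the count for "spam!!!".
        while token:
            if token in counts:
                return counts[token]
            if token[-1].isalnum():
                break                      # nothing non-alphanumeric left to trim
            token = token[:-1]
        return None

    # counts = {"Subject:spam!!!": 7}
    # lookup("Subject:spam!!!!", counts)  -> 7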

>As an optimization, we could stem the canonical copies too. This may be a 
>problem for non-English messages though.

Lots of people were basically hand-waving about non-English handling.  
Perhaps the POPFile people had the most non-English users.

>Conclusions:
>
>1. There are too many people working on different versions of statistical 
>filters and IMNSHO, too much overlap and wasted effort.

Of the few pitches I saw, I did not think so.  They had remarkably 
different approaches.  Making them work together on one approach is 
probably a mistake at this point; it is just too early.  There will be 
tools that fall by the wayside, tools whose approaches fail, and I thought 
this would be one of them. But, as an example, Bill Yerazunis's approach 
was so different from the one that bogofilter uses that I can't imagine 
them being co-developed.

>2. Statistical filtering will catch a lot of spam, but not all. To be 
>truly successful, one needs a layered approach where statistics only 
>comprise a component. As a framework, SpamAssassin comes closest to this.

There was a warning given on this: if the whole world implements Bayesian 
filters, there will be a superspam (like the next time-travel spam) that 
gets past everything, and then everyone will learn from that spam.  
Homogeneity in spam filtering is bad.

>3. It has become clear that as the number and variety of users increase, 
>the spamicity counts decrease since different people have different levels 
>of tolerance for spam. Our approach, therefore, will be less effective for 
>large sites unless we create segmented databases of some sort.

I think that the big advantage bogofilter has is speed.  One person 
quoted "40k a second filtering, 20k/second training" --- and that was on a 
much faster machine than the P-90 I am using. The POPFile people were 
depending on individual users. I think bogofilter could run as a filter 
for a whole ISP (I am not sure how you would express its filtering in 
those terms, message size per second for filtering and training), but I 
think it is much faster than their system was on your canonical 1 GHz iron.

--
If you doubt that magnet therapy works, I put to you this observation: When 
refrigerators were first invented, in the 1940s, they were rather 
unreliable, but then they became significantly more reliable. The basic 
design of the refrigerator did not change, and we all know that quality was 
important back then, so I doubt that newer refrigerators are made better. 
Refrigerators have become more reliable because of the rise of the 
refrigerator magnet.
Nick Simicich - njs at scifi.squawk.com


