JavaScript (was: Re: What's that?)

Thu Oct 14 21:11:32 CEST 2004

A few late comments... (I wrote them a long time ago but forgot to send
them. Oops.)

On Sat, 11 Sep 2004, David Relson wrote:

> What I had in mind was a bit simpler.  At present, bogofilter ignores
> most html tags.  It parses "a", "img", and "font" tags so that the
> tokens within them go into scoring the message.  I'm thinking of adding
> "script" to that list, so that "JScript.Encode" and other such script
> info would be included in the scoring of the message.  It'd be a small
> change and would help deal with this level of obfuscation.

Have you considered generating tokens from the tags themselves?

Moreover, it might be worth adding more tags to the list (for instance,
<form> appears to be quite frequent in spams) and distinguishing attribute
names (or perhaps even values?) from ordinary words (i.e. <img src=...>
would generate "html:src" (or even "img:src" or "html:img:src"), and
"html:img" (for the tag itself), rather than "src").
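To make the idea concrete, here is a rough sketch in Python (not
bogofilter's actual lexer, which is written in C). It emits "html:TAG" for
each interesting tag and "html:TAG:ATTR" for each attribute name. The names
INTERESTING_TAGS, TagTokenizer and tag_tokens are made up for this
illustration, and the "html:" prefix is just one of the variants suggested
above.

```python
from html.parser import HTMLParser

# Tags whose names and attribute names become tokens. "a", "img" and "font"
# are the tags mentioned as already parsed; "script" and "form" are the
# proposed additions. This set is an assumption for the sketch.
INTERESTING_TAGS = {"a", "img", "font", "script", "form"}


class TagTokenizer(HTMLParser):
    """Collect prefixed tokens for tag names and attribute names."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        if tag not in INTERESTING_TAGS:
            return
        self.tokens.append("html:" + tag)
        for name, _value in attrs:
            # Only the attribute name is tokenized; values are ignored here.
            self.tokens.append("html:" + tag + ":" + name)


def tag_tokens(html_text):
    """Return the list of tag/attribute tokens found in html_text."""
    parser = TagTokenizer()
    parser.feed(html_text)
    parser.close()
    return parser.tokens
```

With this, an <img src=...> tag yields "html:img" and "html:img:src"
instead of a bare "src", so the attribute name can no longer collide with
the same word appearing in ordinary text.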

> I'm not even considering about decoding JScript.Encode (as is presently
> done with base64, uuencode, QP, etc).

Good. :)

On Mon, 13 Sep 2004, Tom Anderson wrote:

> That might be an interesting prefilter to bogofilter... a program that
> interprets the javascript and passes along a static version of the
> result to bogofilter for scoring.  I'm sure such a program could make
> liberal use of the Mozilla engine to do so.

The bad news is that

1. you will lose one of the greatest strengths of Bogofilter
   (i.e. its efficiency),

2. you will fight a battle you cannot win because spammers will
   learn to use the most bizarre tricks, including any idiosyncratic
   features/bugs of MSIE/MSOE, to confuse your interpreter.

And you can also introduce an HTML renderer to deobfuscate obfuscated HTML
(white text on white background, very small text...) and OCR to extract
text from images. And some kind of artificial intelligence to filter out
irrelevant text (e.g. jokes; I get quite a lot of them...and some of them
are rather good ;>).  In fact you should do these two things before you
start interpreting JS because HTML and images are abused by spammers now
while JS-obfuscation is still very rare (afaik; anyway, I am sure this is
going to change sooner or later).

> This is clearly not a rule.  Plenty of unsophisticated end-users take 
> advantage of their email clients' built-in abilities to send dynamic 
> messages. 

I have never seen such a ham. Nevertheless, if people send them, they
(well, some of them, those with the slightest bit of a clue) are going
to learn to avoid them sooner or later when the next big worm spreads via
email-borne JavaScript (or even "encoded" JS) and infects every luser out
there. The big corporations will start filtering anything looking like JS
from their email after such an incident and this will be the end of
"dynamic messages".

> Javascript is NOT necessarily a spam indicator.  It would depend
> entirely on your training corpus.

Yes, I agree. It should depend on one's corpus. Nevertheless, I believe it
would be pretty difficult to find a corpus where "JS enabled" spams do not
outnumber "JS enabled" hams by at least an order of magnitude.
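Anyone who wants to check this against their own corpus could start from
something like the following Python sketch. It simply measures the fraction
of messages containing a <script> tag. The function name script_fraction is
made up for this illustration, and it assumes the message bodies have
already been decoded (base64, QP, ...) and are passed in as strings.

```python
import re

# Matches the start of a <script ...> tag, case-insensitively.
SCRIPT_TAG = re.compile(r"<\s*script\b", re.IGNORECASE)


def script_fraction(messages):
    """Return the fraction of messages that contain a <script> tag."""
    if not messages:
        return 0.0
    hits = sum(1 for body in messages if SCRIPT_TAG.search(body))
    return hits / len(messages)
```

Running it once on the spam messages and once on the ham messages, and
comparing the two fractions, shows how lopsided the ratio really is for
that particular corpus.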

--Pavel Kankovsky aka Peak  [ Boycott Microsoft--http://www.vcnet.com/bms ]
"Resistance is futile. Open your source code and prepare for assimilation."