question about new spam encoding

Wed Nov 19 23:15:23 CET 2003

Trevor Harrison wrote:
> I just ran into a spam encoding that I haven't seen before.  In a 
> text/html message, instead of "text", they put
> "&#116;&#101;&#120;&#116;".
> 
> Running thru bogolexer, all I'm seeing is the header tokens and some 
> nbsp's, but no &#123;'s.

Since the body part is text/html, bogofilter should be decoding all
the &#123; stuff into text, and tokenizing based on that.  And for
me, with version 0.15.8, that is what is happening.  In fact, the
message gets detected as spam (spamicity=0.948258).  A sample of the
body tokens found:

   get_token: 1 "Refinance"
   get_token: 1 "today"
   get_token: 1 "low"
   get_token: 1 "Save"
   get_token: 1 "thousands"
   get_token: 1 "nbsp"
   get_token: 1 "dollars"
    ...etc...
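
As a rough illustration of the decode-then-tokenize step described
above, here is a small Python sketch.  It is not bogofilter's actual
lexer; the sample string and the function name decode_and_tokenize are
made up for the example, and the word pattern is a simplifying
assumption.

```python
import html
import re

# Hypothetical sample body: the numeric character references spell
# "Refinance today", separated by a named &nbsp; entity.
SAMPLE = (
    "&#82;&#101;&#102;&#105;&#110;&#97;&#110;&#99;&#101;"
    "&nbsp;"
    "&#116;&#111;&#100;&#97;&#121;"
)


def decode_and_tokenize(body):
    """Decode HTML character references, then split into word tokens."""
    # html.unescape turns both numeric (&#116;) and named (&nbsp;)
    # references into the characters they stand for.
    text = html.unescape(body)
    # Assumption: a token is a run of ASCII letters or digits.
    return re.findall(r"[A-Za-z0-9]+", text)


print(decode_and_tokenize(SAMPLE))
```

Note one difference from the output above: html.unescape also converts
&nbsp; into a non-breaking space, so "nbsp" does not show up as a token
in this sketch.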

However, I did notice two unexpected things.  There's an http URL in
the body, http://www.quick-home-loan-search.biz/, which does not
get tokenized.

Also, an IP address from the header (200.59.68.139, in a Received: line)
doesn't get tagged with rcvd: or head:

   get_token: 5 "200.59.68.139"

Is that to be expected?

-Matt