question about new spam encoding

Wed Nov 19 23:02:32 CET 2003

On Wed, 19 Nov 2003 16:38:12 -0500
Trevor Harrison <trevor-bogofilter at harrison.org> wrote:

> I just ran into a spam encoding that I haven't seen before.  In a 
> text/html message, instead of "text", they put
> &#116;&#101;&#120;&#116;
> 
> Running thru bogolexer, all I'm seeing is the header tokens and some 
> nbsp's, but no &#123;'s.  I'm guessing they are considered individual 
> tokens and are too short or something.
> 
> The message is here: http://www.harrison.org/~trevor/spam1.txt
> 
> 
> -Trevor
> 

Trevor,

Decoding of escaped HTML characters was added to bogofilter on
2003-10-06.  It's in 0.15.7 and 0.15.8.

With a current version of bogofilter, the encoded characters decode
correctly as "text".  Older versions ignore them: once the special
characters are removed, only numbers are left, and bogofilter doesn't
convert numbers to tokens.

Take the attached file, msg.html.1119.txt, and run the command
"bogolexer -p < msg.html.1119.txt".  You should see "text" in the output.
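
To illustrate the difference between the two behaviours, here is a rough
Python sketch.  It is not bogofilter's code; the function names and the
regular expressions are made up for the example, and it only handles
decimal references of the form &#NNN;.

```python
import re

# Matches decimal numeric character references such as "&#116;".
NUMERIC_REF = re.compile(r"&#(\d+);")


def decode_numeric_refs(text):
    """Replace each decimal reference with the character it encodes."""
    def replace(match):
        code = int(match.group(1))
        # Leave out-of-range values untouched instead of raising.
        if code > 0x10FFFF:
            return match.group(0)
        return chr(code)
    return NUMERIC_REF.sub(replace, text)


def strip_special(text):
    """Hypothetical stand-in for the older behaviour: drop '&', '#'
    and ';' and split on whitespace, leaving only bare numbers."""
    return re.sub(r"[&#;]", " ", text).split()


encoded = "&#116;&#101;&#120;&#116;"
print(decode_numeric_refs(encoded))  # text
print(strip_special(encoded))        # ['116', '101', '120', '116']
```

The second result shows why nothing useful reaches the token list
without decoding: every piece is a number, and numbers are not turned
into tokens.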

David