obscured URL not being tokenized

David Relson relson at osagesoftware.com
Sat Dec 20 19:17:41 CET 2003


On Sat, 20 Dec 2003 08:16:29 -0800
Dan Singletary <dvsing at sonicspike.net> wrote:

> The following text:
> 
> <a 
> href="http://%322%31.2%332.%316%30.1%305/%7a/s%69l%76e%72/f%61r%6d/i%
> 6ed%65x.%68t%6dl"><img border="0" 
> src="http://%322%31.2%332.%316%30.1%305/%7a/s%69l%76e%72/f%61r%6d/e%6
> et.%6ap%67" width="500" height="300"></a>

Dan,

The %xx encodings (two hex digits per escaped character) are probably
easy to deal with.  I'll take a look at the code.
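
As a rough illustration of what decoding involves (a Python sketch, not
bogofilter's lexer code), the example below undoes the %xx escapes in the
href from the quoted message and then looks for a dotted-quad address.
The names decode_url, find_ips, and DOTTED_QUAD are hypothetical, and the
URL has had the mail client's line wrap rejoined.

```python
import re
from urllib.parse import unquote

# The href from the quoted message, with the line wrap rejoined.
OBSCURED = ("http://%322%31.2%332.%316%30.1%305/%7a/s%69l%76e%72/"
            "f%61r%6d/i%6ed%65x.%68t%6dl")

# Hypothetical pattern for a dotted-quad host. It is deliberately loose:
# it does not check that each part is in the range 0..255.
DOTTED_QUAD = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def decode_url(url: str) -> str:
    """Replace each %xx escape with the character it encodes."""
    return unquote(url)


def find_ips(text: str) -> list:
    """Return dotted-quad strings found after percent-decoding."""
    return DOTTED_QUAD.findall(decode_url(text))


if __name__ == "__main__":
    print(decode_url(OBSCURED))
    print(find_ips(OBSCURED))
```

Run against the raw text, the same pattern finds nothing, which is
consistent with the addresses not being recognized until the escapes
are decoded.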

> The obscured IP addresses in the href and src attributes of the a and img
> tags aren't being parsed and identified as IP addresses.  These are
> very characteristic of the spam.  Also, this particular spam note
> takes advantage of a trick I've seen used many times, and it occurs in
> a lot of my 'missed' spam: using white text on a white background full
> of paragraphs and paragraphs of irrelevant text in order to confuse
> the filter.  I've mentioned it before, but there should be some way to
> tell bogofilter to ignore text that is the same color as its
> background.  I know this would require more interpretation of the HTML,
> and I'm not sure how much more code there would need to be for this.
> I've attached the entire offending email for your reference.

Dealing with color is more challenging.  Knowing that "#FFFFFF" means
"white" and "#000000" means "black" is relatively easy.  More difficult
is that "#FEFFFF", "#FFFEFF", "#FFFFFE", and "#FEFEFE" are (for all
intents and purposes) the same as "#FFFFFF".  However, "#808080" is
clearly different.  To do the job "right" calls for recognizing colors
and judging sameness -- not trivial.
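
One possible way to judge sameness, sketched here in Python purely as an
illustration, is to treat each "#RRGGBB" value as a point in RGB space
and call two colors the same when they are closer than some threshold.
The function names and the threshold of 16 are assumptions chosen for
the example, not anything bogofilter does.

```python
import math


def parse_hex_color(value: str) -> tuple:
    """Turn '#RRGGBB' into an (r, g, b) tuple of ints in 0..255."""
    digits = value.lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected #RRGGBB, got {value!r}")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


def color_distance(a: str, b: str) -> float:
    """Euclidean distance between two colors in RGB space."""
    return math.dist(parse_hex_color(a), parse_hex_color(b))


# Assumed cutoff, picked only for illustration; a real filter would
# need to tune it.
THRESHOLD = 16.0


def looks_same(fg: str, bg: str, threshold: float = THRESHOLD) -> bool:
    """True when the two colors are within the threshold of each other."""
    return color_distance(fg, bg) <= threshold


if __name__ == "__main__":
    for fg in ("#FEFFFF", "#FFFEFF", "#FFFFFE", "#FEFEFE", "#808080"):
        print(fg, looks_same(fg, "#FFFFFF"))
```

This covers only six-digit hex values.  Named colors, three-digit hex,
and colors set through stylesheets would need extra handling, and plain
RGB distance is only a crude stand-in for how similar two colors look.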

I make no promises, but perhaps one day ...

David



