obscured URL not being tokenized

Dan Singletary dvsing at sonicspike.net
Sat Dec 20 17:16:29 CET 2003


The following text:

<a 
href="http://%322%31.2%332.%316%30.1%305/%7a/s%69l%76e%72/f%61r%6d/i%6ed%65x.%68t%6dl">
<img border="0" 
src="http://%322%31.2%332.%316%30.1%305/%7a/s%69l%76e%72/f%61r%6d/e%6et.%6ap%67" 
width="500" height="300"></a>

Will get tokenized as:
head
href
head
http
head
img
head
border
head
src
head
http
head
width
head
height
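
For comparison, here is a minimal sketch (in Python, not bogofilter's
own code) of what percent-decoding the href before tokenizing would
expose. The helper name decoded_ip_host is hypothetical, and the sketch
assumes it is acceptable to decode the whole URL before splitting it,
which would not be safe for URLs that encode delimiters such as "/".

```python
import ipaddress
from urllib.parse import unquote, urlsplit


def decoded_ip_host(url):
    """Return the host if the percent-decoded URL uses a literal IP, else None.

    Hypothetical helper for illustration only.
    """
    host = urlsplit(unquote(url)).hostname
    if host is None:
        return None
    try:
        # ip_address() raises ValueError for anything that is not a literal IP.
        return str(ipaddress.ip_address(host))
    except ValueError:
        return None


href = ("http://%322%31.2%332.%316%30.1%305"
        "/%7a/s%69l%76e%72/f%61r%6d/i%6ed%65x.%68t%6dl")

print(unquote(href))          # http://221.232.160.105/z/silver/farm/index.html
print(decoded_ip_host(href))  # 221.232.160.105
```

With the URL decoded first, a tokenizer would have both the dotted-quad
host and the path words available as tokens.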

The obscured IP addresses in the href and src attributes of the a and
img tags aren't being parsed and identified as IP addresses.  These are
very characteristic of spam.  Also, this particular spam takes advantage
of a trick I've seen used many times, and which occurs in a lot of my
'missed' spam: paragraphs and paragraphs of irrelevant text, set in
white on a white background, in order to confuse the filter.  I've
mentioned it before, but there should be some way to tell bogofilter to
ignore text that is the same color as its background.  I know this would
require more interpretation of the HTML, and I'm not sure how much more
code that would take.  I've attached the entire offending email for your
reference.
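
To give a rough idea of the extra HTML interpretation involved, here is
a minimal sketch in Python (not bogofilter code). It assumes the hidden
text is marked up with <font color=...> and the background is set with
<body bgcolor=...>; spam that uses CSS, tables, or named colors would
need more work. The class name VisibleTextExtractor is hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text, skipping text whose <font color> equals the <body bgcolor>."""

    def __init__(self):
        super().__init__()
        self.background = "#ffffff"  # assumed default when no bgcolor is given
        self.font_colors = []        # one entry per open <font>; None if no color
        self.visible = []

    @staticmethod
    def _norm(color):
        # Only case and whitespace are normalized; "white" != "#ffffff" here.
        return color.strip().lower() if color else None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "body" and attrs.get("bgcolor"):
            self.background = self._norm(attrs["bgcolor"])
        elif tag == "font":
            self.font_colors.append(self._norm(attrs.get("color")))

    def handle_endtag(self, tag):
        if tag == "font" and self.font_colors:
            self.font_colors.pop()

    def handle_data(self, data):
        # The innermost <font> that sets a color decides the text color.
        current = next((c for c in reversed(self.font_colors) if c), None)
        if current != self.background and data.strip():
            self.visible.append(data.strip())


sample = ('<body bgcolor="#FFFFFF"><font color="#ffffff">random filler '
          'words</font><p>Buy now</p></body>')
extractor = VisibleTextExtractor()
extractor.feed(sample)
extractor.close()
print(extractor.visible)  # ['Buy now']
```

Only the text that survives this pass would then be handed to the
tokenizer.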

-Dan
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: spam.txt
URL: <http://www.bogofilter.org/pipermail/bogofilter/attachments/20031220/ac07d89b/attachment.txt>

