obscured URL not being tokenized
Dan Singletary
dvsing at sonicspike.net
Sat Dec 20 17:16:29 CET 2003
The following text:
<a
href="http://%322%31.2%332.%316%30.1%305/%7a/s%69l%76e%72/f%61r%6d/i%6ed%65x.%68t%6dl">
<img border="0"
src="http://%322%31.2%332.%316%30.1%305/%7a/s%69l%76e%72/f%61r%6d/e%6et.%6ap%67"
width="500" height="300"></a>
Will get tokenized as:
head
href
head
http
head
img
head
border
head
src
head
http
head
width
head
height
The obscured IP addresses in the href and src attributes of the a and img
tags aren't being parsed and identified as IP addresses. These are very
characteristic of spam. Also, this particular spam takes
advantage of a trick I've seen used many times, and it occurs in a lot of
my 'missed' spam: using white text on a white background, with
paragraphs and paragraphs of irrelevant text, in order to confuse the
filter. I've mentioned it before, but there should be some way to tell
bogofilter to ignore text that is the same color as its background. I
know this would require more interpretation of the HTML, and I'm not
sure how much more code that would take. I've attached
the entire offending email for your reference.
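
To illustrate the decoding step I have in mind for the obscured addresses, here is a rough Python sketch. It is not bogofilter code, and the function name deobfuscate_url is made up for this example. It assumes that simply percent-decoding the URL before tokenizing is enough to expose the host, which holds for the sample above but may not cover every obfuscation trick.

```python
import re
from urllib.parse import unquote, urlsplit

# Dotted-quad pattern; the range check on each octet is done separately.
IPV4_RE = re.compile(r"^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$")


def deobfuscate_url(url):
    """Percent-decode a URL and report whether its host is an IPv4 address.

    Hypothetical helper for illustration only.
    Returns (decoded_url, host, host_is_ipv4).
    """
    decoded = unquote(url)                     # "%32" -> "2", "%7a" -> "z", ...
    host = urlsplit(decoded).hostname or ""    # host part of the decoded URL
    match = IPV4_RE.match(host)
    host_is_ipv4 = bool(match) and all(int(part) <= 255 for part in match.groups())
    return decoded, host, host_is_ipv4


# The href from the spam quoted above.
url = ("http://%322%31.2%332.%316%30.1%305/"
       "%7a/s%69l%76e%72/f%61r%6d/i%6ed%65x.%68t%6dl")

decoded, host, host_is_ipv4 = deobfuscate_url(url)
print(decoded)             # http://221.232.160.105/z/silver/farm/index.html
print(host, host_is_ipv4)  # 221.232.160.105 True
```

With the URL decoded like this, a tokenizer would at least have the chance to see 221.232.160.105 as an IP address instead of stopping at "http".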
-Dan
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: spam.txt
URL: <http://www.bogofilter.org/pipermail/bogofilter/attachments/20031220/ac07d89b/attachment.txt>