html parsing question
David Relson
relson at osagesoftware.com
Mon Jul 21 21:25:49 CEST 2003
Greetings,
Today when looking at my wordlists, I noticed a lot of long tokens with
counts of 1. Looking further, I noticed that a message like:
    Content-Type: text/html

    <a href="http://cl.com.com/Click?q=30-u8NEIRrULwxBzvvqdPBZGbLc-8cR">
parses as:
    Content-Type
    text
    html
    href
    http
    cl.com.com
    Click
    u8NEIRrULwxBzvvqdPBZGbLc-8cR
I'm wondering whether URL parsing should:
1 - include only the domain info, e.g. up through cl.com.com/
2 - exclude parameters (anything after a '?')
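
To make the two options concrete, here is a rough sketch in Python (not bogofilter's actual lexer, which works differently). The function names url_domain_only and url_without_params are made up for illustration, and the sketch assumes the URL has already been pulled out of the href attribute.

```python
from urllib.parse import urlsplit


def url_domain_only(url):
    """Option 1 (hypothetical helper): keep only scheme and host."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"


def url_without_params(url):
    """Option 2 (hypothetical helper): drop the '?' and everything after it."""
    return url.split("?", 1)[0]


if __name__ == "__main__":
    url = "http://cl.com.com/Click?q=30-u8NEIRrULwxBzvvqdPBZGbLc-8cR"
    print(url_domain_only(url))     # http://cl.com.com/
    print(url_without_params(url))  # http://cl.com.com/Click
```

Either way, the long one-off token after the '?' would no longer reach the wordlist; option 1 would also drop the "Click" path component.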
What do y'all think?
David