html parsing question

David Relson relson at osagesoftware.com
Mon Jul 21 21:25:49 CEST 2003


Greetings,

Today, while looking at my wordlists, I noticed a lot of long tokens with
counts of 1.  Looking further, I found that a message like:

Content-Type: text/html

<a href="http://cl.com.com/Click?q=30-u8NEIRrULwxBzvvqdPBZGbLc-8cR">

parses as:

Content-Type
text
html
href
http
cl.com.com
Click
u8NEIRrULwxBzvvqdPBZGbLc-8cR

I'm wondering if parsing of URLs should:

1 - include only the domain info, e.g. up through cl.com.com/
2 - exclude parameters (anything after a '?')
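
To make the two options concrete, here is a rough sketch in Python. It is
not bogofilter's lexer code; the function name trim_url and its keep
argument are made up for illustration, and it only shows which part of the
URL would be handed on for tokenizing under each option.

```python
from urllib.parse import urlsplit


def trim_url(url, keep="no_query"):
    """Return the part of a URL that would still be tokenized.

    Hypothetical helper for illustration only:
      keep="domain"   -> option 1: the host name only
      keep="no_query" -> option 2: host plus path, nothing after '?'
    """
    parts = urlsplit(url)
    if keep == "domain":
        return parts.netloc
    if keep == "no_query":
        # urlsplit puts everything after '?' in parts.query,
        # so leaving it out drops the parameters.
        return parts.netloc + parts.path
    raise ValueError("keep must be 'domain' or 'no_query'")


url = "http://cl.com.com/Click?q=30-u8NEIRrULwxBzvvqdPBZGbLc-8cR"
print(trim_url(url, keep="domain"))    # option 1
print(trim_url(url, keep="no_query"))  # option 2
```

Under either option the long one-off token from the query string never
reaches the wordlist; option 2 still keeps path words such as Click.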

What do y'all think?

David




