html_tokenizer

Barry Gould BarryGould at PennySaverUSA.net
Fri Feb 21 00:42:27 CET 2003


I received spam with no text at all in eyespace, just some image links like 
this:
<TD><a href="http://www.abe10.net/mortgage/">   <IMG NAME="a012" 
SRC="http://www.abe10.net/Ads/MORTGAGE/01/001_2x1.gif" WIDTH="265" 
HEIGHT="81" BORDER="0"></a></TD>

Therefore, I think it would be nice if the URL were broken up (if it isn't 
already), so that the word "mortgage" would be recognized as a token.
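
To illustrate what I mean by breaking up the URL, here is a rough sketch in Python. It is not bogofilter's actual lexer; the function name split_url_tokens and the rule of splitting on every non-alphanumeric character are my own assumptions for the example.

```python
import re

def split_url_tokens(url):
    """Split a URL into lowercase pieces on any run of non-alphanumeric characters.

    Hypothetical helper for illustration only; a real tokenizer may use
    different delimiters and case handling.
    """
    return [piece.lower() for piece in re.split(r"[^A-Za-z0-9]+", url) if piece]

# The link target and image source from the spam sample above.
href_tokens = split_url_tokens("http://www.abe10.net/mortgage/")
src_tokens = split_url_tokens("http://www.abe10.net/Ads/MORTGAGE/01/001_2x1.gif")

print(href_tokens)
print("mortgage" in src_tokens)
```

With a split like this, "mortgage" shows up as its own token for both the href and the SRC value, even though the message has no visible text.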

It looks like it already works this way in 0.10.0.

Thanks,
Barry

At 03:27 PM 2/20/2003, Nick Simicich wrote:
>As a side issue in html_tokenizer, it might be reasonable to deal with 
>quoted strings inside of tokens.  Specifically, if someone codes something as:
>
><a href="http://foo.bar.com/whatever.html"> how should that be tokenized?
>
>Should the entire quoted string be one token?
>
>Should a > inside the quoted string be respected?
>
><a href="http://foo.bar.com/whatever >.html" >  Current code ends the 
>token at the space before the first > and the tag at the first >, but 
>perhaps the token should end at the closing " and the tag at the second >?
>
>It would be simple to end the quoted string by either (1) looking for 
>pairs of quotes in preference to other things when you are in the state 
>that indicates that you are in a token that is not a comment, or 
>(2) starting another state that is used when you have a token that starts 
>with a " when you are inside an HTML tag.
>
>There are two reasons for doing this "right".  One is that tagging tokens 
>as from within html tags may be important to telling spam from 
>non-spam.  HTML with table tags might be spam, while a question about how 
><td> is used, in a plain text section, might not be spam.
>
>The other is that this might be a simple way for spammers to make 
>things seem less spammy.
>Does anyone have any spam-in-the-wild cases of people protecting their 
>spam words using any technique like this?
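
For what it's worth, the quote-aware state Nick describes could be sketched roughly like this in Python. This is only an illustration of the idea, not how html_tokenizer is written; find_tag_end and tag_tokens are made-up names, and it assumes only double quotes delimit attribute values and ignores comments entirely.

```python
def find_tag_end(text, start):
    """Return the index of the '>' closing the tag that opens at text[start].

    A '>' inside a double-quoted string is skipped. Returns -1 if the tag
    is never closed. Hypothetical helper for illustration.
    """
    in_quotes = False
    for i in range(start + 1, len(text)):
        ch = text[i]
        if ch == '"':
            in_quotes = not in_quotes      # enter or leave the quoted state
        elif ch == '>' and not in_quotes:
            return i                       # first '>' outside quotes ends the tag
    return -1


def tag_tokens(tag_body):
    """Split the inside of a tag on whitespace, keeping quoted strings whole."""
    tokens, current, in_quotes = [], [], False
    for ch in tag_body:
        if ch == '"':
            in_quotes = not in_quotes
            current.append(ch)
        elif ch.isspace() and not in_quotes:
            if current:                    # whitespace outside quotes ends a token
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


html = '<a href="http://foo.bar.com/whatever >.html" >'
end = find_tag_end(html, 0)
print(end == len(html) - 1)       # the tag ends at the second '>'
print(tag_tokens(html[1:end]))
```

One caveat with this approach: an unmatched quote would make the scanner swallow everything up to the next quote, so a real implementation would probably want some limit or fallback.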




