html_tokenizer

Wed Feb 19 23:20:16 CET 2003

At 05:08 PM 2/19/03, Nick Simicich wrote:
>At 11:46 AM 2003-02-19 -0500, David Relson wrote:
>
>>Nick,
>>
>>It's looking like a job well done :-)
>>
>>I've got a copy of bogolexer that uses html_tokenizer.l as an alternative 
>>to the usual lexer_text_html.l.  The '-j' switch determines whether your 
>>code or the old code is used.
>>
>>It looks good and I'd like to run "make check" with it.  Unfortunately, 
>>the new code returns additional tokens which will cause the regression 
>>tests to complain.  With a sample html message, the new code returns 
>>tokens from "<body...>" and "<a ...>" tags.  Unfortunately my quick tests 
>>didn't reveal which part of the code was allowing the extras to get back 
>>to bogolexer.  Can you point me in the right direction to make them go 
>>away (at least temporarily)?
>
>Are you saying that it (intentionally) returns tokens like <a and 
><body?  Or that it returns tokens from the interior of same?

No.  It's the interior tokens that I want to suppress temporarily.  Once I 
confirm that the new code can duplicate the results of the old code I can 
allow changes and know the reason for them.

>For the former, these are returned by the item that ends (in my trial) 
>with BEGIN HTOKEN:
>
><INITIAL>"<""/"?[[:alnum:]]*    {
>                                 printf("In %s\n", yytext);
>                                 BEGIN HTOKEN;
>                                 }
>
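
A rough sketch of what the quoted rule does, written in Python rather than flex so it can be run on its own. It is not the html_tokenizer.l code. The state names INITIAL and HTOKEN and the tag-open pattern come from the rule above; the word pattern follows the "strictly alpha followed by alnum" description later in this message. Everything else is an assumption made for illustration: the tokenize() function, the keep_interior flag, ASCII-only character classes, and the idea that ">" returns the scanner to INITIAL.

```python
import re

# Same shape as the quoted pattern: "<", optional "/", zero or more alnum.
# Assumption: ASCII only, whereas flex's [[:alnum:]] depends on the locale.
TAG_OPEN = re.compile(r"</?[A-Za-z0-9]*")
# "Strictly alpha followed by alnum" word pattern.
WORD = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def tokenize(text, keep_interior=True):
    """Return (state, token) pairs; drop HTOKEN tokens if keep_interior is False."""
    state = "INITIAL"
    pos = 0
    out = []
    while pos < len(text):
        if state == "INITIAL":
            m = TAG_OPEN.match(text, pos)
            if m:
                state = "HTOKEN"      # the quoted action does BEGIN HTOKEN here
                pos = m.end()
                continue
        elif text[pos] == ">":
            state = "INITIAL"         # assumption: ">" closes the tag
            pos += 1
            continue
        m = WORD.match(text, pos)
        if m:
            if state == "INITIAL" or keep_interior:
                out.append((state, m.group()))
            pos = m.end()
        else:
            pos += 1                  # skip punctuation and whitespace
    return out


sample = '<a href="x.html">click</a> now'
print(tokenize(sample))                       # interior tokens included
print(tokenize(sample, keep_interior=False))  # interior tokens suppressed
```

With keep_interior=False only the words outside the tags survive, which is the temporary behavior asked for above. Note that the pattern allows zero alnum characters, so a bare "<" in running text also switches state in this sketch.
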
>If you want to suppress all tokens that are inside of a token, uncomment 
>the following.  You can cause it to "do the right thing" (ignore comments) 
>with the following change to HTMLTOKEN:
>
>{HTMLTOKEN}      {printf("html token: -%s-\n", yytext);}
>
>HTMLTOKEN       "<"[^!][^\>]*">"|"<>"
>
>This, of course, is untested, but I believe it will work because I am a 
>perfect programmer (yeah, right).  Now there is another issue:  I am using 
>a really different character set for my tokens than you were.  You were 
>using a non-alphabetic character set, I was using a strictly alpha 
>followed by alnum set.  I am not going to discuss which is better, that 
>was one of my shortcuts.
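
To see what the proposed HTMLTOKEN definition would and would not swallow, here is a small check in Python. The regular expression is a transliteration of the quoted flex pattern, and strip_tags() is a hypothetical helper, not part of bogofilter. Python's re picks the first alternative that matches while flex picks the longest match; for the sample below the two agree, but that is not guaranteed for every input.

```python
import re

# Python spelling of the quoted flex definition:  "<"[^!][^\>]*">"|"<>"
HTMLTOKEN = re.compile(r"<[^!][^>]*>|<>")


def strip_tags(text):
    """Remove every span HTMLTOKEN matches; anything else is left in place."""
    return HTMLTOKEN.sub("", text)


sample = '<body bgcolor="white">hi<!-- note --><a href="x">there</a><>'
print(strip_tags(sample))   # prints: hi<!-- note -->there
```

The [^!] after "<" is what keeps the pattern off anything starting with "<!", so the comment is left for some other rule to handle. The separate "<>" alternative is needed because the first alternative requires at least one character between the brackets.
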

I changed the name of your alphanumeric TOKEN pattern to ALPHANUM and 
enabled the more complex one (to be consistent with bogofilter's normal 
behavior).

I'll let you know whether the expected changes do what I want.

Thanks.

David