html_tokenizer

Nick Simicich njs at scifi.squawk.com
Wed Feb 19 23:08:40 CET 2003


At 11:46 AM 2003-02-19 -0500, David Relson wrote:

>Nick,
>
>It's looking like a job well done :-)
>
>I've got a copy of bogolexer that uses html_tokenizer.l as an alternative 
>to the usual lexer_text_html.l.  The '-j' switch determines whether your 
>code or the old code is used.
>
>It looks good and I'd like to run "make check" with it.  Unfortunately, 
>the new code returns additional tokens which will cause the regression 
>tests to complain.  With a sample html message, the new code returns 
>tokens from "<body...>" and "<a ...>" tags.  Unfortunately my quick tests 
>didn't reveal which part of the code was allowing the extras to get back 
>to bogolexer.  Can you point me in the right direction to make them go 
>away (at least temporarily)?

Are you saying that it (intentionally) returns tokens like <a and 
<body?  Or that it returns tokens from the interior of same?

For the former, these are returned by the rule whose action ends (in my 
trial) with BEGIN HTOKEN:

<INITIAL>"<""/"?[[:alnum:]]*    {
                                 /* "<" or "</" plus any run of alnums */
                                 printf("In %s\n", yytext);
                                 BEGIN HTOKEN;  /* enter HTOKEN start condition */
                                 }

If you want to suppress all tokens that are inside of a tag, uncomment the 
following rule.  You can cause it to "do the right thing" (ignore comments) 
with the following change to the HTMLTOKEN definition:

{HTMLTOKEN}      {printf("html token: -%s-\n", yytext);}

HTMLTOKEN       "<"[^!][^\>]*">"|"<>"

This, of course, is untested, but I believe it will work because I am a 
perfect programmer (yeah, right).

Now there is another issue:  I am using a really different character set 
for my tokens than you were.  You were using a non-alphabetic character 
set; I was using a strictly alpha followed by alnum set.  I am not going 
to discuss which is better; that was one of my shortcuts.
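
As a rough sanity check of the HTMLTOKEN change above, the same pattern can 
be tried outside of flex.  The sketch below is not part of html_tokenizer.l: 
it transcribes the definition into Python's re module, and the sample string 
and the name find_html_tokens are made up for illustration.  Python takes the 
first alternative that matches while flex takes the longest match, so this 
only approximates what the flex rule would do.

```python
import re

# Python transcription of the proposed flex definition
#   HTMLTOKEN  "<"[^!][^\>]*">"|"<>"
# (assumption: flex quoting removed, pattern otherwise unchanged).
HTMLTOKEN = re.compile(r"<[^!][^>]*>|<>")


def find_html_tokens(text):
    """Return every substring of text that the HTMLTOKEN pattern matches."""
    return HTMLTOKEN.findall(text)


if __name__ == "__main__":
    # Hypothetical sample: two tags, a comment, a closing tag and an empty "<>".
    sample = '<body bgcolor="white"><!-- note --><a href="x">link</a><>'
    for token in find_html_tokens(sample):
        # Same output format as the printf in the flex rule.
        print("html token: -%s-" % token)
```

With that sample, the comment is not matched because [^!] rejects the "!" 
right after "<", while the tags and the empty "<>" are.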



>Thanks.
>
>David
>--------------------------------------------------------
>David Relson                   Osage Software Systems, Inc.
>relson at osagesoftware.com       Ann Arbor, MI 48103
>www.osagesoftware.com          tel:  734.821.8800
>
>
>

--
SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally 
to mean electronic messages designed to be read by an individual, and it 
can include Usenet, SMS, AIM, etc.  But if it is not all three of 
Unsolicited, Bulk, and E-mail, it simply is not spam. Misusing the term 
plays into the hands of the spammers, since it causes confusion, and 
spammers thrive on confusion. Spam is not speech, it is an action, like 
theft, or vandalism. If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!


