The concept of using lex to parse comments and html tags out of html...

Nick Simicich njs at scifi.squawk.com
Mon Feb 17 12:56:05 CET 2003


Long rambling post.  Summary.

1. I've made progress on the code to reorder comments and html tags.
2. I want to do it all in lex.
3. It is harder than I thought when I started.  For that matter, it is 
harder than I thought it was a few hours ago.
4. I've written some proof-of-concept code I would like people to look 
at.  Some html tags should be reordered, some should not.  My code tries to 
limit the reordering to the cases that should be reordered.  The code does 
the reordering and is lex driven; there is no separate step for reordering 
the text.

Currently, the processing to remove comments is "out of line" --- that is, 
it is done as a separate step preceding the lex tokenizing, so the Lex 
scanner never sees the comments.  The answer I got when I asked why it was 
done that way was, "lex/flex breaks the token when it sees an html tag or 
comment."

Dave and I agreed that neither of us knew how to make lex not do that.

At the time, I volunteered to write the code to move the html comments out 
of line. (This was a mistake, trust me.) I also had been seeing a lot of 
spam where meaningless tags were used to break words into pieces.

Th<!-- comment -->is could be used to br<xyzzy>ea</www>k words up, but 
unknown tags will also break words as far as the existing naive parser goes.

Paired tags like th<b></b>is can also break words without altering the 
format.  It is likely that all of those tags will be ignored by the various 
formatters that people use to look at html mail; however, they will all 
cause bogofilter to break tokens.  Spammers are doing all of this today to 
make it harder for statistical analyzers and simple string comparators to 
determine what is and is not spam.

The key here is "eyespace". Analyzers like bogofilter work in ascii, but 
renderers use "eyespace". I doubt that we can win against someone who is 
willing to use tables to mash text up, but it may be that tables in e-mail 
are a strong indicator of spam.  However, some of the tricks used by 
spammers can be fought.

I had started writing the code to move the comments out of line, and to 
save the other html tags out of line.  My initial thought was to add the 
code to the current html comment deleter.  I was going to shift all the 
HTML tags at the same point.

After playing with moving all tags out of line, I realized I was 
wrong.  One can't blindly move all tags out of line, because some tags 
cause a formatting break -- luckily my test case ended up with 
inappropriately mashed up tokens, and I started looking at the mess.

There are breaking html tags and there are non-breaking html tags, and in 
order to decide whether the current token can be parsed on its own, you 
need to know whether a breaking tag is attached to it somewhere.  My theory 
is that if you have a string like:

foo<!-- comment --><br>bar

then, because of the <br>, "foo" is one token and "bar" is another.  
However, if you have a string like:

foo<!-- comment
--><b></b>bar

none of the html tags between foo and bar cause a formatting break, so you 
want to process "foobar" as a token, and you want to stash "<!-- comment 
--><b></b>" for later processing.

On the surface, the only difference between the two cases is the content of 
the html tags between the word halves.  (The code I have written handles 
both cases.)

This ends up being complex enough that it should be driven by something 
like lex that abstracts things out a bit.  In order to do this properly, 
you need to do lookahead, and you need to be able to handle a fair amount 
of complexity.  Lex can handle the complexity using states, it can do 
lookahead, and it can match complex regular expressions against multi-line 
spans.  It can also be driven to do reprocessing, and so forth.

I have made a cut at moving the processing back into Lex.  I have written 
the code that deals with eliding a tag, pushing the partial token back into 
the input, and continuing, such that mes<scrawny scum><!-- 
comment --><whatever mopery and dopery>sage is processed as "message" in a 
single lexical token, and the "<scrawny scum><!-- comment --><whatever 
mopery and dopery>" is kept somewhere as a player to be named and parsed 
later.  The method I am using is to call unput() to push the "mes" back 
onto the input stream.  I got it working entirely in Lex because I wanted 
to simplify my life - I know that this technique can't be used as-is for 
real, because the decoding of quoted-printable and base64, and the 
segmenting of html and plain text via mime boundaries, and other things are 
being done outside of lex, and those things will have to stay outside of lex.

But what I did was to use the call to yywrap to indicate that we were at 
end of file, and then I simply modified the code flex generated to add a 
bit of processing that copies the data from a singly-linked list of buffers 
back into the lex input.  The second pass then simply parses the tokens out 
of the tags that were stashed away.
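
In my proof of concept I edited the generated code by hand, but I think the 
same effect can be had with flex's own yy_scan_string() call.  A rough 
sketch, assuming a single fixed-size stash buffer instead of the linked 
list (and leaking the buffer yy_scan_string creates, which is fine for a 
sketch):

%{
#include <stdio.h>
#include <string.h>

/* Hypothetical stash: the real code uses a singly-linked list of buffers. */
static char stash[8192];
static int  stash_used = 0;
static int  replaying  = 0;
%}

%%

"<"[^>]*">"   {
            if (replaying) {
                /* second pass: now the stashed tags get their turn */
                printf("stashed tag, pass two: %s\n", yytext);
            } else if (stash_used + yyleng + 2 < (int)sizeof stash) {
                /* first pass: save the tag for later instead of tokenizing it */
                memcpy(stash + stash_used, yytext, yyleng);
                stash_used += yyleng;
                stash[stash_used++] = ' ';
            }
        }
[A-Za-z]+     { printf("token: %s\n", yytext); }
.|\n          { }

%%

/* Called at end of input.  If tags were stashed, point the scanner at the
   stash and keep going (return 0); otherwise really stop (return 1). */
int yywrap(void)
{
    if (!replaying && stash_used > 0) {
        replaying = 1;
        stash[stash_used] = '\0';
        yy_scan_string(stash);   /* switches the current input buffer */
        return 0;
    }
    return 1;
}

int main(void) { return yylex(); }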

The other reason I felt that this processing should be in lex is that some 
tags are breaking tags.  For example, if you find a <br>, that is going to 
cause a break, and you do not want to change mes<br>sage into message, 
because that is not how it will format in "eyespace". In that case, "mes" 
is one token and "sage" is another.
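
To make the unput() business concrete, here is a minimal, self-contained 
sketch - it is not html_tokenize.l, and it only treats comments and <b>/<i> 
as non-breaking - that glues across non-breaking tags and leaves breaking 
tags alone:

%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
%}
%option noyywrap

WORD      [A-Za-z]+
COMMENT   "<!--"([^-]|"-"[^-])*"-->"
NONBREAK  {COMMENT}|"<"("/")?("b"|"i")">"

%%

{WORD}({NONBREAK})+/[A-Za-z]   {
            /* A word fragment, a run of non-breaking tags, and more word
               characters coming up (the /[A-Za-z] trailing context).
               Stash the tags, then push the fragment back so it rejoins
               the rest of the word on the next match. */
            char *copy = strdup(yytext);     /* unput() clobbers yytext */
            char *tags = strchr(copy, '<');
            int i;
            fprintf(stderr, "stashed: %s\n", tags);  /* stand-in for a real stash */
            *tags = '\0';
            for (i = (int)strlen(copy) - 1; i >= 0; i--)
                unput(copy[i]);              /* e.g. push "mes" back */
            free(copy);
        }
{WORD}                    { printf("token: %s\n", yytext); }
{COMMENT}|"<"[^>]*">"     { /* a tag with nothing to glue: leave the break alone */ }
.|\n                      { }

%%

int main(void) { return yylex(); }

Fed "Th<!-- comment -->is th<b></b>is mes<br>sage", it prints This, this, 
mes and sage as tokens, and the comment and the <b></b> pair go to stderr 
in place of a real stash.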

I am looking at which tags cause a formatting break.  The theory is that 
the right thing to do is to construct a regex that will recognize them, and 
that should mean that lex/flex is the right place to build it.  p, br, 
li, h[1-6], hr, div, and, I am sure, others are "breaking" tags, whereas 
a, b, and many others (including all unrecognized tags) are not.  If the 
tag matches "<"{BREAKING}([ \t\n][^>]*)?">" (or there is a series of tags 
and any tag in the series is a breaking tag), then you can peel off the 
token that is touching that series of tags and process those tags in 
place; but if you have a token followed by a series of tags, and none of 
those tags are breaking, you want to peel them all off and stash them.
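
In lex pattern terms, the decision can be written as two competing rules.  
A sketch, with the {BREAKING} list abbreviated (the full list is further 
down):

%{
#include <stdio.h>
%}
%option noyywrap

WORD      [A-Za-z]+
BREAKING  (p|br|li|h[1-6]|hr)
ANYTAG    "<"[^>]*">"
BRKTAG    "<"{BREAKING}([ \t\n][^>]*)?">"

%%

{WORD}({ANYTAG})*{BRKTAG}({ANYTAG})*/[A-Za-z]  {
            /* at least one tag in the run is breaking: the word fragment
               stands on its own, and the tags get processed in place */
            printf("breaking run:     %s\n", yytext);
        }
{WORD}({ANYTAG})+/[A-Za-z]                     {
            /* no breaking tag in the run - if there were one, the rule
               above would match the same length and win by being first -
               so stash the tags and glue the word back together */
            printf("non-breaking run: %s\n", yytext);
        }
{WORD}        { printf("token: %s\n", yytext); }
.|\n          { }

%%

int main(void) { return yylex(); }

The actions here just report which case fired; the real actions are the 
peel/stash/unput business described above.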

The patterns I have written make heavy use of lookahead and of the Lex rule 
that the longest match wins, with ties going to the earliest pattern.  I 
use that to select which rule fires (whether this is a thing that needs to 
be stashed or one that does not).

The comment tag is just one case of a non-breaking tag.  In other words, a 
comment is no different than xyzzy - html processors pretty universally 
ignore tags they do not understand.  Except for one thing, of course --- 
when you are in a comment, the only thing that the most common formatters 
will recognize as the end of the tag is "-->".

I started out by trying to modify lexer_text_html.l, but I was unable to 
make it work. Frankly, I am teaching myself lex as I go, and I made changes 
that caused error messages, which I was then unable to debug.  So I decided 
to simplify my life by making the changes in a separate parser, where all 
changes were in a single file.  That is the way it is now.  You *should* be 
able to get http://majordomo.squawk.com/njs/html_tokenize.l, then

flex html_tokenize.l

cc lex.yy.c -g -o html_tokenize

and finally:

./html_tokenize < some_input_test_file

Note that the input file should be plain text - no quoted-printable or 
base64.  I was testing with a hotmail message that was text/html according 
to its RFC822 headers.

Like I said, I could not figure out what I was doing wrong, so I decided to 
start a new parser just to play.  I have
http://majordomo.squawk.com/njs/html_tokenize.l as a copy of what I have 
gotten done.  The basic parser works, and the putback code works (a strdup 
and free for every putback is not how I would write the final version, 
although you do need to copy yytext before you start the unput() calls).  
But the patterns are how I would write them.  I do not know whether it 
would be a good idea to try to handle quoted strings inside the html tags.  
It would not be horrible - it would require an additional state inside the 
html state - but I would have to determine how the various renderers 
handle quoted strings inside of tags.

Finally, there has been strong feeling that we want to parse the contents 
of html as tags.  I believe this myself, and I also believe that the tag 
population might itself be useful for classification.  I am using lex 
states to manage being inside a tag and parsing tokens from inside the 
tag.  I believe it would be a good idea to keep the tags as <tag and </tag 
because that is easy, obvious, and allows for simple processing.
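
A sketch of the state handling I have in mind (simplified - no comment 
handling here, and the attribute grammar is just a guess): keep "<tag" / 
"</tag" as the token for the tag itself, pull word tokens out of the inside 
of the tag from an INTAG state, and use a nested INSTRING state so that a 
">" inside a quoted attribute does not end the tag:

%{
#include <stdio.h>
%}
%option noyywrap
%x INTAG INSTRING

%%

"<"("/")?[A-Za-z][A-Za-z0-9]*  { printf("tag:   %s\n", yytext); BEGIN(INTAG); }
<INTAG>">"                     { BEGIN(INITIAL); }
<INTAG>\"                      { BEGIN(INSTRING); /* quoted attribute value */ }
<INTAG>[A-Za-z]+               { printf("token: %s\n", yytext); }
<INTAG>(.|\n)                  { }
<INSTRING>\"                   { BEGIN(INTAG); }
<INSTRING>[A-Za-z]+            { printf("token: %s\n", yytext); }
<INSTRING>(.|\n)               { /* a ">" in here does not end the tag */ }
[A-Za-z]+                      { printf("token: %s\n", yytext); }
.|\n                           { }

%%

int main(void) { return yylex(); }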

I think we should consider dropping the differentiation between html 
comments and other html tags, and always doing the html processing (other 
than the special handling of the comment close).

I used http://www.w3schools.com/tags/tag_phrase_elements.asp as the list 
of all possible tags, and I classified the html markup tags listed there 
into "breaking" and non-breaking.  I believe that the following lex regular 
expression, prepended with "<", matches all of the html markup tags that 
cause a format break.  (Note that a stand-alone </pre> tag will cause a 
break in formatting even if there is no matching <pre> tag.)  So if there 
are two tokens with html tags, but no whitespace outside the tags, between 
them, and one or more of the html tags matches the regexp below, then there 
will be "eyespace" between the tokens; whereas if none of the html tags 
between the two tokens match, there will be no eyespace, and the two should 
be processed as a single token.

p|br|li|h[1-6]|hr|title|table|center|dd|dt|iframe|img|input|select|td|textarea|th|\/?(div|blockquote|pre|dir|dl|fieldset|legend|form|menu|ol|ul) 
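
In the lex source, that expression would sit in the definitions section, 
something like this (the ([ \t\n][^>]*)? attribute tail is my guess at the 
right way to skip over attributes):

%{
#include <stdio.h>
%}
%option noyywrap

BREAKING  p|br|li|h[1-6]|hr|title|table|center|dd|dt|iframe|img|input|select|td|textarea|th|\/?(div|blockquote|pre|dir|dl|fieldset|legend|form|menu|ol|ul)
BRKTAG    "<"{BREAKING}([ \t\n][^>]*)?">"

%%

{BRKTAG}       { printf("breaking tag:     %s\n", yytext); }
"<"[^>]*">"    { printf("non-breaking tag: %s\n", yytext); }
.|\n           { }

%%

int main(void) { return yylex(); }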


I am probably wrong about this regular expression, and I invite 
correction.  Please e-mail me a sample and the browser you used.  I 
"preferred" to use IE for my testing, but I checked some in Lynx as 
well.  I changed the comment processing to match IE and Netscape's comment 
processing - they will only recognize <!-- as the beginning of a comment 
and --> as the end.  As it turns out, you can code comments as: <! -- -- > 
if you want, and that will *happen* to work, because the processing that is 
going on is simply the "unknown tag" processing. <! is an unknown tag, 
which then terminates at >, while <!-- is a comment and only terminates at 
-->.  I handled that by adding a state that is entered at the start of a 
comment and that you can only exit by matching -->.
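
A sketch of just that part - the split between "<!--" (a comment, closed 
only by "-->") and plain "<!" (an unknown tag, closed by ">") - with a 
made-up BANGTAG state for the latter:

%{
#include <stdio.h>
%}
%option noyywrap
%x COMMENT BANGTAG

%%

"<!--"             { BEGIN(COMMENT);  /* a real comment: only "-->" ends it */ }
<COMMENT>"-->"     { BEGIN(INITIAL); }
<COMMENT>(.|\n)    { /* comment text: discard here, or stash it for later */ }
"<!"               { BEGIN(BANGTAG);  /* "<! -- -- >" is just an unknown tag */ }
<BANGTAG>">"       { BEGIN(INITIAL); }
<BANGTAG>(.|\n)    { }
[A-Za-z]+          { printf("token: %s\n", yytext); }
.|\n               { }

%%

int main(void) { return yylex(); }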

These are the changes I believe should be made:

1. I believe that the current code that is used to remove comments from 
html outside of the lex process should be deleted.

2. I believe that there should be, at most, a single switch that controls 
all reordering.  "At most" allows for zero switches, in which case the code 
would always process html as html and always reorder.  If the decision is 
made not to reorder, the switch should select a start state in the Lex 
parser with simpler actions and patterns that separate tokens more cheaply, 
so that overhead is lowered (see the sketch after this list).

3.  I believe we should process comments the IE way.  Again, this goes back 
to eyespace.  I believe that the contents of comments (and other html tags) 
should be processed for tokens, but the question is what you relocate and 
move and so forth.
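
To sketch what I mean in item 2 (the -simple switch and the SIMPLE state 
are made-up names), the switch would just pick a start state before 
scanning begins, and the SIMPLE rules would do less work:

%{
#include <stdio.h>
#include <string.h>
%}
%option noyywrap
%s SIMPLE

%%

<INITIAL>"<"[^>]*">"   { /* full treatment: reorder and stash, as described above */ }
<SIMPLE>"<"[^>]*">"    { /* cheap treatment: just throw the tag away */ }
[A-Za-z]+              { printf("token: %s\n", yytext); }
.|\n                   { }

%%

int main(int argc, char **argv)
{
    /* Hypothetical switch: -simple selects the cheaper rules via a start state. */
    if (argc > 1 && strcmp(argv[1], "-simple") == 0)
        BEGIN(SIMPLE);
    return yylex();
}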

Now, I am still unsure about one thing.  There are tags, for example 
<script></noscript>, that can be inserted as a pair.  You can use such a 
pair to elide things.  "spam<script> </noscript>mish" looks like a broken 
word (even to my code, because I will break at the space)... but it will be 
rendered, in "eyespace", as a single word.  I believe this will work in 
lynx... or in IE with scripts disabled or enabled.  My first inclination is 
not to allow for this, because it will slow the parser down and we have 
seen no one use this particular approach.  If someone codes it as 
"spam<script></noscript>mish", my scheme will collapse it to "spammish", 
but I am making the assumption that white space outside of a tag is real 
white space.  There are a few tags like this that can be used to elide 
text.  It is certainly possible to add states, but to handle this with 
"lookahead" would require that these tags appear in pairs.  Everything 
between a <style> and a </style> gets elided, as does everything between a 
<script> and a </noscript>.  Even <title> and </title> will elide what is 
between them without causing a break.

There are probably quite a few of these, and I suspect that the regular 
expression that would check for all of them would be seriously long.
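
If we ever do decide to handle it, it is the same state trick again.  A 
sketch that throws away everything from <script...> to the closing tag - I 
close on </script> here, and whether </noscript> should also close it is 
exactly the renderer question above:

%{
#include <stdio.h>
%}
%option noyywrap caseless
%x ELIDE

%%

"<script"[^>]*">"          { BEGIN(ELIDE);   /* nothing in here shows up in eyespace */ }
<ELIDE>"</script"[^>]*">"  { BEGIN(INITIAL); }
<ELIDE>(.|\n)              { }
[A-Za-z]+                  { printf("token: %s\n", yytext); }
.|\n                       { }

%%

int main(void) { return yylex(); }

Gluing "spam" and "mish" back together across the elided region would then 
be the same unput() business as before.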

--
SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally 
to mean electronic messages designed to be read by an individual, and it 
can include Usenet, SMS, AIM, etc.  But if it is not all three of 
Unsolicited, Bulk, and E-mail, it simply is not spam. Misusing the term 
plays into the hands of the spammers, since it causes confusion, and 
spammers thrive on  confusion. Spam is not speech, it is an action, like 
theft, or vandalism. If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!


