The concept of using lex to parse comments and html tags out of html...
Nick Simicich
njs at scifi.squawk.com
Mon Feb 17 12:56:05 CET 2003
Long rambling post. Summary.
1. I've made progress on the code to reorder comments and html tags.
2. I want to do it all in lex.
3. It is harder than I thought when I started. For that matter, it is
harder than I thought it was a few hours ago.
4. I've written some proof-of-concept code I would like people to look
at. Some html tags should be reordered and some should not; my code tries
to limit the reordering to the cases that actually need it. The reordering
is done by the lex rules themselves - there is no separate pass over the
text.
Currently, the processing to remove comments is "out of line"; that is, it
is done as a separate step that precedes the lex tokenizing. When I asked
why, the answer was, "lex/flex breaks the token when it sees an html tag
or comment." Dave and I agreed that neither of us knew how to make lex not
do that.
At the time, I volunteered to write the code to move the html comments out
of line. (This was a mistake, trust me.) I also had been seeing a lot of
spam where meaningless tags were used to break words into pieces.
Th<!-- comment -->is could be used to br<xyzzy>ea</www>k words up; even
unknown tags will break words as far as the existing naive parser goes.
Paired tags like th<b></b>is can also break words without altering the
format. It is likely that all of those tags will be ignored by the various
formatters that people use to look at html mail; however, they will all
cause bogofilter to break tokens. Spammers are doing all of this today to
make it harder for statistical analyzers and simple string comparators to
determine what is and is not spam.
The key here is "eyespace". Analyzers like bogofilter work in ascii, but
renderers use "eyespace". I doubt that we can win against someone who is
willing to use tables to mash text up, but it may be that tables in e-mail
are a strong indicator of spam. However, some of the tricks used by
spammers can be fought.
I had started writing the code to move the comments out of line, and to
save the other html tags out of line as well. My initial thought was to
add the code to the current html comment deleter and to shift all the
HTML tags out at that same point.
After playing with moving all tags out of line, I realized I was
wrong. One can't blindly move all tags out of line, because some tags
cause a formatting break -- luckily my test case ended up with
inappropriately mashed up tokens, and I started looking at the mess.
There are breaking html tags, and there are non-breaking html tags, and in
order to decide if the current token can be parsed stand alone, you need to
know if there is a breaking tag somewhere attached to the current
token. My theory is this: if you have a string like
foo<!-- comment --><br>bar
then, because of the br, foo is one token and bar is another. However, if you
have a string like:
foo<!-- comment
--><b></b>bar
none of the html tags between foo and bar cause a formatting break, so you
want to process "foobar" as a token, and you want to stash "<!-- comment
--><b></b>" for later processing.
The only superficial difference between the two cases is the content of
the html tags. (The code I have written handles both cases.)
This ends up being complex enough that it should be driven by something
like lex that abstracts things out a bit. In order to do this properly,
you need to do lookahead, and you need to be able to handle a fair amount
of complexity. Lex can handle the complexity, using states, and it has the
ability to do lookahead, and to match complex regular expressions against
multi-line spans of text. It can be driven to do reprocessing, and so forth.
I have made a cut at moving the processing back into Lex. I have gotten the
code written that deals with eliding a tag, pushing the partial token back
into the processing queue, and continuing, such that mes<scrawny scum><!--
comment --><whatever mopery and dopery>sage is processed as "message" in a
single lexical token, and the "<scrawny scum><!-- comment --><whatever
mopery and dopery>" is kept somewhere as a player to be named and parsed
later. The method I am using is to use unput() to push the "mes" back onto
the processing stack. I got it working entirely in Lex because I wanted to
simplify my life. I know that this all-in-lex setup can't be used for real,
because the decoding of quoted printable and base64, the segmenting of html
and plain text via mime boundaries, and other things are being done
elsewhere, and those will have to stay outside of lex.
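Just to make the mechanics concrete, here is a minimal sketch of that
push-back step. This is not the real html_tokenize.l: the breaking versus
non-breaking check is left out (every tag run gets glued over), and the
names are mine.

%{
#include <stdlib.h>
#include <string.h>
%}
%option noyywrap
WORD [A-Za-z]+
TAG  "<"[^>]*">"
%%
{WORD}{TAG}+/{WORD}  {
        /* yytext is e.g. "mes<scrawny scum><!-- comment -->"; copy it,
           because unput() clobbers yytext */
        char *copy = strdup(yytext);
        char *tags = strchr(copy, '<');
        /* a real version would stash a copy of tags here for later parsing */
        *tags = '\0';                      /* copy now holds just "mes" */
        for (size_t i = strlen(copy); i > 0; )
            unput(copy[--i]);              /* push back in reverse order */
        free(copy);
    }
{WORD}  { printf("token: %s\n", yytext); }
{TAG}   { /* a tag that is not glued over; ignored in this sketch */ }
.|\n    { /* ignore everything else */ }
%%
int main(void) { return yylex(); }

Because the rule consumes the tags and pushes back only the bare fragment,
mes<scrawny scum><!-- comment -->sage should come back around as a single
"message" token.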
But what I did was to use the yywrap call (which lex makes when it hits
end of file), and then I simply modified the generated include to add a
bit of processing that copies the data from a singly linked list of
buffers back into the lex input. The processing should then simply parse
the tokens out of the tags that it has stashed away.
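An equivalent sketch that stays inside flex's documented interface,
instead of patching the generated code, would stash the tag text on a list
and use an <<EOF>> rule plus yy_scan_string() to feed it back through the
same scanner. Again the names (stash and friends) are made up, and for
simplicity it stashes every tag rather than just the reordered runs:

%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* singly linked list of stashed tag text */
struct stash { char *text; struct stash *next; };
static struct stash *stash_head;

static void stash(const char *s)
{
    struct stash *n = malloc(sizeof *n);
    n->text = strdup(s);
    n->next = stash_head;
    stash_head = n;
}
%}
%option noyywrap
%x REPLAY
%%
"<"[^>]*">"        { stash(yytext); }    /* grossly simplified: stash every tag */
[A-Za-z]+          { printf("token: %s\n", yytext); }
.|\n               { /* ignore */ }
<REPLAY>[A-Za-z]+  { printf("token from stashed tag: %s\n", yytext); }
<REPLAY>.|\n       { /* brackets, dashes, '=' and so on from the stashed tags */ }
<<EOF>>  {
        /* the real input (or the previous stashed buffer) is exhausted */
        if (stash_head) {
            struct stash *n = stash_head;
            stash_head = n->next;
            yy_delete_buffer(YY_CURRENT_BUFFER);
            yy_scan_string(n->text);   /* creates and switches to a new buffer */
            BEGIN(REPLAY);             /* now just pull tokens out of the tag text */
            free(n->text);
            free(n);
        } else {
            yyterminate();
        }
    }
%%
int main(void) { return yylex(); }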
The other reason I felt that this processing should be in lex is that some
tags are breaking tags. For example, if you find a <br>, that is going to
cause a break, and you do not want to change mes<br>sage into message,
because that is not how it will format in "eyespace". In that case, "mes"
is one token and "sage" is another.
I am looking at which tags cause a formatting break. The theory is that
the right thing to do is to construct a regex that will recognize them,
which should mean that lex/flex is the right place to build them. p, br,
li, h[1-6], hr, div, and, I am sure, others are "breaking" tags, whereas
a, b, and many others (including all unrecognized tags) are not. If the
tag matches "<"{BREAKING}([ \t\n][^>]*|"")">" (or there is a series of
tags, and any tag in the series is a breaking tag), then you can peel off
the token that is touching that series of tags and process those tags in
place; but if you have a token followed by a series of tags and none of
those tags are breaking, you want to peel them all off and stash them.
The patterns I have written make heavy use of lookahead and of the lex
rule that the longest match is the controlling one, with ties going to the
first pattern listed. I use that to decide which rule fires (whether this
is a thing that needs to be stashed or one that does not).
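Here is a stripped-down sketch of how those patterns and that
longest/first-match rule interact. It is not the real file: BREAKING is
cut down to three tags, a comment is treated as just another tag, and the
glue action only prints what it would do (the actual stash-and-push-back
is in the sketch further up).

%option noyywrap
WORD     [A-Za-z]+
BREAKING p|br|hr
BRKTAG   "<"{BREAKING}([ \t\n][^>]*)?">"
ANYTAG   "<"[^>]*">"
%%
{WORD}/({ANYTAG})*{BRKTAG}({ANYTAG})*{WORD}?  {
        /* a breaking tag touches this fragment, so it stands alone in
           eyespace; listed first so it wins the tie against the glue rule */
        printf("stand-alone: %s\n", yytext); }
{WORD}{ANYTAG}+/{WORD}  {
        /* only non-breaking tags sit between this fragment and the next
           one: here the real code would stash the tags and push the
           fragment back so it glues to what follows */
        printf("would glue: %s\n", yytext); }
{WORD}    { printf("token: %s\n", yytext); }
{BRKTAG}  { /* breaking tag, listed before ANYTAG so it wins that tie too */ }
{ANYTAG}  { /* non-breaking or unknown tag */ }
.|\n      { /* ignore everything else */ }
%%
int main(void) { return yylex(); }

On foo<!-- comment --><br>bar this should classify foo as stand-alone,
while on foo<!-- comment --><b></b>bar the glue rule should fire instead.
A real version would also want %option caseless, since html tag names are
not case sensitive.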
The comment token is just one case of the non-breaking token. In other
words, comment is no different than xyzzy - html processors pretty
universally ignore tokens they completely do not understand. Except for
one thing, of course --- when you are in a comment, the only thing that the
most common formatters will recognize as a token end is "-->".
I started out by trying to modify lexer_text_html.l, but I was unable to
make it work. Frankly, I am teaching myself lex as I go, and I made changes
that caused error messages, which I was then unable to debug. So I decided
to simplify my life by making the changes in a separate parser, where all
changes were in a single file. That is the way it is now. You *should* be
able to get http://majordomo.squawk.com/njs/html_tokenize.l, then
flex html_tokenize.l
cc lex.yy.c -g -o html_tokenize
and finally:
./html_tokenize < some_input_test_file
Note that the file should be plain
text - no quoted printable or base64. I was testing with a hotmail file
that was text/html in the RFC822 headers.
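As a quick smoke test, a made-up plain text input along the lines of the
examples in this note will do, for instance a file containing

Th<!-- comment -->is is a mes<b></b>sage, but this mes<br>sage should stay split.

fed to ./html_tokenize on stdin as above.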
Like I said, I could not figure out what I was doing wrong, so I decided to
start a new parser just to play. I have
http://majordomo.squawk.com/njs/html_tokenize.l as a copy of what I have
gotten done. The basic parser works, and the putback code works (this is
not how I would actually write the final version, with a strdup and free
for every putback, although you need to copy before you start the unput()
stuff for maximum flexibility). But the patterns are how I would write
them. I do not know if it would be a good idea to try to handle quoted
strings inside the html tags. That would not be horrible (it would require
an additional state inside the html state), but I would have to determine
how the various renderers handle quoted strings inside of tags.
Finally, there has been strong feeling that we want to parse the contents
of html tags as tokens. I believe this myself, and I also believe that tag
population might be a classification scheme. I am using lex states to
manage the condition of being inside a tag and parsing tokens from inside
the tag. I believe that it would be a good idea to keep the tags as <tag
and </tag because that is easy and obvious, and it allows for simple
processing.
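Here is a sketch of what I mean, with the tag name kept in its <tag or
</tag form and with the quoted-string state mentioned above (double quotes
only, illustrative names, and comments are not handled in this one):

%option noyywrap
%x INTAG QUOTED
%%
"<"("/")?[A-Za-z][A-Za-z0-9]*  { printf("tag: %s\n", yytext);   /* "<tag" or "</tag" */
                                 BEGIN(INTAG); }
[A-Za-z]+                      { printf("token: %s\n", yytext); }
.|\n                           { /* ignore */ }
<INTAG>">"                     { BEGIN(INITIAL); }   /* end of the tag */
<INTAG>\"                      { BEGIN(QUOTED); }    /* an attribute value opens */
<INTAG>[A-Za-z]+               { printf("in-tag token: %s\n", yytext); }
<INTAG>.|\n                    { /* '=', whitespace and other punctuation in the tag */ }
<QUOTED>\"                     { BEGIN(INTAG); }     /* the value closes */
<QUOTED>[A-Za-z]+              { printf("in-tag token: %s\n", yytext); }
<QUOTED>.|\n                   { /* a '>' inside quotes does not end the tag */ }
%%
int main(void) { return yylex(); }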
I think that we should consider dropping the differentiation between html
comments and other html tags, and that we should always do the html
processing (other than the special handling of the comment close).
I used http://www.w3schools.com/tags/tag_phrase_elements.asp as the list
of all possible tags, and I classified the html markup tags listed there
into "breaking" and non-breaking. I believe that the following lex regular
expression, prepended with "<", matches all of the html markup tags that
cause format breaking. That is, a stand-alone </pre> tag will cause a
break in formatting even if there is no <pre> tag. So if two tokens are
separated only by HTML tags, with no uncaptured whitespace between them,
and one or more of those tags matches the regexp below, then there will be
"eyespace" between the tokens; whereas if none of the tags between the two
tokens matches, there will be no eyespace, and they should be processed as
a single token.
p|br|li|h[1-6]|hr|title|table|center|dd|dt|iframe|img|input|select|td|textarea|th|\/?(div|blockquote|pre|dir|dl|fieldset|legend|form|menu|ol|ul)
I am probably wrong about this regular expression, and I invite
correction. Please e-mail me a sample and the browser you used. I
"preferred" to use IE for my testing, but I checked some in Lynx as
well. I changed the comment processing to match IE and Netscape's comment
processing - they will only recognize <!-- as the beginning of a comment
and --> as the end. As it turns out, you can code comments as: <! -- -- >
if you want, and that will *happen* to work, because the processing that is
going on is simply the "unknown tag" processing. <! is an unknown tag,
which then terminates at >, while <!-- is a comment and only terminates at
-->. I handled that by adding a state that is entered when a comment
starts, and which you can only exit by matching -->.
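In flex terms, that state looks roughly like this (simplified, and kept
separate from the other sketches):

%option noyywrap
%x COMMENT
%%
"<!--"              { BEGIN(COMMENT); }   /* only "<!--" opens a comment */
"<!"[^->][^>]*">"   { /* "<!" not followed by "--" is just an unknown tag,
                         so "<! -- -- >" ends at the first '>' */ }
"<"[^!>][^>]*">"    { /* any other tag, ends at the first '>' */ }
[A-Za-z]+           { printf("token: %s\n", yytext); }
.|\n                { /* ignore */ }
<COMMENT>"-->"      { BEGIN(INITIAL); }   /* the only way back out */
<COMMENT>[A-Za-z]+  { printf("comment token: %s\n", yytext); }
<COMMENT>.|\n       { /* a bare '>' or a newline does not end the comment */ }
%%
int main(void) { return yylex(); }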
These are the changes I believe should be made:
1. I believe that the current code that is used to remove comments from
html outside of the lex process should be deleted.
2. I believe that there should be, at most, a single switch that controls
all reordering. "At most" allows for zero switches, in which case the code
would always process html as html and always reorder. If we do keep the
option of not reordering, then the switch should control a start condition
in the Lex parser that selects simpler actions and patterns, separating
tokens in a cheaper way so that overhead is lowered (see the sketch after
this list).
3. I believe we should process comments the IE way. Again, this goes back
to eyespace. I believe that the contents of comments (and other html tags)
should be processed for tokens, but the question is what you relocate and
move and so forth.
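The switch in point 2 could pick between rule sets with a start condition
that is chosen when the scanner is entered. A rough sketch, with a made-up
flag name:

%option noyywrap
%x REORDER SIMPLE
%{
int reorder_html = 1;   /* hypothetical switch, set from the command line or config */
%}
%%
%{
    /* code before the first rule runs each time yylex() is entered */
    if (YY_START == INITIAL)
        BEGIN(reorder_html ? REORDER : SIMPLE);
%}
<REORDER>[A-Za-z]+  { printf("reordering token: %s\n", yytext); }
<REORDER>.|\n       { /* the full tag/comment reordering rules would live here */ }
<SIMPLE>[A-Za-z]+   { printf("simple token: %s\n", yytext); }
<SIMPLE>.|\n        { /* cheaper rules: just split tokens, no reordering */ }
%%
int main(void) { return yylex(); }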
Now, I am still unsure about one thing. There are tags, for example
<script></noscript>, that can be inserted as a pair, and the pair can be
used to elide things. "spam<script> </noscript>mish" looks like a broken
word (even to my code, because I will break at the space)...but it will be
rendered,
in "eyespace", as a single word. I believe this will work in lynx...or in
IE with scripts disabled or enabled. My first inclination is to not allow
for this because it will slow the parser down and we have seen no one use
this particular approach. If someone codes it as
"spam<script></noscript>mish" my scheme will collapse it to "spammish", but
I am making the assumption that contained white space is white space, if it
is not inside a tag. There are a few tags like this that can be used to
elide. It is certainly possible to add states, but to do this as
"lookahead" would require that these tags appear in pairs. Everything
between a style and a /style gets elided, as does everything between a
script and a noscript. Even <title> and </title> will elide what is
between them without causing a break.
There are probably quite a few of these, and I suspect that the regular
expression that would check for all of these would be seriously long.
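If we ever did want to handle these eliding pairs, the shape would
probably be one more exclusive state per pair. A rough sketch, using
<script>...</script> as the closing pair for simplicity (the style, title,
and script/noscript variants above would each need their own rules):

%option noyywrap caseless
%x SKIP
%%
"<script"[^>]*">"         { BEGIN(SKIP); }   /* same idea for style, title, ... */
"<"[^>]*">"               { /* any other tag; not the point of this sketch */ }
[A-Za-z]+                 { printf("token: %s\n", yytext); }
.|\n                      { /* ignore */ }
<SKIP>"</script"[^>]*">"  { BEGIN(INITIAL); }   /* everything up to here was elided */
<SKIP>.|\n                { /* swallowed: this text never reaches eyespace */ }
%%
int main(void) { return yylex(); }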
--
SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally
to mean electronic messages designed to be read by an individual, and it
can include Usenet, SMS, AIM, etc. But if it is not all three of
Unsolicited, Bulk, and E-mail, it simply is not spam. Misusing the term
plays into the hands of the spammers, since it causes confusion, and
spammers thrive on confusion. Spam is not speech, it is an action, like
theft, or vandalism. If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!