lexer state machine [was: uuencoded attachments produce woe]
David Relson
relson at osagesoftware.com
Sun Dec 8 16:06:57 CET 2002
At 09:13 AM 12/8/02, Matthias Andree wrote:
>I'm wondering if it would be hard to make lexer.l's parsing stateful; no
>need to split that stuff.
>
>My idea is: parse the MIME header (REQUIRED!), Content-Type (multipart
>and message are the interesting primary types here), and
>Content-Transfer-Encoding. If multipart, push the boundary line onto a
>stack, and the boundary parser will look at the stack to figure out
>whether a line is a valid boundary and how many levels to pop from the
>stack. To do it simply, we'd only need to be able to push one token back
>to the parser and block a particular rule; the question is how much
>performance and size impact that has (if we need REJECT rules, for
>example).
Matthias,
I think we're moving in that direction, i.e. stateful. The lexer already
uses states; two examples are the past_header and url: code. I expect to
see more of that.
A couple of times in the past decade I've used yacc to advantage; still,
as a tool I don't use often or know well, I've found it difficult to work
with.
Gyepi wants to use the eps library. He's used it before and praises it; I
need to go read about it. FYI, the URL is http://inter7.com/eps and eps
stands for "Email Processing System".
On the subject of boundary lines: every once in a while I dump my wordlists
with bogoutil and look at the output. In addition to normal tokens, I see
lots of Korean text, numbers, URLs, and boundary lines. They all use up
space, and I don't think they significantly help classification. URLs may
be useful, but I doubt the others are of value.
More information about the Bogofilter mailing list