Radical lexers

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Tue Jan 20 17:58:26 CET 2004


David Relson wrote:

>> > File bf.tmp, generated by the standard lexer, has 5005 lines
>> > (tokens) in it.  File pi.tmp, generated by your lexer, has 5918
>> > lines.  That's an increase of almost 1/5.  Many of the differences
>> > are tokens like 0.408692, 0.410978, 0.412559, 0.412734, 0.413214,
>> > 0.416318, 0.418804, etc. which seem unlikely to recur.
>> 
>> As a first guess, I would assume there are also a lot of 1-
>> and 2-byte tokens. Actually, numbers were useful in my tests,
>> IIRC. At least I could not refute that.
>> 
>> But looking at your example, it seems you used my earlier
>> lexer. The later one does not allow a period in a token.
> 
> Disallowing periods would change the tokens to 408692, 410978, 412559,
> 412734, 413214, 416318, 418804, etc -- not a significant difference,
> AFAICT.

Not much, right. It might have an effect on typical prices,
though.
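
To illustrate (a toy Python sketch, not bogofilter's actual flex
rules; the sample text and the two patterns are made up): allowing
'.' inside a token keeps a price or probability whole, while
disallowing it splits the number at the decimal point.

  import re

  text = "Special offer: only $19.99, version 0.408692 included!"

  # Period allowed inside a token: "19.99" and "0.408692" stay whole.
  with_period = re.findall(r"[A-Za-z0-9.]+", text)
  # ['Special', 'offer', 'only', '19.99', 'version', '0.408692', 'included']

  # Period disallowed: each number splits in two at the period.
  without_period = re.findall(r"[A-Za-z0-9]+", text)
  # ['Special', 'offer', 'only', '19', '99', 'version', '0', '408692', 'included']

So for a price like $19.99 the two lexers really do see different
tokens, which is why it could matter.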

>> > This seems contrary to
>> > your efforts to minimize wordlist size :-(
>> 
>> Changing the lexer was not really about minimizing wordlists. It
>> was more for the pure sake of simplicity. Whether this is good or
>> bad is not clear in itself; I can just say it works for me.
> 
> It does seem that almost any reasonable lexer produces good results. 

Yes.

> A
> year ago, bogofilter was doing fine without knowledge of MIME, HTML,
> etc.  It seems that lexer details aren't all that important.

I have seen a huge jump in precision with MIME and HTML handling. But
as soon as we end up with plain "text", it looks fine.
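
(Again just a toy Python sketch, not bogofilter's real MIME/HTML
handling; the sample message is invented.) Without HTML awareness,
markup fragments become tokens, and a comment trick can shred a word
like "FREE"; stripping the tags first recovers the visible text:

  import re

  html = '<font color="#FF0000">F<!--x-->R<!--y-->E<!--z-->E</font> pills'

  # Naive lexing of the raw source: tag noise becomes tokens and
  # "FREE" is split into single letters.
  naive = re.findall(r"[A-Za-z0-9]+", html)
  # ['font', 'color', 'FF0000', 'F', 'x', 'R', 'y', 'E', 'z', 'E', 'font', 'pills']

  # Drop tags/comments first, then lex: the visible words survive.
  stripped = re.sub(r"<[^>]*>", "", html)   # -> 'FREE pills'
  clean = re.findall(r"[A-Za-z0-9]+", stripped)
  # ['FREE', 'pills']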

pi



