Radical lexers
David Relson
relson at osagesoftware.com
Tue Jan 20 17:55:27 CET 2004
On Tue, 20 Jan 2004 17:47:20 +0100
Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:
> David Relson wrote:
>
> > Out of curiosity, I built bogolexers with the standard lexer_v3.l
> > and yours
>
> Just to be sure. Did you use
> http://piology.org/bogofilter/lexer_v3.l or
> http://piology.org/bogofilter/lexer_v3.l.radical (I use the
> latter now).
The former, rather than the latter.
> > File bf.tmp, generated by the standard lexer, has 5005 lines
> > (tokens) in it. File pi.tmp, generated by your lexer, has 5918
> > lines. That's an increase of almost 1/5. Many of the differences
> > are tokens like 0.408692, 0.410978, 0.412559, 0.412734, 0.413214,
> > 0.416318, 0.418804, etc. which seem unlikely to recur.
>
> As a first shot I would assume there are also a lot of 1-
> and 2-byte tokens. Actually, numbers were useful in my tests,
> IIRC. At least I could not refute that.
>
> But looking at your example it seems you took the former
> lexer of mine. The latter does not allow period in a token.
Disallowing periods would change the tokens to 408692, 410978, 412559,
412734, 413214, 416318, 418804, etc. -- not a significant difference,
AFAICT.
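To make the difference concrete, here is a rough sketch in Python of the two tokenization behaviors being discussed. This is only an illustration, not bogofilter's actual flex rules: the regular expressions and the 3-character minimum token length are assumptions for the example.

```python
import re

# Illustrative sketch only -- NOT bogofilter's actual lexer rules.
# Two hypothetical token patterns: one allowing an embedded period
# (like the earlier lexer discussed above), one disallowing it.
WITH_PERIOD = re.compile(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*")
NO_PERIOD = re.compile(r"[A-Za-z0-9]+")

MIN_LEN = 3  # assumed minimum token length; shorter fragments dropped


def tokenize(text, pattern):
    """Return tokens matching `pattern`, discarding very short ones."""
    return [t for t in pattern.findall(text) if len(t) >= MIN_LEN]


sample = "scores 0.408692 0.410978"
print(tokenize(sample, WITH_PERIOD))  # ['scores', '0.408692', '0.410978']
print(tokenize(sample, NO_PERIOD))    # ['scores', '408692', '410978']
```

Under these assumptions, disallowing the period splits "0.408692" at the dot; the lone "0" falls below the length cutoff, leaving just "408692" -- matching the token lists quoted above.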
...[snip]...
> > This seems contrary to
> > your efforts to minimize wordlist size :-(
>
> Changing the lexer was not really about minimizing wordlists. It
> was more for the pure sake of simplicity. Whether that is good or
> bad is not clear in itself. I can just say it works for me.
>
> pi
It does seem that almost any reasonable lexer produces good results. A
year ago, bogofilter was doing fine without knowledge of MIME, HTML,
etc. It seems that lexer details aren't all that important.
More information about the Bogofilter mailing list