Radical lexers

Thu Dec 11 08:40:42 CET 2003

michael at optusnet.com.au wrote:

>'a' is the 0.15.9 lexer with TOKEN replaced with '[[:alnum:]]+'

That is not enough. You also need to fix other places.

>There's a point here that really needs to be remembered.
>
>Small corpora work better with simple lexers.
>Larger corpora work better with complex lexers.

I am not sure if this is robust.

>Long explanation:
>
>This is pretty obvious when you think about it. The bogofilter
>implementation of the bayes algorithm suffers (as almost all
>implementations do) from quantization noise. That quantization
>noise is worse when the token counts are low.

:>In my opinion, this is the key reason that train-on-error is a bad
:>idea. It leaves the token counts low, and thus maximizes the
:>quantization error for a given result. I think. :)

It's nice to have such a theory, but it contradicts practice.

>That is, when the token count is 1, the quantization error is a very
>large percentage. This is because we're counting discrete events to
>approximate a continuous probability. 
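
To put a number on the quoted claim, here is a minimal sketch. It assumes
a token has some true spam probability p and that we estimate it as k/n
from n sightings. The function name and the binomial model are my own
illustration, not anything taken from bogofilter.

```python
from math import comb

def expected_abs_error(n, p):
    """Expected |k/n - p| when k ~ Binomial(n, p).

    n is the number of times the token was seen, p is the (assumed)
    true probability that a sighting comes from spam.
    """
    total = 0.0
    for k in range(n + 1):
        prob_k = comb(n, k) * p**k * (1 - p) ** (n - k)
        total += prob_k * abs(k / n - p)
    return total

# With a single sighting the estimate can only be 0 or 1.
for n in (1, 10, 100):
    print(n, round(expected_abs_error(n, 0.3), 4))
```

For n = 1 and p = 0.3 the expected error is 0.42, and it shrinks as n
grows.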

That is what robx and robs are good for.
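
For reference, a sketch of Robinson's smoothing, f(w) = (s*x + n*p) / (s + n),
where x is robx (the value assumed for an unseen token) and s is robs (the
weight given to that assumption). The parameter values below are chosen
for illustration only, and p is simplified to spam_count / n, which
assumes equally many spam and ham messages were trained.

```python
def smoothed_prob(spam_count, ham_count, robx=0.5, robs=1.0):
    """Robinson-style smoothing of a token's spam probability.

    robx and robs here are illustrative values, not bogofilter defaults.
    """
    n = spam_count + ham_count
    if n == 0:
        # Never seen: fall back to the prior entirely.
        return robx
    p = spam_count / n  # raw estimate, simplified (see assumptions above)
    return (robs * robx + n * p) / (robs + n)

# A token seen once in spam is pulled towards robx instead of jumping to 1.0.
print(smoothed_prob(1, 0))    # 0.75
print(smoothed_prob(100, 0))  # close to 1.0
```

The lower the count, the more the result leans on robx, so a single
sighting cannot produce an extreme score.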

>In general, the more complex the lexer and the higher the subsequent
>token count from a given amount of text, the lower the average token
>count. Thus, the quantization error per token will be higher.

That would just mean that simpler lexers, like the one in the test,
perform better.
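
The mechanism in the quoted paragraph can be shown on a toy example. This
is only a sketch: the three messages, both lexer functions and the
[a-z0-9]+ pattern (an ASCII stand-in for '[[:alnum:]]+') are made up for
illustration, and it counts token occurrences rather than whatever
bogofilter stores.

```python
import re
from collections import Counter

# Hypothetical (subject, body) pairs, not real corpus data.
MESSAGES = [
    ("cheap pills", "buy cheap pills now"),
    ("meeting notes", "notes from the meeting"),
    ("cheap flights", "cheap flights and cheap hotels"),
]

WORD = re.compile(r"[a-z0-9]+")

def simple_lexer(subject, body):
    """One token class for the whole message."""
    return WORD.findall((subject + " " + body).lower())

def tagging_lexer(subject, body):
    """Subject tokens get a 'subj:' prefix, so they count separately."""
    tagged = ["subj:" + t for t in WORD.findall(subject.lower())]
    return tagged + WORD.findall(body.lower())

def mean_count(lexer):
    """Average number of occurrences per distinct token."""
    counts = Counter()
    for subject, body in MESSAGES:
        counts.update(lexer(subject, body))
    return sum(counts.values()) / len(counts)

print("simple :", mean_count(simple_lexer))
print("tagging:", mean_count(tagging_lexer))
```

Both lexers emit the same total number of tokens here, but the tagging
lexer spreads them over more distinct tokens, so the average count per
token drops.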

>So when we add things like 'subj:' tagging et al, this will only be
>a net positive when it doesn't significantly impact the noise
>levels. Provided the corpus is 'large enough' this will be
>true. Unfortunately, the converse also applies: Small corpora will
>be adversely affected by things like header tagging et al.

Actually, this is not what we observed when those tags were
introduced.

>Bottom line: It would be worthwhile testing proposed changes with both
>small and large corpora.

We could test all special features this way. But I doubt we
are able to do it.

pi