Radical lexers
Boris 'pi' Piwinger
3.14 at logic.univie.ac.at
Thu Dec 11 08:40:42 CET 2003
michael at optusnet.com.au wrote:
>'a' is the 0.15.9 lexer with TOKEN replaced with '[[:alnum:]]+'
That is not enough. You also need to fix other places.
>There's a point here that really needs to be remembered.
>
>Small corpora work better with simple lexers.
>Larger corpora work better with complex lexers.
I am not sure if this is robust.
>Long explanation:
>
>This is pretty obvious when you think about it. The bogofilter
>implementation of the bayes algorithm suffers (as almost all
>implementations do) from quantization noise. That quantization
>noise is worse when the token counts are low.
>In my opinion, this is the key reason that train-on-error is a bad
>idea. It leaves the token counts low, and thus maximizes the
>quantization error for a given result. I think. :)
It's nice to have such a theory, but it contradicts practice.
>That is, when the token count is 1, the quantization error is a very
>large percentage. This is because we're counting discrete events to
>approximate a continuous probability.
That is what robx and robs are good for.
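
To make that concrete, here is a minimal sketch of the Robinson-style
smoothing that robx and robs control, f(w) = (s*x + n*p(w)) / (s + n),
with robs as s and robx as x. This is not bogofilter's actual code: the
function name, the argument names and the parameter values below are
assumptions chosen only for illustration.

```python
def robinson_f(spam_count, ham_count, spam_msgs, ham_msgs, robx=0.5, robs=1.0):
    """Smoothed spam probability for one token (illustrative sketch).

    robx: assumed probability for a token never seen before.
    robs: weight given to robx relative to the observed counts.
    The default values here are illustrative, not bogofilter's defaults.
    """
    n = spam_count + ham_count
    if n == 0:
        # No evidence at all: fall back to the prior.
        return robx
    spam_rate = spam_count / spam_msgs
    ham_rate = ham_count / ham_msgs
    p = spam_rate / (spam_rate + ham_rate)  # raw estimate from the counts
    # Blend the prior and the raw estimate, weighted by the evidence n.
    return (robs * robx + n * p) / (robs + n)


# A token seen once (spam only) is pulled strongly towards robx;
# a token seen 20 times (spam only) is hardly affected.
once = robinson_f(1, 0, 100, 100)
often = robinson_f(20, 0, 100, 100)
print(once, often)
```

With a count of 1 the raw estimate would be 1.0, but the smoothed value
stays well below that, so a single observation cannot dominate the score.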
>In general, the more complex the lexer and the higher the subsequent
>token count from a given amount of text, the lower the average token
>count. Thus, the quantization error per token will be higher.
That would just say that simpler lexers, like the one in the test,
perform better.
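
The quoted claim about average token counts can be illustrated with a
toy comparison. This is only a sketch under stated assumptions: the two
tokenizers and the two sample messages are made up for illustration and
are not bogofilter's lexer or real corpus data.

```python
import re
from collections import Counter

WORD = re.compile(r"[A-Za-z0-9]+")


def plain_tokens(subject, body):
    """Simple lexer: subject and body words share one token space."""
    return WORD.findall((subject + " " + body).lower())


def tagged_tokens(subject, body):
    """More complex lexer: subject words get a 'subj:' prefix."""
    subj = ["subj:" + t for t in WORD.findall(subject.lower())]
    return subj + WORD.findall(body.lower())


def mean_count(tokens):
    """Average number of occurrences per distinct token."""
    counts = Counter(tokens)
    return sum(counts.values()) / len(counts)


# Hypothetical two-message corpus: (subject, body).
messages = [
    ("cheap pills", "buy cheap pills now"),
    ("meeting notes", "notes from the meeting"),
]

plain = [t for s, b in messages for t in plain_tokens(s, b)]
tagged = [t for s, b in messages for t in tagged_tokens(s, b)]

print(mean_count(plain), mean_count(tagged))
```

The same text yields more distinct tokens with tagging, so each token is
seen fewer times on average. Whether that actually hurts classification
on a small corpus is exactly the point under dispute here.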
>So when we add things like 'subj:' tagging et al, this will only be
>a net positive when it doesn't significantly impact the noise
>levels. Provided the corpus is 'large enough' this will be
>true. Unfortunately, the converse also applies: Small corpora will
>be adversely affected by things like header tagging et al.
Actually, this is not what we observed when those tags were
introduced.
>Bottom line: It would be worthwhile testing proposed changes with both
>small and large corpora.
We could test all special features this way. But I doubt we
are able to do it.
pi