Radical lexers

michael at optusnet.com.au
Thu Dec 11 05:58:47 CET 2003


michael at optusnet.com.au writes:
> "Boris 'pi' Piwinger" <3.14 at logic.univie.ac.at> writes:
> > [Corrected version]
> > 
[...] 
> This isn't super surprising. You're testing with a small corpus,
> on a very easy data set. You're well down in the noise level
> on both fp's and fn's.
> 
> I'd be curious to see the difference with a tougher dataset
> (specifically, a dataset that includes hams to many
> people :)

To follow up to myself:

a)      fp 8/75,000     fn 4104/75,000
b)      fp 9/75,000     fn 3322/75,000
c)      fp 5/75,000     fn 2231/75,000

'a' is the 0.15.9 lexer with TOKEN replaced with '[[:alnum:]]+'.

'b' is the virgin 0.15.9 lexer.

'c' is the 0.15.9 lexer amended to emit token pairs as well as the
original tokens (i.e. 'a b c' emits 'a', 'a*b', 'b', 'b*c', and 'c').
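
For clarity, here is a rough Python sketch of what 'a' and 'c' do.
The real thing is a flex scanner; the regex below is only an ASCII
approximation of [[:alnum:]]+, and the function names are made up:

    import re

    def alnum_tokens(text):
        # Variant 'a': TOKEN is just a run of alphanumerics.
        return re.findall(r'[A-Za-z0-9]+', text)

    def with_pairs(tokens):
        # Variant 'c': emit each token plus each adjacent pair.
        out = []
        for i, tok in enumerate(tokens):
            out.append(tok)
            if i + 1 < len(tokens):
                out.append(tok + '*' + tokens[i + 1])
        return out

    print(with_pairs(alnum_tokens('a b c')))
    # -> ['a', 'a*b', 'b', 'b*c', 'c']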


This corpus is 100,000 hams (drawn from a variety of mailboxes) and
100,000 spams (again, from a variety of mailboxes).

25,000 each of the hams and spams were randomly extracted from
the list and used to train bogofilter. Bogofilter was then
run against the remainder of the corpus to generate the numbers
above. (I actually did 15 runs with varying samples; the numbers
above are the medians.)
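
In outline, each run looked something like this. (train and classify
stand in for the actual bogofilter invocations; this is only a sketch
of the sampling procedure, not the real harness.)

    import random
    import statistics

    def run_trials(hams, spams, train, classify, trials=15, n_train=25000):
        fps, fns = [], []
        for _ in range(trials):
            # Randomly pick 25,000 of each class for training...
            ham_idx = set(random.sample(range(len(hams)), n_train))
            spam_idx = set(random.sample(range(len(spams)), n_train))
            model = train([hams[i] for i in ham_idx],
                          [spams[i] for i in spam_idx])
            # ...then score the remainder of the corpus.
            fps.append(sum(1 for i, h in enumerate(hams)
                           if i not in ham_idx and classify(model, h) == 'spam'))
            fns.append(sum(1 for i, s in enumerate(spams)
                           if i not in spam_idx and classify(model, s) == 'ham'))
        # Report the medians over the runs.
        return statistics.median(fps), statistics.median(fns)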

From this you can clearly see that just using 'alnum' is a 
substantial degradation from the normal lexer for the corpus
in question.


There's a point here that really needs to be remembered.

Small corpora work better with simple lexers.
Larger corpora work better with complex lexers.

Long explanation:

This is pretty obvious when you think about it. The bogofilter
implementation of the Bayes algorithm suffers (as almost all
implementations do) from quantization noise. That quantization
noise is worse when the token counts are low.

That is, when the token count is 1, the quantization error is a very
large percentage. This is because we're counting discrete events to
approximate a continuous probability. 

A count of one really represents the range (0.5, 1.5], thus the
expected error is around 25%. (I'm handwaving a lot here. There's a
bunch of detail I just skipped, the most important being that this is
actually a distribution rather than a range, but those details don't
affect the result, so bear with me. :)

Compare that to a count of 100. The range represented is now
(99.5, 100.5], but the expected error is now around 0.25% of the value.
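
As a back-of-the-envelope calculation (same handwave as above: treat
the true value as uniform over the interval the count stands for):

    def expected_relative_error(count):
        # The true value lies somewhere in (count - 0.5, count + 0.5].
        # The mean absolute deviation of a uniform distribution over a
        # unit interval is 0.25, so the relative error is ~0.25/count.
        return 0.25 / count

    for c in (1, 10, 100):
        print(c, expected_relative_error(c))
    # 1 -> 25%, 10 -> 2.5%, 100 -> 0.25%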


In general, the more complex the lexer, the more distinct tokens it
produces from a given amount of text, and so the lower the average
count per token. Thus, the quantization error per token will be higher.
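
You can see the effect with any chunk of training text (corpus.txt
below is whatever mail text you have handy; the '*' pairs are the same
pair tokens as in variant 'c'):

    from collections import Counter

    def mean_count(tokens):
        counts = Counter(tokens)
        return sum(counts.values()) / len(counts)

    words = open('corpus.txt').read().split()
    pairs = [a + '*' + b for a, b in zip(words, words[1:])]

    # Pair tokens are far more varied than single words, so adding them
    # pulls the average count per distinct token down.
    print('words only  :', mean_count(words))
    print('words+pairs :', mean_count(words + pairs))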


So when we add things like 'subj:' tagging et al, this will only be
a net positive when it doesn't significantly impact the noise
levels. Provided the corpus is 'large enough' this will be
true. Unfortunately, the converse also applies: small corpora will
be adversely affected by things like header tagging et al.
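
The header-tagging case in miniature: the same word seen in the
subject and in the body becomes two distinct tokens, each with a
smaller count.

    untagged = ['free', 'free']            # one token, count 2
    tagged   = ['subj:free', 'free']       # two tokens, count 1 each

With a large corpus that split is harmless; with a small one it pushes
both counts down into the noisy range described above.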

Determining the absolute size of 'large enough' is left as
an exercise for the student.



Bottom line: It would be worthwhile testing proposed changes with both
small and large corpora.


Michael.



