Radical lexers

Thu Dec 11 00:32:29 CET 2003

On 11 Dec 2003 09:29:08 +1100
michael at optusnet.com.au wrote:

> "Boris 'pi' Piwinger" <3.14 at logic.univie.ac.at> writes:
> > [Corrected version]
> > 
> > This is a very short test only. I compare my version (a) of
> > the lexer (http://piology.org/bogofilter/lexer_v3.l) with a
> > much stricter version of it (b). TOKEN will effectively be
> > of the form
> > [^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]=()+?:#$._!'`~-]+
> [...] 
> > Here is what I get:
> >       wordlist  false neg       false pos
> > a)    27060k    210/13612       16/15670
> > b)    26832k    206/13612       17/15670
> > c)    23332k    210/13612       18/15670
> > 
> > So the size is a surprise. I expected something much smaller
> > for b) and even more for c).
> 
> This isn't super surprising. You're testing with a small corpus,
> on a very easy data set. You're well down in the noise level
> on both fp's and fn's.
> 
> I'd be curious to see the difference with a tougher dataset
> (specifically, a dataset that includes hams to many
> people :)
> 
> Michael.
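
The "noise level" remark can be checked with a rough calculation. The
sketch below is only an illustration, not part of bogofilter or bogotune:
it takes the counts from the quoted table and estimates the binomial
standard deviation of each error count. It assumes the messages are
independent trials, which is a simplification since all three lexers were
run on the same corpus. The names `results` and `count_std_error` are
made up for this example.

```python
import math

# Counts copied from the quoted table: (errors, messages) per lexer variant.
results = {
    "a": {"fn": (210, 13612), "fp": (16, 15670)},
    "b": {"fn": (206, 13612), "fp": (17, 15670)},
    "c": {"fn": (210, 13612), "fp": (18, 15670)},
}


def count_std_error(errors, total):
    """Binomial standard deviation of an error count.

    Assumption: each message is an independent trial with the same
    error probability, estimated as errors / total.
    """
    p = errors / total
    return math.sqrt(total * p * (1 - p))


def summarize(table):
    """Return (lexer, kind, errors, rate in percent, std deviation) rows."""
    rows = []
    for name, kinds in table.items():
        for kind, (errors, total) in kinds.items():
            rate = 100.0 * errors / total
            rows.append((name, kind, errors, rate, count_std_error(errors, total)))
    return rows


for name, kind, errors, rate, sd in summarize(results):
    print(f"{name}) {kind}: {errors} errors ({rate:.3f}%), +/- {sd:.1f} messages")
```

Under that assumption the false-negative counts carry an uncertainty of
roughly 14 messages and the false-positive counts roughly 4, so the
differences between a), b) and c) (at most 4 and 2 messages) are smaller
than one standard deviation.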

Hi Michael,

Greg and I have just such a project for you to participate in :-)  He's
collecting corpora from several people and is planning to run them all
through bogotune.  The goal is to generate new default parameters for
bogofilter -- parameters that do a demonstrably good job on a wide
variety of messages.  Bogofilter's current defaults are based solely on
the corpora Greg and I had a year ago.  We want something based on
bogofilter's current parsing and scoring.  Would you care to
participate?  If so, I'll be glad to send you the needed details.

Cheers!

David