Radical lexers

David Relson relson at osagesoftware.com
Thu Dec 11 00:32:29 CET 2003


On 11 Dec 2003 09:29:08 +1100
michael at optusnet.com.au wrote:

> "Boris 'pi' Piwinger" <3.14 at logic.univie.ac.at> writes:
> > [Corrected version]
> > 
> > This is a very short test only. I compare my version (a) of
> > the lexer (http://piology.org/bogofilter/lexer_v3.l) with a
> > much stricter version of it (b). TOKEN will effectively be
> > of the form
> > [^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]=()+?:#$._!'`~-]+
> [...] 
> > Here is what I get:
> >       wordlist  false neg       false pos
> > a)    27060k    210/13612       16/15670
> > b)    26832k    206/13612       17/15670
> > c)    23332k    210/13612       18/15670
> > 
> > So the size is a surprise. I expected something much smaller
> > for b) and even more for c).
> 
> This isn't super surprising. You're testing with a small corpus,
> on a very easy data set. You're well down in the noise level
> on both fp's and fn's.
> 
> I'd be curious to see the difference with a tougher dataset
> (specifically, a dataset that includes hams to many
> people :)
> 
> Michael.
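
To put a rough number on "down in the noise level", here is a minimal sketch that treats each message as an independent trial and computes the binomial standard error of the counts in the quoted table. The independence assumption and the helper name rate_and_stderr are illustrative only; they are not part of bogofilter or bogotune.

```python
import math

def rate_and_stderr(errors, total):
    """Return the error rate and its binomial standard error."""
    p = errors / total
    return p, math.sqrt(p * (1 - p) / total)

# Counts copied from the quoted table: (false neg, false pos) per lexer.
results = {
    "a": ((210, 13612), (16, 15670)),
    "b": ((206, 13612), (17, 15670)),
    "c": ((210, 13612), (18, 15670)),
}

for name, (fn, fp) in results.items():
    fn_rate, fn_se = rate_and_stderr(*fn)
    fp_rate, fp_se = rate_and_stderr(*fp)
    # Express the standard error in messages so it compares with the raw counts.
    print(f"{name}) fn {fn[0]} +/- {fn_se * fn[1]:.1f} msgs"
          f"   fp {fp[0]} +/- {fp_se * fp[1]:.1f} msgs")
```

Under that assumption one standard error is roughly 14 messages for the false negatives and roughly 4 for the false positives, so the observed gaps (206 vs. 210, and 16 vs. 18) sit well inside it.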

Hi Michael,

Greg and I have just such a project for you to participate in :-)  He's
collecting corpora from several people and is planning to run them all
through bogotune.  The goal is to generate new default parameters for
bogofilter -- parameters that do a demonstrably good job on a wide
variety of messages.  Bogofilter's current defaults are based solely on
our (Greg's and mine) corpora of a year ago.  We want something based on
bogofilter's current parsing and scoring.  Would you care to
participate?  If so, I'll be glad to send you the needed details.

Cheers!

David



