11 days

David Relson relson at osagesoftware.com
Mon Dec 8 15:20:19 CET 2003


On Mon, 08 Dec 2003 13:08:41 -0000
"Peter Bishop" <pgb at adelard.com> wrote:

> On 5 Dec 2003 at 8:04, David Relson wrote:
> 
> > > Hi!
> > > 
> > > I just received a spam mail. This is the first after a
> > > little more than 11 days. This is really satisfying. In this
> > > time more than 2500 ham (230/d) and 1500 spam (140/d)
> > > messages were classified correctly. This is certainly my
> > > personal best so far:-))
> > > 
> 
> It certainly beats me,
> I am getting approx 1 in 1000 for both ham and spam.
> Mind you I am a bit behind the cutting edge now
> (13.6.2 used with case insensitive mode)
> 
> I assume your database was constructed using your train-to-exhaustion 
> bogotrain script.
> 
> Maybe we could set up a performance league table on the web-site with 
> associated config and training details, so people can get an idea of
> what works best in practice.

Peter,

An interesting idea!  I'm sure the table would be informative.  I also
predict it will show a wide variety of configurations -- each of which
works well at its own site.

As you know, Greg and I have recently been working a lot on
bogotune.  The initial goal was to make it faster, which has been done
by converting it from perl to C.  Now we're testing it with a number of
mail collections (mine, work and home for him, and collections from a
couple of users) and we're seeing that there are definite differences
between the collections.  For example, the best parameters for his 0624
collection and for my 1120 collection are quite different.  Below are
bogotune's findings of what's best for him and me:

his:
robx=0.600000
min_dev=0.020
robs=0.0316
spam_cutoff=0.873	# for 0.01% fp (1); expect 1.13% fn (159).
#spam_cutoff=0.824	# for 0.05% fp (6); expect 0.68% fn (96).
#spam_cutoff=0.780	# for 0.10% fp (12); expect 0.43% fn (60).
#spam_cutoff=0.758	# for 0.20% fp (25); expect 0.35% fn (49).
ham_cutoff=0.450	

mine:
robx=0.369138
min_dev=0.465
robs=0.0562
spam_cutoff=0.968	# for 0.05% fp (11); expect 6.64% fn (1180).
#spam_cutoff=0.629	# for 0.10% fp (23); expect 3.62% fn (642).
#spam_cutoff=0.500	# for 0.20% fp (47); expect 0.98% fn (174).
ham_cutoff=0.231	
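
To make the effect of the two cutoffs concrete, here is a minimal
sketch in Python (not bogofilter's actual code).  It assumes the usual
three-way reading of the cutoffs: a score at or above spam_cutoff is
spam, a score at or below ham_cutoff is ham, and anything in between is
unsure.  The function name classify and the sample scores are made up
for illustration; whether the boundaries are inclusive is also an
assumption.

```python
def classify(score, spam_cutoff, ham_cutoff):
    """Return 'spam', 'ham', or 'unsure' for a spamicity score in [0, 1].

    Assumed rule (illustrative only): >= spam_cutoff is spam,
    <= ham_cutoff is ham, everything in between is unsure.
    """
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"


# Active (uncommented) cutoffs copied from the two listings above.
HIS = {"spam_cutoff": 0.873, "ham_cutoff": 0.450}
MINE = {"spam_cutoff": 0.968, "ham_cutoff": 0.231}

# Hypothetical scores, chosen only to show where the two setups disagree.
for score in (0.10, 0.40, 0.90, 0.99):
    print(score, classify(score, **HIS), classify(score, **MINE))
```

With these made-up scores, 0.40 counts as ham under his cutoffs but
falls in the unsure band under mine, and 0.90 is spam for him but still
unsure for me.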

We've come to learn that no single set of parameters is best for all
bogofilter sites.  Our goal is to use bogotune to determine a good set
of parameters to use for bogofilter's defaults.  We're collecting email
corpora for that use.  I'm looking forward to seeing the final results
;-)
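
As a rough illustration of why one cutoff does not suit every
collection, the sketch below (Python, and not how bogotune itself
works) counts false positives and false negatives for a given
spam_cutoff over two sets of already-scored messages.  The function
name error_rates and all of the scores are invented for the example,
and it assumes a false negative is any spam scoring below spam_cutoff.

```python
def error_rates(ham_scores, spam_scores, spam_cutoff):
    """Return (fp_rate, fn_rate) for one cutoff over scored messages.

    fp: ham scoring at or above the cutoff (flagged as spam).
    fn: spam scoring below the cutoff (not flagged as spam).
    """
    fp = sum(1 for s in ham_scores if s >= spam_cutoff)
    fn = sum(1 for s in spam_scores if s < spam_cutoff)
    return fp / len(ham_scores), fn / len(spam_scores)


# Invented scores for two hypothetical collections, A and B.
ham_a, spam_a = [0.01, 0.05, 0.20, 0.60], [0.70, 0.90, 0.95, 0.99]
ham_b, spam_b = [0.02, 0.10, 0.30, 0.90], [0.55, 0.80, 0.97, 0.99]

cutoff = 0.873  # the active spam_cutoff from the first listing above
rates_a = error_rates(ham_a, spam_a, cutoff)
rates_b = error_rates(ham_b, spam_b, cutoff)
print("A: fp=%.2f fn=%.2f" % rates_a)
print("B: fp=%.2f fn=%.2f" % rates_b)
```

The same cutoff gives different fp and fn rates on the two toy
collections, which is the effect we see, on a much larger scale, with
the real ones.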

David



