[glouis at dynamicro.on.ca: Re: New vs Old]

Thu Mar 25 15:20:59 CET 2004

Meant to send this back to the list:

----- Forwarded message from Greg Louis <glouis at dynamicro.on.ca> -----

Date: Thu, 25 Mar 2004 09:19:29 -0500
From: Greg Louis <glouis at dynamicro.on.ca>
To: David Relson <relson at osagesoftware.com>
Subject: Re: New vs Old
Reply-To: Greg Louis <glouis at dynamicro.on.ca>
In-Reply-To: <20040325075009.7c42c167 at osage.osagesoftware.com>
Organization: Dynamicro Consulting Limited

On 20040325 (Thu) at 0750:09 -0500, David Relson wrote:
>                 cur     new
> robs            0.010   0.0178
> robx            0.415   0.52
> min_dev         0.1     0.375
> spam_cutoff     0.95    0.99
> ham_cutoff      0.00    0.00    (bi-state)
> ham_cutoff      0.10    0.45    (tri-state)
> 
> I'm noticing 3 of the differences and am wondering about them.  
> 
> First, robx is changing from slightly hammish to slightly spammish.  Our
> traditional preference of false negatives (rather than false positives)
> has had us prefer a hammish value.

No, in fact it's the spam cutoff that determines that balance.
Unknown tokens are excluded by both parameter sets, since in each case
robx lies within min_dev of 0.5, and the tiny s (robs) values ensure
that no significant prior weight is given to low-count tokens.
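
To illustrate the point about s, here is a minimal sketch assuming the
usual Robinson smoothing, f(w) = (s*x + n*p) / (s + n), with s = robs,
x = robx, n the number of times the token has been seen and p its
observed spam fraction.  The function name f_w and the example counts
are made up for illustration; this is not bogofilter's code.

```python
def f_w(p, n, robs, robx):
    """Robinson-style smoothed spamicity (assumed formula, see lead-in).

    robx is the prior for an unseen token, robs the weight of that
    prior, p the observed spam fraction, n the observation count.
    """
    return (robs * robx + n * p) / (robs + n)


if __name__ == "__main__":
    for label, robs, robx in (("cur", 0.010, 0.415), ("new", 0.0178, 0.52)):
        unseen = f_w(0.0, 0, robs, robx)       # n = 0: only the prior remains
        seen_once = f_w(1.0, 1, robs, robx)    # one sighting, in spam only
        print(label, unseen, seen_once)
```

With s this small, a single observation already swamps the prior: a
token seen once, in spam only, comes out above 0.99 under either
parameter set, so robx matters only for tokens that have never been
seen.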

> Second, min_dev has increased significantly.  This can be thought of as
> changing scoring from ignoring neutral tokens to using extrema.

Well, 0.45 might qualify as using extrema; 0.375 just uses the outside
quarter of the f(w) range (f(w) below 0.125 or above 0.875).  FWIW it's
likely to be _better_ for a small training database: if you're going to
allow low-deviation tokens to influence your scores, it's a good idea
to have a lot of them.
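
The arithmetic behind "outside quarter" can be sketched as follows,
assuming a token is scored only when f(w) is at least min_dev away from
0.5.  The helper names are hypothetical, and the treatment of the exact
boundary (>= rather than >) is an assumption.

```python
def is_scored(fw, min_dev):
    """True if a token with spamicity fw would take part in scoring."""
    return abs(fw - 0.5) >= min_dev


def scored_fraction(min_dev):
    """Share of the 0..1 f(w) range that is left once min_dev is applied."""
    return 1.0 - 2.0 * min_dev


if __name__ == "__main__":
    for min_dev in (0.1, 0.375, 0.45):
        print(min_dev, scored_fraction(min_dev))
    # An unseen token has f(w) = robx; check both parameter sets.
    print(is_scored(0.415, 0.1), is_scored(0.52, 0.375))
```

Under this reading, an unseen token (f(w) = robx) is left out of the
score by both the current and the new parameter sets.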

> Third, the increased spam_cutoff seems more conservative, i.e. a message
> has to score really high to be labeled spam.

The distribution of scores changes with the choice of parameters.  To
get Andrew's and my corpora to yield reasonable fp counts with those
parameters requires a cutoff of 0.99; your corpus probably needs 0.998
or higher, though the fn count would be pretty awful if you switched to
that parameter set.  A spam cutoff is "more conservative" only if it
produces fewer fp with the new parameters than the old spam cutoff did
with the old parameters.

> There are some big differences between this tuning run and the original
> effort in 2002.  We've got lots more experience and we know more.  In
> addition we have collected a large test corpus for testing.  Lastly, we
> have bogotune -- our search and detect tool for parameters.

The main difference, I think, is that the original effort had a far
more homogeneous (therefore less general) email population to work
with.  We also had worse tools and less knowledge, but I think the data
matter most.

> Considering all this, and the changes in min_dev and spam_cutoff (in
> particular), I'm wondering if bogofilter might need different parameters
> for large and small wordlists.  

That's why I suggested some of the other users try them.

> Unfortunately, I don't know how to do a meaningful test with small
> wordlists.

Easy.  Using the small wordlist, a user should evaluate a couple
thousand messages twice, once with the new parameters and once with
the parameters the user has been using up till now, adjusting the spam
cutoff to give similar fp counts, and see if the difference in fn is
striking.  If it isn't, the new parameters are a plausible starting
point.  If it is, I want to see the data from a number of such attempts
in order to try to figure out why.
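
A rough sketch of that comparison, assuming the scores for known ham
and known spam have already been collected under each parameter set
(for instance from bogofilter's verbose output), and that a message is
called spam when its score is at or above the cutoff.  All function
names and the list of candidate cutoffs are hypothetical.

```python
def count_fp(ham_scores, cutoff):
    """Ham messages that would be labelled spam at this cutoff."""
    return sum(1 for s in ham_scores if s >= cutoff)


def count_fn(spam_scores, cutoff):
    """Spam messages that would be missed at this cutoff."""
    return sum(1 for s in spam_scores if s < cutoff)


def lowest_cutoff(ham_scores, max_fp, candidates):
    """Lowest candidate cutoff giving at most max_fp false positives."""
    for cutoff in sorted(candidates):
        if count_fp(ham_scores, cutoff) <= max_fp:
            return cutoff
    raise ValueError("no candidate cutoff meets the fp target")


def compare(old, new, max_fp, candidates):
    """old and new are (ham_scores, spam_scores) pairs for the same
    messages, scored with the old and the new parameters.  Returns
    {name: (cutoff, fp, fn)} with the fp count held to the same target."""
    result = {}
    for name, (ham, spam) in (("old", old), ("new", new)):
        cutoff = lowest_cutoff(ham, max_fp, candidates)
        result[name] = (cutoff, count_fp(ham, cutoff), count_fn(spam, cutoff))
    return result
```

Holding fp to the same target for both sets means the fn counts can be
compared directly; the cutoffs themselves are not comparable between
parameter sets, for the reason given earlier.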

For this to work, though, the user needs to have experience of manually
tuning his small corpus over a period of time; I don't fancy just
building a number of artificial small training dbs and throwing a
couple thousand messages at them, because that way the choice of
"traditional" parameters is arbitrary and the conclusion is
problematic.

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |

----- End forwarded message -----