bogus bogotuning

Wed Jan 28 15:43:31 CET 2004

On 20040128 (Wed) at 0917:48 -0500, Clint Adams wrote:

> Is it possible for bogotune to output a number indicating the statistical
> reliability of the results?

It would be great, but not at all easy.  The number would depend
heavily not only on the size of the message corpus but also on the
homogeneity of the spam and nonspam and on the relative smoothness and
flatness of the grid surface.  I'm not sure how I'd set about
calculating such a thing; it would be considerably more complicated
than the bogotune run itself.  If one were to rewrite bogotune to do a
nonlinear least-squares fit of the surface instead of just scanning it
and looking for minima, stats would fall out of that; but we'd be back
to runs that take days to complete, as they used to do when bogotune
was an R or Perl script.

As a _very_ rough indication: if bogotune comes up with recommendations
for 0.01, 0.05, 0.1 and 0.2% false positives (fp), the suggestions are
probably good; if only three levels are given, they're still worth
trying; two levels, well, maybe; if only 0.2% is given, it should be
treated with caution; and if the only level reported is greater than
0.2%, the run is not trustworthy.  How do I know that?  Just from
experience; my estimate of reliability for any given run will vary
within those categories, as a gut feeling that takes the
above-mentioned factors (corpus size and surface shape) into account.
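
For anyone who wants that rule of thumb in mechanical form, here is a
minimal sketch.  It is not part of bogotune: the function name
rough_reliability and its input (a list of the fp percentages for which
a run produced recommendations) are hypothetical.  The handling of an
empty list, and of a single level below 0.2%, is my own assumption,
since the rule above does not cover those cases.  It only counts
levels; it cannot capture the gut-feeling adjustment for corpus size
and surface shape.

```python
def rough_reliability(fp_levels):
    """Map the fp levels (in percent) that a run reported to a rough verdict.

    Hypothetical helper; bogotune itself prints no such rating.
    """
    levels = sorted(set(fp_levels))  # ignore duplicates
    count = len(levels)

    if count >= 4:
        return "probably good"
    if count == 3:
        return "worth trying"
    if count == 2:
        return "maybe"
    if count == 1:
        # A lone level above 0.2% means the run is not trustworthy.
        # Assumption: a lone level at or below 0.2% is treated like
        # the "just 0.2%" case.
        if levels[0] > 0.2:
            return "not trustworthy"
        return "treat with caution"
    # Assumption: no recommendations at all is rated as untrustworthy.
    return "not trustworthy"


if __name__ == "__main__":
    for reported in ([0.01, 0.05, 0.1, 0.2], [0.05, 0.1, 0.2], [0.2], [0.5]):
        print(reported, "->", rough_reliability(reported))
```

Treat the output as a first filter only; a small or lumpy corpus can
still make a four-level run less reliable than the label suggests.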

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |