tri-state transition [was: bogofilter-0.93.0 - Significant Changes]

Greg Louis glouis at dynamicro.on.ca
Mon Nov 8 00:55:29 CET 2004


On 20041107 (Sun) at 1035:11 -0500, David Relson wrote:
> > 
> > So, I'm not sure how the "Unsure" classification contributes to my 
> > configuration. I may even consider reverting to the binary
> > classification.

> More good questions!  
> 
> I used the "-p -u" combination for a year or so. I noticed two things:
> (1) my wordlist kept on growing, growing, ... and (2) many messages
> scored as 0.000000 or 1.000000.

FWIW, appropriate values of the ESFs (effective size factors) have the
curious effect of moving the peaks of spam and nonspam away from 1 and
0 _and_ trimming their tails so that overlap is reduced.  The latter
allows one to set a spam cutoff such that false positives can be all
but eliminated (one in 20,000 is achievable) while keeping false
negatives to a reasonable minimum (in my case, with one database for
about 80 users, 0.6% or thereabouts).
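
To make the cutoff trade-off concrete, here is a minimal Python sketch.
It is not bogofilter code: the function name is hypothetical and the
scores are invented for illustration.  It only shows how, once the two
score distributions overlap less, a single spam cutoff can give very
few false positives at the cost of a small number of false negatives.

```python
def error_rates(ham_scores, spam_scores, spam_cutoff):
    """Return (false_positive_rate, false_negative_rate) for a cutoff.

    A message counts as spam when its score is >= spam_cutoff.
    This is an illustrative helper, not part of bogofilter.
    """
    # Nonspam scored at or above the cutoff is a false positive.
    false_pos = sum(1 for s in ham_scores if s >= spam_cutoff)
    # Spam scored below the cutoff is a false negative.
    false_neg = sum(1 for s in spam_scores if s < spam_cutoff)
    return false_pos / len(ham_scores), false_neg / len(spam_scores)


# Invented scores: the nonspam tail ends below the cutoff,
# while one spam message falls short of it.
ham = [0.02, 0.10, 0.15, 0.30, 0.45]
spam = [0.40, 0.70, 0.85, 0.90, 0.95]

fp_rate, fn_rate = error_rates(ham, spam, spam_cutoff=0.5)
print(fp_rate, fn_rate)
```

With real data one would feed in scores from a large, correctly
classified test corpus and pick the cutoff that meets the desired
false-positive target.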

> I theorized that (2) is a set of messages that are so clearly ham (or
> spam) that little additional information is gained by training
> bogofilter with them.  I added a "thresh_update=xxx" option to set an
> update threshold so scores near 0 or 1 wouldn't auto-update and have
> been running with "thresh_update=0.01" since early this year.  

IMHO this was a really good idea.  In fact, it would be worthwhile to
test it extensively.  David has seen graphs of score distributions with
ESF that show the characteristics I mentioned above, and that make it
clear that ESF would likely break the thresh_update option or at least
require its value to be significantly increased.  Thus the questions to
be tested are: (1) is thresh_update more effective than ESF if we
consider them mutually exclusive, and (2) if we allow them to coexist,
can we obtain a further improvement in accuracy?  I kinda shudder at
the thought of adding yet another degree of freedom to bogotune, but it
might be worth it if one has large message volume and lots of free CPU
cycles.
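
For anyone following along, the update-threshold idea can be sketched
in a few lines of Python.  This is only my reading of the description
above, not bogofilter's actual implementation; the function name and
the handling of scores exactly at the boundary are assumptions.

```python
def should_auto_update(score, thresh_update=0.01):
    """Decide whether an auto-classified message should be registered.

    Scores within thresh_update of 0.0 or 1.0 are treated as so
    clear-cut that training on them adds little, so they are skipped.
    Illustrative sketch only; boundary handling is an assumption.
    """
    return thresh_update <= score <= 1.0 - thresh_update


for score in (0.000001, 0.005, 0.5, 0.97, 0.999999):
    action = "update" if should_auto_update(score) else "skip"
    print(f"{score:.6f} -> {action}")
```

If ESF pulls the peaks away from 0 and 1, fewer scores land inside
that narrow band, which is why the threshold would probably need to be
raised for the option to keep having any effect.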

> Why do I mention all this?  Using tri-state classification lets me see
> which messages bogofilter couldn't classify with certainty and lets me
> manually classify them and be sure that bogofilter is properly trained
> with them.

Which is, of course, exactly what's desired.
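
As a rough illustration of the tri-state decision, here is a short
Python sketch.  The function is hypothetical and the cutoff values are
example numbers only, not a recommendation; the point is simply that
anything between the two cutoffs is reported as Unsure and left for
manual classification.

```python
def classify(score, ham_cutoff=0.45, spam_cutoff=0.99):
    """Map a score in [0, 1] to "Spam", "Ham" or "Unsure".

    Illustrative sketch; the cutoff values are example numbers and
    the comparison operators at the boundaries are assumptions.
    """
    if score >= spam_cutoff:
        return "Spam"
    if score <= ham_cutoff:
        return "Ham"
    # Between the cutoffs: set aside for a human to classify and train.
    return "Unsure"


for score in (0.02, 0.60, 0.995):
    print(score, classify(score))
```

The Unsure messages are the ones worth training on by hand, since
they are the ones the filter could not classify with certainty.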

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |


