tri-state transition [was: bogofilter-0.93.0 - Significant Changes]

David Relson relson at osagesoftware.com
Sun Nov 7 16:35:11 CET 2004


On Sun, 07 Nov 2004 12:33:50 +0000
Robin Bowes wrote:

> David Relson wrote:

...[snip]...

> I've just upgraded - looks like I was the first to download from 
> sourceforge!
> 
> In the end I went with the new tags and modified my maildrop script to
> look for "X-Bogosity: Spam".
> 
> I'm not sure that the "Unsure" status adds much to my situation. Let
> me discuss...
> 
> I run bogofilter in pass-through mode (-p) in a maildrop script, 
> filtering on the X-Bogosity header line. Spam (X-Bogosity: Yes|Spam)
> is put in the users' SPAM folder. Everything else (i.e. X-Bogosity: 
> No|Ham|Unsure) is delivered as normal.
> 
> I also use the -u option so messages are added to the wordlist 
> automatically.
> 
> If any spam is missed, the user can drop the message in a 
> SPAM/Undetected folder from where it is re-processed from a script run
> from cron which essentially runs the message through bogofilter with
> the -Ns options.
> 
> Any messages mistakenly classified as Spam can be dropped in a 
> Spam/Misdetected folder from where they are re-processed using 
> bogofilter with the -Sn options.
> 
> So, I'm not sure how the "Unsure" classification contributes to my 
> configuration. I may even consider reverting to the binary
> classification.
> 
> Any suggestions?
> 
> R.
> -- 
> http://robinbowes.com

Hi Robin,

More good questions!  

I used the "-p -u" combination for a year or so. I noticed two things:
(1) my wordlist kept on growing, growing, ... and (2) many messages
scored as 0.000000 or 1.000000.

A partial solution for (1) was to control the size (somewhat) by
periodically compacting the database (using bogoutil dump/load), though
a better solution would be to slow down the growth rate.

I theorized that (2) is a set of messages that are so clearly ham (or
spam) that little additional information is gained by training
bogofilter with them.  I added a "thresh_update=xxx" option to set an
update threshold so scores near 0 or 1 wouldn't auto-update and have
been running with "thresh_update=0.01" since early this year.  
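
As a rough illustration of the update threshold described above, here
is a minimal Python sketch. It is not bogofilter's code: the function
name should_autoupdate and the exact boundary handling (strict
comparisons) are assumptions made for the example. Only the idea, that
scores within thresh_update of 0 or 1 are skipped, comes from the
description.

```python
def should_autoupdate(score: float, thresh_update: float = 0.01) -> bool:
    """Return True if a message with this score would be auto-trained.

    Assumption: a score within thresh_update of 0.0 or 1.0 is treated as
    "already certain" and skipped; everything in between is trained.
    """
    return thresh_update < score < 1.0 - thresh_update


if __name__ == "__main__":
    # Scores at the extremes are skipped; a drifting 0.02 is trained again.
    for s in (0.0, 0.02, 0.5, 0.995, 1.0):
        action = "train" if should_autoupdate(s) else "skip"
        print(f"{s:.6f} -> {action}")
```

With thresh_update=0.01, a message scoring 0.000000 or 1.000000 adds
nothing to the wordlist, while one scoring 0.02 still does.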

One possible result is that scoring numbers differ more from 0 and 1,
i.e. messages start to score as 0.02, rather than 0.00.  When/if this
happens, bogofilter auto-corrects by training with the 0.02 message. The
net effect is that accuracy remains unchanged.

Using thresh_update has reduced wordlist.db's growth rate, as desired. 
It does not seem to have lessened accuracy -- for months now bogofilter
has been catching all spam, except for about 1 in 700.

Why do I mention all this?  Using tri-state classification lets me see
which messages bogofilter couldn't classify with certainty and lets me
manually classify them and be sure that bogofilter is properly trained
with them.
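
To make the tri-state idea concrete, here is a small Python sketch of
how a score might map to Spam, Ham, or Unsure. It is an illustration
only, not bogofilter's implementation. The parameter names echo the
ham_cutoff and spam_cutoff settings, but the values 0.45 and 0.99 and
the helper needs_manual_review are assumptions chosen for the example.

```python
def classify(score: float,
             ham_cutoff: float = 0.45,
             spam_cutoff: float = 0.99) -> str:
    """Map a score in [0, 1] to a tri-state label.

    Assumption: the cutoff values are illustrative, not recommended
    settings, and the comparisons are inclusive at both cutoffs.
    """
    if score >= spam_cutoff:
        return "Spam"
    if score <= ham_cutoff:
        return "Ham"
    return "Unsure"


def needs_manual_review(score: float) -> bool:
    """Hypothetical helper: Unsure messages are the ones to classify by hand."""
    return classify(score) == "Unsure"


if __name__ == "__main__":
    for s in (0.0, 0.30, 0.70, 1.0):
        print(f"{s:.2f} -> {classify(s)}")
```

With binary classification the 0.70 message would simply be delivered
as ham; with three states it stands out as one worth training on by
hand.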

HTH,

David


