tuning and archives

Mon Feb 23 14:25:43 CET 2004

On Mon, 23 Feb 2004 08:04:14 -0500
Tom Allison wrote:

> I tried downloading some of the archives of spam.
> It didn't go very well.  While I was able to run my filters OK, the
> spam:ham token ratio was > 38:1 when it was done.  But the number of actual
> emails processed was not more than 2:1 of my ham email count.
> 
> bogofilter was noticeably dumber after this.  Many corrections.
> So, I threw everything out and rebuilt by training against my ham/spam
> archives.
> 
> Works great, but I've noticed again that my ham count << spam count in
> tokens.  Not as much as when it was broken, but I will be interested
> to see if the trend continues.  My database is about one week old.
> 
> I'm worried that if the imbalance gets too great, it will start to
> drop in accuracy.
> 
> I'm currently configured to '-u' all email.  It appears that this
> imbalance in the histogram may give a visual reason why you might not
> want to do that all the time, since it might augment the imbalance.

Hi Tom,

We've always recommended using your site's ham and spam.  Your
"experiment" confirms the wisdom of the recommendation ;-)

A while ago I noticed that most of my incoming ham and spam were rated
at 0 and 1.0 respectively.  Having "-u" update the wordlist with those
"easy" messages might not be useful.  I added a "thresh_update" config
option so that auto-update could be suppressed for really high and low
scores.  

At present I'm using "thresh_update=0.01" so messages that score below
0.01 or above 0.99 don't go into the wordlist.  One can expect that this
will eventually result in messages scoring at 0.02 and 0.98, but this
should auto-correct.
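
The rule described above can be sketched roughly as follows.  This is
only an illustration in Python, not bogofilter's actual code (which is
written in C); the function name should_auto_update is made up for the
example, and the handling of scores exactly at the boundaries is an
assumption.

```python
def should_auto_update(score: float, thresh_update: float = 0.01) -> bool:
    """Decide whether an auto-update ('-u') should register this message.

    Hypothetical helper: messages scoring below thresh_update or above
    1 - thresh_update are treated as "easy" and left out of the wordlist.
    """
    if not 0.0 <= thresh_update <= 0.5:
        raise ValueError("thresh_update is expected to lie in [0, 0.5]")
    # Only messages in the middle band are added to the wordlist.
    return thresh_update <= score <= 1.0 - thresh_update


if __name__ == "__main__":
    # A few sample scores, from clearly ham to clearly spam.
    for score in (0.0, 0.005, 0.02, 0.5, 0.98, 0.999, 1.0):
        action = "update" if should_auto_update(score) else "skip"
        print(f"{score:.3f} -> {action}")
```

With thresh_update=0.01, messages near 0.02 or 0.98 would still be
registered while those at 0 or 1.0 would be skipped.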

Enjoy!

David