tuning and archives

Tom Allison tallison at tacocat.net
Mon Feb 23 14:04:14 CET 2004


I tried downloading some of the archives of spam.
It didn't go very well.  My filters ran OK, but when it was done the 
spam:ham token ratio was > 38:1, even though the number of actual emails 
processed was no more than 2:1 against my ham email count.

bogofilter was noticeably dumber after this and needed many corrections.
So, I threw everything out and rebuilt by training against my ham/spam 
archives.

Works great, but I've noticed again that my ham count << spam count in 
tokens.  Not as much as when it was broken, but I will be interested to 
see if the trend continues.  My database is about one week old.

I'm worried that if the imbalance gets too great, it will start to drop 
in accuracy.

I'm currently configured to '-u' all email.  The imbalance in the 
histogram may give a visual reason why you might not want to do that 
all the time, since it might augment the imbalance.

My histogram now looks like:
score   count  pct  histogram
0.00    12442 12.55 ########
0.05       56  0.06 #
0.10       91  0.09 #
0.15      116  0.12 #
0.20      156  0.16 #
0.25      174  0.18 #
0.30      213  0.21 #
0.35      279  0.28 #
0.40      166  0.17 #
0.45      378  0.38 #
0.50      200  0.20 #
0.55      306  0.31 #
0.60      139  0.14 #
0.65      711  0.72 #
0.70      342  0.34 #
0.75      605  0.61 #
0.80      224  0.23 #
0.85      662  0.67 #
0.90      473  0.48 #
0.95    81409 82.11 ################################################
tot     99142
hapaxes:  ham    6246 ( 6.30%), spam   31199 (31.47%)
    pure:  ham   12398 (12.51%), spam   81215 (81.92%)
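
To put a number on the imbalance, the two summary lines can be turned into 
spam:ham ratios.  The snippet below is only a rough sketch: the helper name 
parse_summary is made up for illustration, and the text it parses is simply 
the summary pasted from the histogram output above, not something read from 
bogofilter itself.

```python
import re

# Summary lines pasted from the histogram output above (not read from bogofilter).
SUMMARY = """\
tot     99142
hapaxes:  ham    6246 ( 6.30%), spam   31199 (31.47%)
    pure:  ham   12398 (12.51%), spam   81215 (81.92%)
"""


def parse_summary(text):
    """Hypothetical helper: return {label: (ham_count, spam_count)}.

    Only lines of the form 'label:  ham N (...), spam M (...)' are matched,
    so the 'tot' line is skipped.
    """
    pattern = re.compile(
        r"^\s*(\w+):\s+ham\s+(\d+).*?spam\s+(\d+)", re.MULTILINE
    )
    return {
        m.group(1): (int(m.group(2)), int(m.group(3)))
        for m in pattern.finditer(text)
    }


counts = parse_summary(SUMMARY)
for label, (ham, spam) in counts.items():
    # Ratio of spam tokens to ham tokens for this category.
    print(f"{label}: spam:ham = {spam / ham:.2f}:1")
```

On these figures the hapaxes come out at roughly 5:1 and the pure tokens at 
roughly 6.5:1, so well below the 38:1 seen after the archive experiment.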




