tuning and archives
Tom Allison
tallison at tacocat.net
Mon Feb 23 14:04:14 CET 2004
I tried downloading some of the archives of spam.
It didn't go very well. I was able to run my filters OK, but when it was
done the spam:ham ratio in tokens was > 38:1, even though the number of
spam emails actually processed was not more than 2:1 of my ham email count.
bogofilter was noticeably dumber after this and needed many corrections.
So, I threw everything out and rebuilt by training against my ham/spam
archives.
Works great, but I've noticed again that my ham count << spam count in
tokens. Not as much as when it was broken, but I will be interested to
see if the trend continues. My database is about one week old.
I'm worried that if the imbalance gets too great, it will start to drop
in accuracy.
I'm currently configured to '-u' all email. It appears that this
imbalance in the histogram may give a visual reason why you might not
want to do that all the time, since it might augment the imbalance.
My histogram now looks like:
score   count    pct  histogram
0.00    12442  12.55  ########
0.05       56   0.06  #
0.10       91   0.09  #
0.15      116   0.12  #
0.20      156   0.16  #
0.25      174   0.18  #
0.30      213   0.21  #
0.35      279   0.28  #
0.40      166   0.17  #
0.45      378   0.38  #
0.50      200   0.20  #
0.55      306   0.31  #
0.60      139   0.14  #
0.65      711   0.72  #
0.70      342   0.34  #
0.75      605   0.61  #
0.80      224   0.23  #
0.85      662   0.67  #
0.90      473   0.48  #
0.95    81409  82.11  ################################################
tot     99142
hapaxes: ham   6246 ( 6.30%), spam  31199 (31.47%)
   pure: ham  12398 (12.51%), spam  81215 (81.92%)
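
As a rough check on the imbalance, the figures above can be turned into
spam:ham ratios. This is only a sketch in Python, not anything bogofilter
provides: the variable and function names are made up for illustration,
and the numbers are copied by hand from the histogram output above.

```python
# Bin counts copied from the histogram above (scores 0.00 to 0.95, step 0.05).
counts = [12442, 56, 91, 116, 156, 174, 213, 279, 166, 378,
          200, 306, 139, 711, 342, 605, 224, 662, 473, 81409]

# Summary figures copied from the "hapaxes" and "pure" lines.
hapax_ham, hapax_spam = 6246, 31199
pure_ham, pure_spam = 12398, 81215


def ratio(spam, ham):
    """Return the spam:ham ratio as a single number (spam per one ham)."""
    return spam / ham


total = sum(counts)                       # should agree with the "tot" line
middle = total - counts[0] - counts[-1]   # tokens outside the two end bins

print(f"total tokens:        {total}")
print(f"middle-bin tokens:   {middle} ({100 * middle / total:.2f}%)")
print(f"pure spam:ham:       {ratio(pure_spam, pure_ham):.2f}:1")
print(f"hapax spam:ham:      {ratio(hapax_spam, hapax_ham):.2f}:1")
```

Re-running this against a fresh histogram every few days would show
whether the ratios keep growing while '-u' is in use.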