Will spam/ham counts in wordlist affect spamicity?

David Relson relson at osagesoftware.com
Thu Sep 18 01:23:49 CEST 2003


Hi Chris,

Not having any experience with the results of merging wordlists, I can't
predict the outcome.  However, I do have several observations.

First, to "add together" the lists, you can simply use bogoutil's dump
and load capabilities.  When loading, bogoutil adds the new count to any
existing count.  Thus, to "merge" three wordlists is simply

	bogoutil -d one.db | bogoutil -l merge.db
	bogoutil -d two.db | bogoutil -l merge.db
	bogoutil -d thr.db | bogoutil -l merge.db
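
To make the "adds the new count to any existing count" behaviour concrete,
here is a minimal sketch in Python.  It is not bogoutil's code; the function
name, the dict layout (token -> (spam count, ham count)) and the sample
tokens are all made up for illustration.

```python
from collections import defaultdict


def merge_wordlists(*wordlists):
    """Sum per-token (spam, ham) counts across several wordlists.

    Each wordlist is a hypothetical dict: token -> (spam_count, ham_count).
    """
    merged = defaultdict(lambda: [0, 0])
    for wordlist in wordlists:
        for token, (spam, ham) in wordlist.items():
            merged[token][0] += spam  # add to any existing spam count
            merged[token][1] += ham   # add to any existing ham count
    return {token: tuple(counts) for token, counts in merged.items()}


# Made-up sample data, not taken from any real wordlist.
one = {"free": (12, 0), "meeting": (1, 9)}
two = {"free": (7, 1), "invoice": (2, 5)}
merged = merge_wordlists(one, two)
```

A token present in both lists ends up with summed counts; a token present
in only one list keeps its counts unchanged.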

Second, rather than using perl to discard uninteresting tokens, you can
use min_dev to exclude tokens whose spamicity is close to 0.5.  In the
config file, you could use "min_dev=0.3" or on the command line you could
have "bogofilter -m 0.3".

If perl is your solution, be sure to normalize your counts before
discarding.  If the ham and spam message counts are very different (for
example a 3:1 ratio of ham to spam messages), then a token with ham and
spam counts of 33 and 11 is exactly neutral.  So, before comparing, it'd
be a good idea to divide the token's ham count by the ham message count
and do the same for the spam count.
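
A rough sketch of that normalization, in Python rather than perl.  The
function names and the message counts (3000 ham, 1000 spam) are made up
for illustration, and this is a simplified score, not bogofilter's actual
calculation.

```python
def token_score(ham_count, spam_count, ham_msgs, spam_msgs):
    """Rough spamicity in [0, 1] computed from per-message frequencies."""
    ham_freq = ham_count / ham_msgs      # normalize by ham message count
    spam_freq = spam_count / spam_msgs   # normalize by spam message count
    total = ham_freq + spam_freq
    if total == 0:
        return 0.5                       # unseen token: treat as neutral
    return spam_freq / total


def is_interesting(score, min_dev=0.3):
    """Keep a token only if its score is at least min_dev away from 0.5."""
    return abs(score - 0.5) >= min_dev


# Hypothetical corpus with a 3:1 ham-to-spam message ratio.
HAM_MSGS, SPAM_MSGS = 3000, 1000

neutral = token_score(33, 11, HAM_MSGS, SPAM_MSGS)  # raw counts differ 3:1
spammy = token_score(3, 11, HAM_MSGS, SPAM_MSGS)
```

With these numbers the 33/11 token scores 0.5 and would be discarded,
even though its raw counts look lopsided, while the 3/11 token scores
about 0.92 and would be kept.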

David



