[bogofilter] spamitarium & block_on_subnets results
tanderso at oac-design.com
Thu May 6 09:49:53 EDT 2004
From: <tallison at tacocat.net>
> Considering that I used the same configurations through all of the tests
> and training, the percentage of various scores may not be ideal when
> compared to other configurations. However, the important thing to note
> here is not the absolute value of the accuracy (or lack thereof) but the
> comparative differences between which one is better/worse.
The robx and min_dev play a role in the comparative differences. In your
configuration, if bogofilter has not seen a token before, then it will
assign it a value of 0.6. Normally we want such tokens to not play a role
in classification since we have no basis in experience on which to classify
them. However, your min_dev range is 0.45 to 0.55, so you're telling
bogofilter to classify every single new token as spammy. This is pushing
all of your scores toward the spam direction pretty drastically. And since
spamitarium reduces certain unwanted redundancy and introduces new helo-,
ASN, rDNS, and IP tokens, the spamitarium results likely have more hapaxes.
This is the core purpose of spamitarium, as when these tokens are seen
again, they help filter the email appropriately. However, when these tokens
are unknowns/hapaxes (being classified as robx for the first time), your
config values are causing them to push all emails (notably hams) into the
spam direction. This is not ideal behavior. A better test of spamitarium
(and a better usage of bogofilter, IMHO) is to keep your robx value within
the min_dev exclusion range. Otherwise, you're biasing the test against
spamitarium's core purpose. This same phenomenon is, I believe, responsible
for your poor results with block-on-subnets, since there are again more
unknown tokens under those conditions.
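To make the effect concrete, here is a tiny sketch of my own (not bogofilter's actual Fisher-based combining, and the token probabilities are invented) showing how a robx that falls outside the min_dev window lets never-seen tokens drag a message spamward:

```python
def score(tokens, known, robx, min_dev=0.05):
    """Toy spamicity: average the per-token probabilities that survive
    the min_dev filter. Unknown tokens receive robx. Real bogofilter
    combines tokens with Fisher's method, but a plain mean is enough
    to show the direction of the bias."""
    used = [p for p in (known.get(t, robx) for t in tokens)
            if abs(p - 0.5) >= min_dev]   # keep only "deviating" scores
    return sum(used) / len(used) if used else 0.5

# Hypothetical wordlist probabilities and a hammish message containing
# two never-seen tokens ("frobnicate", "quux").
known = {"meeting": 0.2, "agenda": 0.3, "viagra": 0.97}
msg = ["meeting", "agenda", "frobnicate", "quux", "viagra"]

# robx = 0.6 lies outside the (0.45, 0.55) window, so both unknowns
# count as spammy evidence and pull the score upward.
biased = score(msg, known, robx=0.6)

# robx = 0.52 lies inside the window, so the unknowns are ignored.
neutral = score(msg, known, robx=0.52)

print(biased, neutral)   # biased > neutral
```

With the settings from your tests, any ham rich in novel vocabulary accumulates 0.6-valued evidence and drifts toward the spam cutoff; keeping robx inside the min_dev window leaves unknowns neutral.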
> scored as Unsure than Ham scored as Yes. I consider Unsure scores to be a
> minor error and false scores to be very major errors.
I agree with this. However, your settings do not follow your philosophy.
Right now, if you get a ham email from someone discussing a topic with lots
of words unknown to your wordlist, and just a few spammy words thrown in,
then you'll probably classify it as spam due to the bias you've set up with
your robx and min_dev values. I would give such an email the benefit of the
doubt by not deciding on the unknown words until I've seen them more than
once.
> ran, the ideal argument would be to run all the training based on a
> configuration file that was generated by bogotune exclusively. But I'm
> not convinced that this is going to make much of a difference in the end. I
> believe we are looking for statistically significant "shifts" in the data
> more than we are looking for specific target values of attribute/variable
> data. This perspective removes a dependency on the clause "YMMV".
You don't need to run bogotune. Just move your robx closer to 0.5,
preferably favoring ham slightly, and make sure its deviation from 0.5 is
smaller than min_dev, so that unknown tokens fall inside the exclusion
window. This is important to properly test spamitarium. The cutoffs don't
really matter from a comparative perspective, but classifying unknowns as
spam is detrimental.
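As a concrete illustration (the exact numbers are my own, not tuned values), that suggestion might look like this in bogofilter.cf:

```
# Illustrative values only -- pick your own, but keep |robx - 0.5| < min_dev
robx    = 0.48    # unknown tokens lean very slightly hammish
min_dev = 0.1     # ignore token scores in (0.4, 0.6), which includes robx
```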