New version
Tom Allison
tallison at tacocat.net
Tue Mar 16 23:56:26 CET 2004
Greg Louis wrote:
> On 20040316 (Tue) at 0628:15 -0600, Bill McClain wrote:
>
>>On Tue, 16 Mar 2004 06:54:48 -0500
>>Tom Allison <tallison at tacocat.net> wrote:
>>
>>
>>>Perhaps something to the effect of:
>>>min_dev: you want to make sure your robx value is within the range of
>>>0.50 +/- min_dev. If you don't then all your new tokens will be
>>>included in calculations and have strange results on your scores.
>>
>>I don't think that's generally true.
>
>
> I agree. There are situations -- a small training database is one of
> them -- where it makes sense not to consider unknowns, and making sure
> min_dev excludes them is a way to avoid swamping your valid tokens with
> priors. But some of us find that a very small min_dev works really
> well (bogotune will tell you, if you have enough messages to run it).
>
Yes, this is exactly my point.
I am not in a position to run bogotune, so I have to fiddle manually.
While the bogofilter.cf file gives me the means to fiddle to my
heart's delight, I do not have enough emails to run bogotune or to
justify settings as inherently risky as yours (as supported by other
comments on this list).
I can't possibly dispute the effectiveness of your settings. However, I
think that as a general guideline it might be a good idea to keep robx
within the 0.5 +/- min_dev range, just as it's a good idea not to set
spam_cutoff to 0.40 right away.
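
A rough sketch of why the guideline matters, assuming a token is scored
with Robinson's f(w) = (robs*robx + n*p) / (robs + n) and that tokens
scoring closer than min_dev to 0.5 are left out of the calculation. The
function names and the numbers are made up for illustration; this is not
bogofilter's code, and the exact boundary comparison is an assumption.

```python
def robinson_fw(p, n, robx, robs):
    """Smoothed token score: the prior robx is weighted by robs,
    the observed spam probability p by the number of sightings n."""
    return (robs * robx + n * p) / (robs + n)


def is_counted(score, min_dev):
    """Assumed rule: a token counts only if its score is at least
    min_dev away from the neutral value 0.5."""
    return abs(score - 0.5) >= min_dev


# Illustrative values only, not recommended settings.
robx, robs = 0.52, 0.01

# A token never seen before has n = 0, so its score collapses to robx.
unknown = robinson_fw(p=0.0, n=0, robx=robx, robs=robs)

# robx is 0.02 away from 0.5: a min_dev of 0.1 drops the unknown token,
# while a very small min_dev of 0.01 lets it into the calculation.
ignored_with_wide_min_dev = not is_counted(unknown, 0.1)
counted_with_tiny_min_dev = is_counted(unknown, 0.01)
```

With robx inside 0.5 +/- min_dev, every never-seen token drops out; with
a tiny min_dev, each one pulls the message score toward robx.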
---
On a side note, IIRC most people have a vocabulary of <30,000 words, with
a common set of 10,000 in their daily language. Considering that my
one-month-old wordlist exceeds this by almost 4x, it's no surprise that
(according to 'bogoutil -H')
hapaxes: ham 11899 ( 8.35%), spam 52460 (36.83%)
pure: ham 24179 (16.98%), spam 107558 (75.52%)
My 'pure ham' approximates the vocabulary limit of a reasonably educated
person. In my case there are probably a lot of tokens that aren't
linguistically significant (e.g. head:AntiVirus 35 110 20040310), but
there's a limit from a language perspective that you approach as you
collect ham.
The spam content, using random letters and misspellings for variation,
will far exceed any typical language on the planet.
Given that, setting robx within 0.5 +/- min_dev effectively negates all
the random tokens they use for "spin control" of their email.
Additionally, if you really wanted to get weird, I would run a spell
checker on the body of the email and reject any email that is > 50%
typos. But that's not bogofilter's function. That's for someone else to
do. But I bet it would work really, really well.
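
Something along these lines, as an untested idea rather than a real
tool: the word list, the function names, and the way words are split
are all assumptions, and a real spell checker would need a proper
dictionary.

```python
import re


def typo_ratio(body, known_words):
    """Fraction of words in the body that are not in known_words."""
    words = re.findall(r"[a-z0-9']+", body.lower())
    if not words:
        return 0.0
    unknown = sum(1 for w in words if w not in known_words)
    return unknown / len(words)


def too_many_typos(body, known_words, threshold=0.5):
    """Reject when more than `threshold` of the words are unrecognized."""
    return typo_ratio(body, known_words) > threshold


# Tiny stand-in dictionary, for illustration only.
known = {"the", "meeting", "is", "at", "noon"}

clean = too_many_typos("The meeting is at noon", known)
junk = too_many_typos("v1agra xqzt ch3ap the", known)
```

Here the first message has no unknown words and passes, while the
second has three unknown words out of four and would be rejected.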