New version

Tom Allison tallison at tacocat.net
Tue Mar 16 23:56:26 CET 2004


Greg Louis wrote:
 > On 20040316 (Tue) at 0628:15 -0600, Bill McClain wrote:
 >
 >>On Tue, 16 Mar 2004 06:54:48 -0500
 >>Tom Allison <tallison at tacocat.net> wrote:
 >>
 >>
 >>>Perhaps something to the effect of:
 >>>min_dev: you want to make sure your robx value is within the range of
 >>>0.50 +/- min_dev.  If you don't then all your new tokens will be
 >>>included in calculations and have strange results on your scores.
 >>
 >>I don't think that's generally true.
 >
 >
 > I agree.  There are situations -- a small training database is one of
 > them -- where it makes sense not to consider unknowns, and making sure
 > min_dev excludes them is a way to avoid swamping your valid tokens with
 > priors.  But some of us find that a very small min_dev works really
 > well (bogotune will tell you, if you have enough messages to run it).
 >

Yes, this is exactly my point.

I am not in a position to run bogotune, so I have to fiddle manually.
While the bogofilter.cf file gives me the means to fiddle to my heart's
delight, I do not have enough email to run bogotune or to use such
inherently risky settings as you have (as supported by other comments
on this list).

I can't possibly dispute the effectiveness of your settings.  However, I
think that as a general guideline it might be a good idea to keep robx
within the 0.5 +/- min_dev range, just as it's a good idea not to set
spam_cutoff to 0.40 right away.
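
To illustrate the guideline, here is a minimal Python sketch.  It is not
bogofilter's code.  It assumes that a never-seen token is assigned the
probability robx and that any token whose probability lies within
min_dev of 0.5 is left out of the score.  The function names are
hypothetical.

```python
def token_is_scored(prob, min_dev):
    """Return True if prob is at least min_dev away from 0.5.

    Assumption: tokens closer to 0.5 than min_dev are ignored in scoring.
    """
    return abs(prob - 0.5) >= min_dev


def unknown_tokens_are_scored(robx, min_dev):
    """Return True if never-seen tokens would take part in scoring.

    Assumption: an unknown token gets probability robx.
    """
    return token_is_scored(robx, min_dev)


if __name__ == "__main__":
    # robx inside 0.5 +/- min_dev: unknown tokens are excluded
    print(unknown_tokens_are_scored(robx=0.52, min_dev=0.1))
    # very small min_dev: unknown tokens are included
    print(unknown_tokens_are_scored(robx=0.52, min_dev=0.01))
```

With robx=0.52, a min_dev of 0.1 keeps unknown tokens out of the
calculation, while a min_dev of 0.01 lets them in.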

---

On a side note, IIRC most people have a vocabulary of <30,000 words,
with a common set of 10,000 in their daily language.  Considering that
my one-month-old wordlist exceeds this by almost 4x, it's no surprise
that 'bogoutil -H' reports:
hapaxes:  ham   11899 ( 8.35%), spam   52460 (36.83%)
    pure:  ham   24179 (16.98%), spam  107558 (75.52%)

My 'pure ham' approximates the vocabulary limit of a reasonably educated
person.  In my case there are probably a lot of tokens that aren't
linguistically significant (e.g. head:AntiVirus 35 110 20040310), but
there's a limit from a language perspective that you approach as you
collect ham.
The spam content, which uses random letters and misspellings for
variation, will far exceed any typical language on the planet.
Given that, setting robx within 0.5 +/- min_dev effectively negates all
the random tokens spammers use for "spin control" of their email.

Additionally, if you really wanted to get weird, I would run a spell
checker on the body of the email and reject any email with > 50% typos.
But that's not bogofilter's function.  That's for someone else to do.
But I bet it would work really, really well.
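
A rough Python sketch of that idea, under stated assumptions: the
"spell checker" is just a lookup in a set of known words (the
`dictionary` argument is hypothetical, and a real word list would have
to be supplied), words are runs of ASCII letters, and the 50% threshold
is the one suggested above.  It is untested against real mail.

```python
import re


def typo_ratio(body, dictionary):
    """Return the fraction of words in body that are not in dictionary."""
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", body)]
    if not words:
        return 0.0  # nothing to check, so nothing counts as a typo
    unknown = sum(1 for w in words if w not in dictionary)
    return unknown / len(words)


def should_reject(body, dictionary, threshold=0.5):
    """Return True if more than threshold of the words are unknown."""
    return typo_ratio(body, dictionary) > threshold


if __name__ == "__main__":
    # Tiny stand-in word list, for illustration only
    known = {"hello", "see", "you", "at", "lunch", "now"}
    print(should_reject("Hello, see you at lunch", known))
    print(should_reject("xqzt vbnw plmk now", known))
```

Names, jargon and header tokens would count as typos under this simple
lookup, so the threshold would likely need adjusting in practice.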




