tuning and archives

David Relson relson at osagesoftware.com
Tue Feb 24 02:07:40 CET 2004


On Mon, 23 Feb 2004 18:47:11 -0500
Tom Allison wrote:

> David Relson wrote:
> > On Mon, 23 Feb 2004 08:04:14 -0500
> >>I'm currently configured to '-u' all email.  It appears that this 
> >>imbalance in the histogram may give a visual reason why you might
> >>not want to do that all the time since it might augment the
> >>imbalance.
> > 
> > 
> > Hi Tom,
> > 
> > We've always recommended using your site's ham and spam.  Your
> > "experiment" confirms the wisdom of the recommendation ;-)
> > 
> 
> For the record, I'm running 100% on about 600 emails if you consider 
> Unsure to be a no_count result.  If you count Unsure, I'm 2 for 600.
> 
> Plugging in the archive spam records pretty much wrecked my filter.
> So the phrase YMMV really holds true here.
> 
> I will probably turn off the '-u' option as soon as I have enough 
> content to run bogotune from time to time.  But I don't want to have
> my human influence affect the statistics yet.  Bogotune says:
> The wordlist contains 1224 non-spam and 546 spam messages.
> Bogotune must be run with at least 2000 of each.
> 
> > At present I'm using "thresh_update=0.01" so messages that score
> > below 0.01 or above 0.99 don't go into the wordlist.  One can expect
> > that this will eventually result in messages scoring at 0.02 and
> > 0.98, but this should auto-correct.
> > 
> 
> So that is what it's for.
> If I set it to 0.00 will it assume every spam/ham for inclusion to the
> wordlist?  How low of a value will it accept (significant digits)?

No value means "use all".  Internally the strtod() library function is
used.  Likely the value can be as small as you want.  I'm sure 1e-100
would be allowed, though I have no idea why you'd want such a small
value.
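
[Editor's sketch] The thresh_update rule described above can be
illustrated roughly as follows.  This is not bogofilter's code: the
function name should_update is hypothetical, Python's float() stands in
for the C strtod() call, and the treatment of scores exactly on the
boundary is an assumption.

```python
def should_update(score, thresh_update=None):
    """Return True if a message with this score would go into the wordlist.

    Hypothetical helper, not part of bogofilter.  `score` is the spamicity
    in [0, 1]; `thresh_update` is the configured value as a string, or None
    when no value is set.
    """
    if thresh_update is None:
        # No value means "use all".
        return True
    # float() stands in for strtod(); both accept forms such as "1e-100".
    thresh = float(thresh_update)
    # Skip "obvious" messages: scores below thresh or above 1 - thresh.
    # Scores exactly on a boundary are kept here, which is an assumption.
    return thresh <= score <= 1.0 - thresh
```

With thresh_update "0.01", a message scoring 0.005 or 0.995 is skipped
and one scoring 0.5 is kept.  In this sketch a value of "0.00" keeps
everything, and with a value as small as "1e-100" the upper bound
1.0 - 1e-100 rounds to exactly 1.0 in double precision, so a threshold
that small has no effect on the spam side.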

> I'm uncertain about how continued reinforcement of tokens being what 
> they really are would cause their accuracy to degrade.  Is it that
> these tokens become too assertive as ham/spam at the expense of all
> other tokens?  And as a result our filtering becomes similar to just
> doing procmail's regex filtering on all things /viagra/ to go
> to /dev/null?

It's not continued reinforcement that's bad.  Adding tokens causes the
wordlist to grow in size.  Not adding tokens from "obvious" messages
slows the growth rate.
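
[Editor's sketch] A toy illustration of the growth-rate point.  The
messages, scores, and tokens below are invented for the example, and
wordlist_size is a hypothetical helper that models the wordlist as a
plain set of tokens rather than bogofilter's real database.

```python
def wordlist_size(messages, thresh_update=None):
    """Count distinct tokens stored after processing (score, tokens) pairs."""
    wordlist = set()
    for score, tokens in messages:
        obvious = (
            thresh_update is not None
            and not (thresh_update <= score <= 1.0 - thresh_update)
        )
        if obvious:
            # "Obvious" ham or spam: its tokens are not added.
            continue
        wordlist.update(tokens)
    return len(wordlist)


# Made-up sample data: (score, tokens in the message).
messages = [
    (0.001, {"meeting", "agenda", "minutes"}),  # obvious ham
    (0.999, {"viagra", "cheap", "pills"}),      # obvious spam
    (0.40, {"invoice", "agenda"}),
    (0.97, {"cheap", "offer"}),
]

size_all = wordlist_size(messages)             # every message is added
size_thresh = wordlist_size(messages, 0.01)    # obvious messages skipped
```

On this toy data the unfiltered wordlist ends up with 8 distinct tokens
and the thresholded one with 4, since the two obvious messages
contribute nothing.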



