tuning and archives

Tue Feb 24 00:47:11 CET 2004

David Relson wrote:
> On Mon, 23 Feb 2004 08:04:14 -0500
>>I'm currently configured to '-u' all email.  It appears that this 
>>imbalance in the histogram may give a visual reason why you might not
>>want to do that all the time, since it might augment the imbalance.
> 
> 
> Hi Tom,
> 
> We've always recommended using your site's ham and spam.  Your
> "experiment" confirms the wisdom of the recommendation ;-)
> 

For the record, I'm running 100% on about 600 emails if you consider 
Unsure to be a no_count result.  If you count Unsure, I'm 2 for 600.

Plugging in the archive spam records pretty much wrecked my filter.
So the phrase YMMV really holds true here.

I will probably turn off the '-u' option as soon as I have enough 
content to run bogotune from time to time.  But I don't want my own 
human influence to affect the statistics yet.  For now, bogotune says:
The wordlist contains 1224 non-spam and 546 spam messages.
Bogotune must be run with at least 2000 of each.

> At present I'm using "thresh_update=0.01" so messages that score below
> 0.01 or above 0.99 don't go into the wordlist.  One can expect that this
> will eventually result in messages scoring at 0.02 and 0.98, but this
> should auto-correct.
> 

So that is what it's for.
If I set it to 0.00, will it consider every spam/ham message for 
inclusion in the wordlist?  How small a value will it accept (how many 
significant digits)?
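
To check that I'm reading the description correctly, here is a minimal
sketch in Python of the rule as I understand it.  This is only my
illustration of the quoted explanation, not bogofilter's actual code; the
function name should_update is hypothetical, and I'm assuming the
comparisons are strict.

```python
def should_update(score, thresh_update=0.01):
    """Return True if a message with this score would be added to the wordlist.

    Assumption: messages scoring strictly below thresh_update or strictly
    above 1 - thresh_update are considered "already certain" and skipped.
    """
    if score < thresh_update:
        return False  # confidently ham, not registered
    if score > 1.0 - thresh_update:
        return False  # confidently spam, not registered
    return True       # everything in between is registered


if __name__ == "__main__":
    for s in (0.005, 0.02, 0.5, 0.98, 0.995):
        print(s, should_update(s))
```

Under this reading, a threshold of 0.00 would skip nothing, since no score
is below 0 or above 1.  Whether bogofilter really behaves that way is
exactly what I'm asking.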

I don't understand how continued reinforcement of tokens as what they 
really are would cause accuracy to degrade.  Is it that these tokens 
become too strongly weighted as ham/spam at the expense of all other 
tokens, so that our filtering ends up resembling a procmail regex 
filter that sends anything matching /viagra/ to /dev/null?
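
To make the question concrete, here is a rough Python sketch of the
token smoothing formula from Gary Robinson's method, f(w) = (s*x + n*p) /
(s + n), where n is the number of times the token has been seen and p is
its raw spam probability.  The values s=1.0 and x=0.5 and the function
name robinson_f are illustrative assumptions, not my actual settings, and
I have not checked how closely bogofilter's code follows this.

```python
def robinson_f(p, n, s=1.0, x=0.5):
    """Smoothed spam probability for one token.

    p: raw spam probability of the token (0..1)
    n: number of messages the token has been seen in
    s: strength given to the prior (assumed value)
    x: prior probability for an unseen token (assumed value)
    """
    return (s * x + n * p) / (s + n)


if __name__ == "__main__":
    # A token seen only in spam (p = 1.0), registered more and more often.
    for n in (0, 1, 10, 100, 1000):
        print(n, robinson_f(1.0, n))
```

The more often such a token is registered, the closer its value moves
toward 1.0, which is the "too assertive" effect I'm wondering about.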