auto-update in 0.16.2

Sat Jan 17 14:10:54 CET 2004

On Sat, 17 Jan 2004, David Relson wrote:

> The underlying principle of auto-update ("-u") is that bogofilter _can_
> expand its ham and spam database.  Having done this for over a year, I
> recently noticed that most of the messages are scored 0 or 1 (to several
> significant digits).
> 
> Thinking of these messages as "very, very easy to classify", I'm
> guessing that they offer very little unknown information, which makes
> them of little value in training. [...]

The fallacy is that a message's spamicity tells us nothing about
entropy (the "surprise factor" of a symbol in a message: predictable
symbols have little to no entropy, unique symbols have high entropy).

Unknown symbols, which by nature have high entropy with respect to the
current wordlist.db and might be more interesting, will score at a
spamicity of ROBX and be ignored by means of min_dev.
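
To make that concrete, here is a rough sketch of a Robinson-style token
score, f(w) = (s*x + n*p(w)) / (s + n), together with a min_dev cutoff.
It is not bogofilter's actual code: the function names are made up for
illustration, and the values robx=0.415, robs=0.01 and min_dev=0.1 are
assumed example settings, not necessarily what your installation uses.

```python
def token_score(spam_count, ham_count, n_spam_msgs, n_ham_msgs,
                robx=0.415, robs=0.01):
    """Robinson-style smoothed score f(w) = (s*x + n*p) / (s + n).

    robx and robs are assumed example values; n_spam_msgs and
    n_ham_msgs are assumed to be positive.
    """
    n = spam_count + ham_count
    if n == 0:
        # Token never seen before: the formula collapses to robx.
        return robx
    spam_rate = spam_count / n_spam_msgs
    ham_rate = ham_count / n_ham_msgs
    p = spam_rate / (spam_rate + ham_rate)
    return (robs * robx + n * p) / (robs + n)


def is_scored(score, min_dev=0.1):
    """A token only counts if it deviates from 0.5 by at least min_dev."""
    return abs(score - 0.5) >= min_dev


unknown = token_score(0, 0, n_spam_msgs=100, n_ham_msgs=100)
spammy = token_score(50, 0, n_spam_msgs=100, n_ham_msgs=100)
print(unknown, is_scored(unknown))
print(spammy, is_scored(spammy))
```

With these assumed settings the unknown token sits within min_dev of
0.5 and drops out of the calculation, while the well-known spammy token
is kept.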

What this update-limiting parameter does is assume that clear-cut
ham/spam should not be used to teach the data base which other tokens
are hammish or spammish, and that we'd rather train on the stuff that
is on the verge of being unsure.

OTOH, if a mail to be trained on has few unknown tokens, training
doesn't cost much, where cost is measured in tokens newly added to the
data base with spamcount 1 and hamcount 0 or the other way around.
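
A minimal sketch of that cost measure, assuming the wordlist is modelled
as a plain dict mapping token to (spamcount, hamcount). The dict layout
and the name training_cost are hypothetical, for illustration only, and
say nothing about how wordlist.db is really stored.

```python
def training_cost(message_tokens, wordlist):
    """Count distinct tokens that training would add as new entries."""
    return len(set(message_tokens) - set(wordlist))


# Hypothetical wordlist: token -> (spamcount, hamcount)
wordlist = {"viagra": (10, 0), "meeting": (0, 7)}
tokens = ["viagra", "meeting", "meeting", "zxq"]
print(training_cost(tokens, wordlist))
```

Only "zxq" is new in this example, so registering the message would
grow the data base by a single entry.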