religion

Wed Jan 22 21:37:51 CET 2003

Nick,

There is one factor, possibly important, that hasn't been mentioned in the 
'-u', "train on error", "ham/spam/unsure" discussions.  That factor is 
wordlist size.

When using '-u' all incoming tokens go into one wordlist or the 
other.  This also happens if you use another method to "train on all 
messages".  Obviously, doing this increases the number of tokens in the 
wordlists and their counts.  The numbers will increase faster with "train 
on all" than with "train on error".

Is this good or bad?  Not clear.

On the plus side, bogofilter will learn faster which words to associate 
with ham and which with spam.  Think of this as "guilt by 
association".  It's useful because the next message to be classified may 
use the "tainted" words as the decision makers.  This seems like a good thing.

On the bad side, you have wordlist size.  If a new kind of spam comes in, 
it may be classified as ham because that's what the tokens indicate.  So, 
"-S" is used to put in a wordlist correction.  However, a single correction 
like this may not be enough to significantly change the token 
probabilities.  Then, in comes a second spam message of the same kind.  It, 
too, is classified as ham and needs "-S" correction.  The problem here is 
that large wordlists with large token counts may take a while to "learn" 
about the new kind of spam.  Stated differently, a large wordlist has a 
type of momentum which may interfere with its ability to learn quickly, 
i.e. change direction.
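
To put rough numbers on the momentum idea, here is a minimal sketch in 
Python.  It is not bogofilter's actual calculation.  It assumes a token's 
"spamminess" is just spam_count / (spam_count + ham_count), and that each 
correction moves one occurrence of the token from the ham count to the 
spam count.  The function names and the 0.5 threshold are made up for 
illustration.

```python
def spam_ratio(spam_count, ham_count):
    """Naive spamminess of one token: fraction of its occurrences seen in spam.

    Assumption: a simplified stand-in, not bogofilter's real formula.
    """
    total = spam_count + ham_count
    if total == 0:
        return 0.5  # token never seen: treat as neutral
    return spam_count / total


def corrections_needed(ham_count, spam_count=0, threshold=0.5):
    """Count corrections until the token's ratio rises above the threshold.

    Assumption: each correction moves one occurrence of the token
    from the ham count to the spam count.
    """
    corrections = 0
    while ham_count > 0 and spam_ratio(spam_count, ham_count) <= threshold:
        ham_count -= 1
        spam_count += 1
        corrections += 1
    return corrections


if __name__ == "__main__":
    # Same token, small wordlist count vs. large wordlist count.
    for ham in (5, 500):
        print(ham, "ham occurrences ->", corrections_needed(ham), "corrections")
```

Under these assumptions a token seen 5 times in ham tips over after 3 
corrections, while one seen 500 times needs 251.  The real numbers will 
differ, but the cost of changing direction grows with the counts.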

So, to get back to your question:  "Is -u good or bad?"  The answer seems 
to be "You pays your money and you makes your choice."  Either way 
works.  Pick the one that seems best to you.

David