Incorrigible spam

Thu Apr 8 13:26:14 CEST 2004

On Thu, 08 Apr 2004 08:43:38 +0200
Boris 'pi' Piwinger wrote:

> Richard Kimber <rkimber at ntlworld.com> wrote:
> 
> >From time to time I get spam that is still scored as ham after I've
> >told bogofilter to relearn it as spam.
> 
> This seems to be an artifact of full training. However,
> besides the solutions already offered, those messages might
> be an indication that you did some false training in the
> past.

A large wordlist with lots of messages has a measure of "inertia".  With
such a list, each message is a very small percentage of the total.  This
can make it harder to change the status of a token from "hammish" to
"spammish".  You may be encountering that effect.

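To put rough numbers on the inertia effect, here is a small Python
sketch.  It uses a simplified per-token estimate (how often the token
appears in spam relative to how often it appears in ham), which is an
assumption for illustration and not bogofilter's exact calculation.
The function names and all counts are made up.

```python
def spam_ratio(spam_hits, ham_hits, spam_msgs, ham_msgs):
    """Simplified spamminess of one token: 0.0 is pure ham, 1.0 is pure spam."""
    p_spam = spam_hits / spam_msgs
    p_ham = ham_hits / ham_msgs
    return p_spam / (p_spam + p_ham)


def retrain_as_spam(spam_hits, ham_hits, spam_msgs, ham_msgs):
    """Move one message containing the token from the ham side to the spam side."""
    return spam_hits + 1, ham_hits - 1, spam_msgs + 1, ham_msgs - 1


# Hypothetical counts: (spam_hits, ham_hits, spam_msgs, ham_msgs).
# In both lists the token appears in 2% of the ham and in no spam.
small = (0, 2, 100, 100)
large = (0, 200, 10_000, 10_000)

for name, counts in (("small", small), ("large", large)):
    before = spam_ratio(*counts)
    after = spam_ratio(*retrain_as_spam(*counts))
    print(f"{name}: {before:.3f} -> {after:.3f}")
```

With these made-up counts, relearning a single message moves the token
to roughly the midpoint in the small list, but barely moves it in the
large one.
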
On the other hand, all my tests indicate that a larger database with
more tokens in it will do a better job of classifying messages.  Last
week I posted a test in which I scored my whole corpus (160,000
messages) and built a new wordlist with all messages having scores
neither 0 nor 1.  Using the new wordlist, I repeated the process a
second time.  Using the newer wordlist, I repeated the process a third
time, and so on.  After each repetition, I counted how many ham
scored as ham, as unsure, and as spam and also counted how many spam
scored as ham/unsure/spam.  Scoring with a big wordlist produced good
results with few unsures and few errors.  Because a big list leaves only
a small set of unsures, the wordlist built from them was small.  Scoring
with that small wordlist gave many more unsures and a larger number of
errors.  The pattern of "small wordlist, more errors" and "large
wordlist, fewer errors" was very pronounced.
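
The bookkeeping for one round of that test might look like the Python
sketch below.  It assumes the scores have already been collected as
(true label, score) pairs; the cutoff values and function names are
assumptions for illustration, not the settings used in the test.

```python
from collections import Counter

HAM_CUTOFF = 0.45   # assumed value: scores at or below this count as ham
SPAM_CUTOFF = 0.99  # assumed value: scores at or above this count as spam


def classify(score):
    """Map a score to a ham/unsure/spam verdict using the assumed cutoffs."""
    if score >= SPAM_CUTOFF:
        return "spam"
    if score <= HAM_CUTOFF:
        return "ham"
    return "unsure"


def tally(scored):
    """Count messages by (true label, verdict)."""
    return Counter((label, classify(score)) for label, score in scored)


def select_for_next_wordlist(scored):
    """Keep only the messages whose score is neither 0 nor 1."""
    return [(label, score) for label, score in scored if score not in (0.0, 1.0)]


# Made-up scores for six messages.
scored = [
    ("ham", 0.0), ("ham", 0.30), ("ham", 0.70),
    ("spam", 1.0), ("spam", 0.995), ("spam", 0.60),
]

for (label, verdict), count in sorted(tally(scored).items()):
    print(f"{label} scored as {verdict}: {count}")
print("messages kept for the next wordlist:", len(select_for_next_wordlist(scored)))
```
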

As pi says, a training error sometime in the past can also have a big
effect. Running "bogofilter -vvv < message" will list all the tokens and
their scores.  The FAQ describes the "-vvv" output.  In a case like
this, the last column ("+" or "-" depending on min_dev) is of particular
value: it shows which tokens were used to score the message.  Then look
for tokens that have been seen only once (or perhaps twice).  Check that
these tokens are correctly hammish or spammish.  If you have several
tokens with counts of 1 that are wrong, there's a good chance they came
from a single message used in a training error.
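
That check can be sketched in Python as below.  The sketch does not
parse real "-vvv" output; it assumes the rows have already been
extracted into a hypothetical TokenRow form, and that "+" in the last
column marks a token as used.  The field names and sample tokens are
made up.

```python
from typing import NamedTuple


class TokenRow(NamedTuple):
    """Hypothetical, already-parsed row; not bogofilter's own format."""
    token: str
    count: int    # how many times the token has been seen in training
    score: float  # per-token score, 0.0 hammish .. 1.0 spammish
    used: bool    # True if the last column showed "+" (assumed meaning)


def suspects(rows, max_count=2):
    """Tokens that were used for scoring but seen only once or twice."""
    return [r for r in rows if r.used and r.count <= max_count]


# Made-up rows for a spam message that still scores as ham.
rows = [
    TokenRow("meeting", 412, 0.03, True),
    TokenRow("xyzzy-offer", 1, 0.02, True),
    TokenRow("the", 9000, 0.50, False),
    TokenRow("refinance", 1, 0.01, True),
]

for r in suspects(rows):
    print(f"{r.token}: count={r.count} score={r.score:.2f}")
```

Here the two rarely seen tokens have hammish scores even though they
look like spam words, which is the kind of pattern a single mistrained
message would leave behind.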