auto-update in 0.16.2

David Relson relson at osagesoftware.com
Sat Jan 17 13:38:25 CET 2004


On Sat, 17 Jan 2004 09:32:12 +0100
Andreas Pardeike <andreas at pardeike.net> wrote:

> On 2004-01-17, at 02.23, David Relson wrote:
> 
> > <snip> Over 90%
> > of them were "obvious" ham or spam, i.e. ham with scores < 0.01 or
> > spam with scores > 0.99.  Since the messages were so easily
> > categorized, it seems that there's little value in using them for
> > training.  Introducing a config file option "thresh_update=0.01"
> > and a corresponding command line option "-u 0.01" seemed the
> > obvious way of dealing with this.  Cutting the number of wordlist
> > updates has the dual benefits of making bogofilter faster and
> > slowing the growth of the database.
> 
> Being new to bogofilter but somewhat familiar with statistical
> systems, I wonder if it's a good strategy not to train on easily
> detected messages.
> 
> Isn't it the case that a new spam message containing a few very
> well-known words (thus getting a high spam score) is in fact good
> training material for all the other details it contains?  This would
> allow bogofilter to better catch variants of spamming techniques.
> 
> Or am I totally off here?
> 
> Regards,
> Andreas Pardeike

Hi Andreas,

The "-u value" change has been reverted due to the side effect of
breaking existing scripts.  Config file option "thresh_update=0.01" will
remain for those who want to use it.
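
To spell the rule out, here is a minimal Python sketch of the check as
described in the quoted text (ham below 0.01 and spam above 0.99 are not
registered).  The function name should_register and its parameters are
made up for illustration; this is not bogofilter's actual code, which is
written in C and may treat the exact boundaries differently.

```python
def should_register(score: float, thresh_update: float = 0.01) -> bool:
    """Decide whether an auto-updated message goes into the wordlist.

    Assumption (from the description in this thread): a message is
    skipped when its score lies within thresh_update of 0.0 (obvious
    ham) or of 1.0 (obvious spam); everything in between is registered.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    return thresh_update <= score <= 1.0 - thresh_update


if __name__ == "__main__":
    # Hypothetical scores, chosen only to show both outcomes.
    for score in (0.001, 0.5, 0.98, 0.995):
        action = "register" if should_register(score) else "skip"
        print(f"score {score}: {action}")
```

With thresh_update=0.01, a score of 0.995 is skipped while 0.98 is
registered.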

The underlying principle of auto-update ("-u") is that bogofilter _can_
expand its ham and spam database.  Having run it this way for over a
year, I recently noticed that most messages score 0 or 1 (to several
decimal places).

Thinking of these messages as "very, very easy to classify", I'm
guessing that they offer very little new information, which makes
them of little value in training.  I've been using "thresh_update=0.01"
for the last two days and the percentage of messages not being
registered is quite high.  Here are the counts for total messages
received and number of registrations:

      Thursday  Friday
tot        699     715
reg         18      12

As you can see, about 97% to 98% of the messages score within 0.01 of 0
or 1 and so are not registered.  Over time, as spam changes, I expect
there will be fewer messages above the threshold of 0.99.  However, as
scores drop to 0.98 or 0.97, they'll auto-update.  I'm betting that any
efficiency lost will be self-correcting.
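
The percentages can be checked directly from the counts in the table
above.  This is just the arithmetic, written as a short Python sketch;
the helper name skipped_percent is made up for illustration.

```python
# Counts from the table: (total messages, registrations) per day.
counts = {"Thursday": (699, 18), "Friday": (715, 12)}


def skipped_percent(total: int, registered: int) -> float:
    """Percentage of messages that were not registered."""
    return 100.0 * (total - registered) / total


if __name__ == "__main__":
    for day, (total, registered) in counts.items():
        print(f"{day}: {skipped_percent(total, registered):.1f}% skipped")

    # Combined figure over both days.
    all_total = sum(total for total, _ in counts.values())
    all_registered = sum(registered for _, registered in counts.values())
    print(f"Both days: {skipped_percent(all_total, all_registered):.1f}% skipped")
```

This gives 97.4% for Thursday, 98.3% for Friday, and 97.9% for the two
days combined.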

Since it's an option, it can be used (or not used) as judgment (or whim)
dictates.

David



