Bogofilter -u ??? - thread: 0.93.1 and DB_CONFIG flags

Fri Nov 12 22:23:05 CET 2004

On Fri, 12 Nov 2004 13:45:17 -0700
Michael Gale wrote:

> Hello,
> 
> I was reading the thread "0.93.1 and DB_CONFIG flags" and a comment
> was made:
> 
> --snip--
> bogofilter -u is a hog and not well understood from the
> scoring/learning point of view. The ostensive argumentation of
> "learning from known spam" or such isn't founded on figures, and I
> don't dare offer my gut feeling on this topic.
> 
> It is however well understood that -u entails write mode even for
> scoring because of the subsequent registration, entails queueing
> behind locks at least for the page holding .MSG_COUNT, log writes,
> data base growth with unknown gains for the scoring accuracy.
> 
> I'd currently discourage the use of the -u option and if only for
> performance reasons.
> --snip--
> 
> 
> I checked the bogofilter.html file in the doc directory and it does
> warn against e-mail slowdowns on large systems or under heavy load.
> 
> Besides the added system load ... do people find this option helps ??
> 
> I am currently using Bogofilter with the -u -p -e options on my postfix
> gateway box, and any messages that are misclassified get put into a
> public folder and are automatically reclassified.
> 
> Without this option your wordlist would only get updated on error?
> Would the "always training" be more accurate?
> 
> Thanks.
> 
> 
> -- 
> Michael Gale
> Lan Administrator
> Utilitran Corp.

Hello Michael,

I've been using '-u' since it became available and have been pretty happy
with it.  After a while I noticed that the database was growing quite
rapidly (due to the need to split pages when inserting tokens).  I also
noticed that a large percentage of incoming messages were (to
bogofilter) obviously ham or obviously spam, i.e. scored as 0.000000 or
as 1.000000.  

I also got to wondering what would happen if bogofilter had a threshold
parameter so that messages scoring close to 0 or to 1 would _not_ cause
a data base update.  One result would be less database growth.  Another
result would be slower "learning" of hammish and spammish tokens.  I
realized that _not_ adding some of the '-u' tokens would cause the
number of unknown tokens to increase and that would likely cause scores
to drift away from 0 and 1 towards 0.5.  Once scores drifted beyond the
threshold value, bogofilter would autoupdate and that would provide
compensation (for not registering _every_ spam/ham).
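To make the threshold idea concrete, here is a minimal sketch in Python.
It is not bogofilter's actual code: the function name, the exact boundary
handling, and the sample scores are all assumptions made up for
illustration.  It only shows the decision described above, i.e. skip the
update when the score is within the threshold of 0 or of 1.

```python
def should_autoupdate(score: float, thresh_update: float = 0.01) -> bool:
    """Hypothetical helper: decide whether a scored message is registered.

    Messages scoring within thresh_update of 0 (obvious ham) or of 1
    (obvious spam) are skipped; everything in between triggers an update.
    """
    near_ham = score < thresh_update
    near_spam = score > 1.0 - thresh_update
    return not (near_ham or near_spam)


# Made-up scores, only to illustrate the effect on the number of updates.
scores = [0.000000, 1.000000, 0.004, 0.37, 0.998, 0.62, 1.000000, 0.000000]
updates = [s for s in scores if should_autoupdate(s)]
print(f"{len(updates)} of {len(scores)} messages would trigger an update")
```

With a threshold of 0, every message would still be registered, which
matches plain '-u' behaviour as described in this thread.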

So, that's the theory.  In practice, I've been using
"thresh_update=0.01" in /etc/bogofilter.cf since February.  My FN rate
(missed spam) hovers around 1 in 700.  I've seen 7 FPs since then, which
is an FP rate of 1 in 20,000.

As for efficiency, '-u' _does_ open the wordlist with write permission and
that _is_ slower.  With my 0.01 threshold, the number of updates has
dropped by 90%.

I am _very_ conscientious about correcting classification errors and
that has helped keep bogofilter's accuracy high.

Summary:  you pays your money and you makes your choice :-)

HTH,

David

With a threshold of 0.01 (for example)