Using the -u option and database size

Tom Anderson tanderso at oac-design.com
Wed Mar 21 20:30:20 CET 2007


John G Walker wrote:
> 
> On Wed, 21 Mar 2007 06:27:22 -0500 Bill McClain
> <wmcclain at salamander.com> wrote:
> 
> 
>>On Wed, 21 Mar 2007 10:42:41 +0100
>>Peter Gutbrod <lists at media-fact.com> wrote:
>>
>>
>>>So far I have used the -u option with bogofilter. Meanwhile my
>>>wordlist.db has grown to about 200 MB, and I'm wondering whether
>>>it puts much load on the server to match each mail against such a
>>>big database.
>>>
>>>I think the size is mainly due to the automatic registering with
>>>the -u option.
>>>
>>>So what do you think? Is it better not to use the -u option, to keep
>>>the database small? Or do you think a 200 MB database is not a
>>>problem, even on a production mail server that receives thousands
>>>of (spam) emails every day?
>>
>>You might look into the "thresh_update" parameter:
>>
>>#       Skip autoupdating if the spamicity is within this value
>>#       of 0.000000 (surely ham) or 1.000000 (surely spam).
>>
>>I use the default of 0.01, meaning spam with spamicity greater than
>>0.99 and ham with spamicity less than 0.01 are not registered.  This
>>cuts down autoupdate registrations by a large factor; maybe to 1/10?
>>
>>The idea is that well-recognized messages do not need to be
>>registered. You do miss some new tokens and counts that might be
>>useful in the future, but in practice I've found that accuracy is not
>>harmed.
>>
> 
> 
> I seem to have missed this parameter, which looks very useful. So thanks
> for that.
> 
> Bogofilter very quickly pushes the spamicity of recognised messages to
> the extremes, so it would not be unreasonable to set this parameter to
> 0.1, cutting down registration by an even greater amount.
> 
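
For reference, that parameter goes in bogofilter's configuration file 
(bogofilter.cf, or a per-user ~/.bogofilterrc, depending on how your 
installation is set up); a minimal snippet along these lines sets it 
explicitly:

   # Skip auto-update (-u) registration when a message's spamicity is
   # already within this distance of 0.000000 (ham) or 1.000000 (spam).
   thresh_update=0.01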

I've been running bogofilter for years with the -u option enabled and I 
receive at least 100-200 spams per day.  I also use thresh_update set at 
0.01.  My wordlist is at 78MB.  It grew from 76MB in my October backup, 
which I consider negligible growth over the course of nearly half a 
year.  I think that wordlist size will tend to follow Heaps' law: sooner 
or later you start to saturate the number of unique tokens you will ever 
see, and growth therefore slows significantly.  I also eliminate a lot of 
the foreign tokens I might otherwise be exposed to by using DNS block 
lists before bogofilter.
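
If you want to check whether your own token count is actually levelling 
off, bogoutil can dump the wordlist as text, so a rough count of unique 
tokens is just (the path shown is the usual default; adjust for your 
setup):

   # Approximate number of unique tokens in the wordlist.
   bogoutil -d ~/.bogofilter/wordlist.db | wc -l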
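
And for anyone wiring -u into a mail pipeline, the usual procmail 
arrangement is roughly the one from the bogofilter documentation; the 
folder name below is just a placeholder:

   # Filter every message through bogofilter: -p adds an X-Bogosity
   # header, -e exits 0 for both spam and ham, -u auto-registers the
   # message according to its classification.
   :0fw
   | bogofilter -u -e -p

   # Deliver anything tagged as spam to a separate folder (older
   # versions of bogofilter write "Yes" instead of "Spam").
   :0:
   * ^X-Bogosity: Spam, tests=bogofilter
   spam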

Tom



