Database Size versus Shannon's Word Entropy

Matthias Andree matthias.andree at gmx.de
Tue Oct 24 22:36:53 CEST 2017


On 23.10.2017 at 00:18, RW wrote:
> On Sun, 22 Oct 2017 20:11:37 +0200
> Rick van Rein wrote:
> 
>> Hello,
>>
>> I was wondering why the Bogofilter database grows so large.  I have
>> one of around 44 MB, 
>>
>> So I plunged into the database.  That became a startling endeavour:
>>
>> * The database holds 739803 entries
> 
> not particularly large
> 
> ...
>> The majority of the wordlist is filled with one-shot words!  Looking
>> at a few, I found a!wƒà¹ and ARMDevices in my vocabulary.  
> ...
>> isn't this the Achilles' heel of Bayesian filtering?
> 
> No, and the fact that it doesn't much matter is one of its greatest
> strengths. It allows the filter to find the useful tokens for itself.
> 
> Lookup times should be relatively insensitive to database size - there's
> no intrinsic reason why they can't be independent of the number of
> tokens. The file size isn't very important until it becomes a
> significant fraction of RAM size.  

"Relatively" as in roughly logarithmic: most database back-ends that
bogofilter supports use some form of B-tree, so multiplying the DB size
by a constant factor only adds a constant amount to the access time.
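
To put rough numbers on the logarithmic behaviour, here is a minimal
sketch that counts how many tree levels a lookup has to descend.  The
fanout of 100 keys per page and the name btree_levels are assumptions
made up for illustration; they are not taken from Berkeley DB or from
bogofilter, and real page fill varies with key size.

```python
def btree_levels(entries, fanout=100):
    """Estimate the levels a lookup descends in a B-tree-like index.

    `fanout` is a hypothetical number of keys per page; a real
    database will differ depending on page size and key length.
    """
    levels = 1
    capacity = fanout  # keys reachable with the current number of levels
    while capacity < entries:
        capacity *= fanout
        levels += 1
    return levels


if __name__ == "__main__":
    base = 739803  # entry count quoted in this thread
    for factor in (1, 100, 10000):
        print(factor, btree_levels(base * factor))
```

Under these assumptions, every hundredfold growth of the wordlist costs
one additional level per lookup, not a hundred times the work.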

We tried Berkeley DB Hash back then to get rid of that, but found that
BTree worked better overall.

>> * the same date is (currently) present in all entries
> 
> This is a limitation of bogofilter that the timestamp only gets updated
> when the token's counts are updated. But if you train both spam and ham
> regularly you can eventually age-out tokens, if you want, using
> bogoutil. 

Well, if we kept access times for tokens that are merely read, we could
do least-recently-used eviction, but that would massively increase the
write volume.

Bogoutil also lets the user filter out seen-only-once tokens
(lower-case -c option; to filter by age, see the -a option).  Not sure
it really matters much: 44 MB seems small enough these days (it sure
wasn't when I built my first Linux PC around a DX4 in the late 1990s).
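
For anyone who would rather inspect the effect before touching the
database, the same kind of count filter can be sketched on a text dump
of the wordlist.  This assumes each dump line has the form
"token spamcount goodcount yyyymmdd"; the function name
drop_rare_tokens and the min_total threshold are hypothetical and only
serve the illustration, they are not part of bogoutil.

```python
def drop_rare_tokens(dump_lines, min_total=2):
    """Keep dump lines whose spam + good count is at least `min_total`.

    Assumed line format: "token spamcount goodcount yyyymmdd".
    Lines that do not match that shape are skipped.
    """
    kept = []
    for line in dump_lines:
        line = line.rstrip("\n")
        # Split from the right so the three numeric fields are always last.
        parts = line.rsplit(None, 3)
        if len(parts) != 4:
            continue
        try:
            spam, good = int(parts[1]), int(parts[2])
        except ValueError:
            continue
        if spam + good >= min_total:
            kept.append(line)
    return kept


if __name__ == "__main__":
    sample = [
        "ARMDevices 0 1 20171022",  # seen only once
        "invoice 12 30 20171022",
    ]
    for line in drop_rare_tokens(sample):
        print(line)
```

Comparing the number of lines before and after gives a quick idea of
how much of the wordlist consists of one-shot tokens.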

