Database Size versus Shannon's Word Entropy

RW rwmaillists at googlemail.com
Mon Oct 23 00:18:54 CEST 2017


On Sun, 22 Oct 2017 20:11:37 +0200
Rick van Rein wrote:

> Hello,
> 
> I was wondering why the Bogofilter database grows so large.  I have
> one of around 44 MB, 
> 
> So I plunged into the database.  That became a startling endeavour:
> 
> * The database holds 739803 entries

Not particularly large. 44 MB over 739803 entries works out to roughly
60 bytes per entry, which is about what a token string plus its two
counts, a date stamp and some index overhead costs.

...
> The majority of the wordlist is filled with one-shot words!  Looking
> at a few, I found a!wƒà¹ and ARMDevices in my vocabulary.  
...
> isn't this the Achilles' heel of Bayesian filtering?

No, and the fact that it doesn't much matter is one of Bayesian
filtering's greatest strengths: it allows the filter to find the useful
tokens for itself. A one-shot token carries too little evidence to earn
a strong score, and most such tokens never turn up in mail again, so
they just sit in the database unconsulted.
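
If you want to see why the counts matter, here's a rough sketch in
Python (untested, and not bogofilter's actual C code) of the
Robinson-style smoothed token probability that bogofilter's scoring is
based on. The robs/robx names stand in for the prior-strength and
prior-probability tuning parameters, and the default values below are
illustrative, not bogofilter's tuned defaults:

def token_probability(spam_count, ham_count, spam_msgs, ham_msgs,
                      robs=1.0, robx=0.5):
    """Smoothed estimate of P(spam | token), Robinson-style.

    robs is the weight given to the prior robx.  With tiny counts the
    estimate stays close to robx, so one-shot tokens score weakly.
    """
    # Per-message frequencies, so unequal corpus sizes don't bias things.
    f_spam = spam_count / spam_msgs if spam_msgs else 0.0
    f_ham = ham_count / ham_msgs if ham_msgs else 0.0
    n = spam_count + ham_count
    raw = f_spam / (f_spam + f_ham) if (f_spam + f_ham) else robx
    return (robs * robx + n * raw) / (robs + n)

# A token seen once in spam vs. one seen a hundred times:
print(token_probability(1, 0, 10000, 10000))    # 0.75   - weak evidence
print(token_probability(100, 0, 10000, 10000))  # ~0.995 - strong evidence

If I remember right, bogofilter also skips tokens whose score falls
within min_dev of 0.5 when combining, so near-neutral one-shots don't
even enter the final calculation.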

Lookup times should be relatively insensitive to database size - there's
no intrinsic reason why they can't be independent of the number of
tokens: a hash index makes each lookup constant-time, and even a B-tree
grows only logarithmically. Classification also only ever looks up the
tokens that appear in the message at hand, not the whole wordlist. The
file size isn't very important until it becomes a significant fraction
of RAM size.
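
A quick stand-in experiment with a Python hash table, if you want to
convince yourself (bogofilter itself goes through its database backend,
Berkeley DB by default, but the point about lookup cost is the same):

import random
import string
import timeit

def build(n):
    """A dict of n random 10-char tokens -> fake (spam, ham) counts."""
    return {''.join(random.choices(string.ascii_lowercase, k=10)): (1, 0)
            for _ in range(n)}

for n in (10_000, 100_000, 1_000_000):
    db = build(n)
    probe = next(iter(db))  # any token known to be present
    t = timeit.timeit(lambda: db.get(probe), number=1_000_000)
    print(f"{n:>9} tokens: {t:.2f} s per million lookups")

The per-lookup time should come out essentially flat across two orders
of magnitude of table size; size only starts to hurt once the working
set no longer fits in memory.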

> * the same date is (currently) present in all entries

This is a limitation of bogofilter: the timestamp only gets updated
when the token's counts are updated. But if you train on both spam and
ham regularly, the timestamps become meaningful, and you can eventually
age out stale tokens, if you want, using bogoutil.
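
For example, something along these lines, run as
bogoutil -d wordlist.db | python age_out.py | bogoutil -l wordlist.new
(the age_out.py name is mine). I'm assuming each dump line ends in a
YYYYMMDD date, i.e. roughly "token spamcount goodcount date" - check
that against what your bogoutil -d actually prints, and check the man
page first, since bogoutil's maintenance mode may be able to do this
natively:

import sys

CUTOFF = 20160101  # hypothetical: drop tokens last updated before 2016

for line in sys.stdin:
    if line.startswith(".MSG_COUNT"):
        sys.stdout.write(line)  # keep the bookkeeping entry regardless
        continue
    fields = line.rsplit(None, 3)
    # Pass through any line that doesn't match the expected shape
    # rather than guessing about it.
    if len(fields) == 4 and fields[-1].isdigit() and int(fields[-1]) < CUTOFF:
        continue  # stale token: leave it out of the reloaded wordlist
    sys.stdout.write(line)

Reload into a fresh file rather than overwriting the live wordlist, so
you can compare the two before switching over.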

