Database Size versus Shannon's Word Entropy
Matthias Andree
matthias.andree at gmx.de
Tue Oct 24 22:36:53 CEST 2017
On 23.10.2017 at 00:18, RW wrote:
> On Sun, 22 Oct 2017 20:11:37 +0200
> Rick van Rein wrote:
>
>> Hello,
>>
>> I was wondering why the Bogofilter database grows so large. I have
>> one of around 44 MB,
>>
>> So I plunged into the database. That became a startling endeavour:
>>
>> * The database holds 739803 entries
>
> not particularly large
>
> ...
>> The majority of the wordlist is filled with one-shot words! Looking
>> at a few, I found a!w๠and ARMDevices in my vocabulary.
> ...
>> isn't this the Achilles' heel of Bayesian filtering?
>
> No, and the fact that it doesn't much matter is one of its greatest
> strengths. It allows the filter to find the useful tokens for itself.
>
> Lookup times should be relatively insensitive to database size - there's
> no intrinsic reason why they can't be independent of the number of
> tokens. The file size isn't very important until it becomes a
> significant fraction of RAM size.
"Relatively" as in roughly logarithmic: most databases that bogofilter
uses are some form of B-whatever-tree, so multiplying the DB size by a
constant factor only adds a constant amount to the access time.
We tried Berkeley DB Hash back then to get rid of that, but found that
BTree worked better overall.
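
To put rough numbers on the logarithmic behaviour, here is a minimal
sketch. The fanout of 100 keys per page is purely an assumption for
illustration, not a figure taken from Berkeley DB, and btree_levels is
a made-up helper name.

```python
def btree_levels(num_keys, fanout):
    """Estimate how many levels a B-tree needs to hold num_keys entries.

    fanout is the assumed number of keys per page (hypothetical value,
    not measured from any real database).
    """
    levels = 1
    capacity = fanout
    # Each extra level multiplies the reachable key count by the fanout.
    while capacity < num_keys:
        capacity *= fanout
        levels += 1
    return levels


if __name__ == "__main__":
    # 739803 is the entry count quoted above; the larger sizes are hypothetical.
    for n in (739803, 7398030, 73980300):
        print(n, btree_levels(n, 100))
```

Under that assumption, growing the wordlist a hundredfold costs about
one more page access per lookup.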
>> * the same date is (currently) present in all entries
>
> This is a limitation of bogofilter that the timestamp only gets updated
> when the token's counts are updated. But if you train both spam and ham
> regularly you can eventually age-out tokens, if you want, using
> bogoutil.
Well, if we kept access times for tokens that are merely read, we could
do least-recently-used eviction, but that would massively increase the
write volume.
Bogoutil also allows the user to filter out seen-only-once tokens
(lower-case -c option; to filter by age, see the -a option). Not sure if
it really matters much: 44 MB seems small enough these days (it sure
wasn't when I built my first Linux PC around a DX4 in the late 1990s).
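
For anyone who would rather see the idea than the option letters, here
is a minimal sketch of count-based pruning on a text dump of the
wordlist. It assumes each dump line has the layout "token spam_count
ham_count date"; that layout, the function name drop_rare_tokens and
the sample counts are assumptions for illustration, and this is not how
bogoutil itself implements it.

```python
def drop_rare_tokens(dump_lines, max_count=1):
    """Yield only the lines whose total count exceeds max_count.

    Assumes each line looks like "token spam_count ham_count date"
    (an assumed layout; check your own dump before relying on it).
    """
    for line in dump_lines:
        # Split from the right so the three trailing fields are always numeric.
        parts = line.rsplit(None, 3)
        if len(parts) != 4:
            continue  # skip lines that do not match the assumed layout
        _token, spam, ham, _date = parts
        if int(spam) + int(ham) > max_count:
            yield line


if __name__ == "__main__":
    # Made-up sample entries; the counts and dates are not real data.
    sample = [
        "ARMDevices 0 1 20171022",
        "free 12 3 20171022",
    ]
    for kept in drop_rare_tokens(sample):
        print(kept)
```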