Database Size versus Shannon's Word Entropy
Rick van Rein
rick at openfortress.nl
Sun Oct 22 20:11:37 CEST 2017
Hello,
I was wondering why the Bogofilter database grows so large. I have one
of around 44 MB, while a quick entropy calculation goes like this:
1. I use Dutch and English, probably good for 3000+2000 words
2. I could scatter words over bits with insecure hash FNV-1a [0]
3. I might cut off 12 bits for 8000 entries with some clashes
4. This coincides with Shannon's findings on word entropy [1]
5. For each word, I would store 2 counters of 16 or 32 bits
6. Total storage space is then 32 kB or 64 kB, 1000x smaller
So I plunged into the database. That became a startling endeavour:
* The database holds 739803 entries
* the same date is (currently) present in all entries
* 3577 entries (1,0) out of 6896 entries (1,*)
* 593753 entries (0,1) out of 594170 entries (*,1)
* 142473 entries are neither(1,0) nor (0,1)
The majority of the wordlist is filled with one-shot words! Looking at
a few, I found a!w๠and ARMDevices in my vocabulary. I am sure, given
the mechanism of rounding towards the extremes of spam and non-spam,
that all-or-nothing probabilities are very helpful, but isn't this the
Achilles' heel of Bayesian filtering?
I mean, it is jolly good for recall measurements. But once received, I
don't really care about a message's reclassification. In real-world
terms, my concern is with new messages [2]. And I am starting to wonder
if recall is any meaure to go by for that. If not, we may be wasting
80% of our database storage space... not to mention that search times
are longer than needed.
BTW:
I also see ways of improving on the statistical sensitivity of Bayesian,
so I'm not here to burn the concept :) but I was very surprised to find
this...
Hopefully helpful :)
-Rick
[0] http://www.isthe.com/chongo/tech/comp/fnv/
[1]
http://people.seas.harvard.edu/~jones/cscie129/papers/stanford_info_paper/entropy_of_english_9.htm
[2] This message is covered though, due to its mention of a!wà¹
More information about the bogofilter
mailing list