Database Size versus Shannon's Word Entropy

Rick van Rein rick at openfortress.nl
Sun Oct 22 20:11:37 CEST 2017


Hello,

I was wondering why the Bogofilter database grows so large.  I have one
of around 44 MB, while a quick entropy calculation goes like this:

 1. I use Dutch and English, probably good for 3000+2000 words
 2. I could scatter words over bits with insecure hash FNV-1a [0]
 3. I might cut off 12 bits for 8000 entries with some clashes
 4. This coincides with Shannon's findings on word entropy [1]
 5. For each word, I would store 2 counters of 16 or 32 bits
 6. Total storage space is then 32 kB or 64 kB, 1000x smaller

So I plunged into the database.  That became a startling endeavour:

 * The database holds 739803 entries
 * the same date is (currently) present in all entries
 * 3577   entries (1,0) out of 6896   entries (1,*)
 * 593753 entries (0,1) out of 594170 entries (*,1)
 * 142473 entries are neither(1,0) nor (0,1)

The majority of the wordlist is filled with one-shot words!  Looking at
a few, I found a!wƒà¹ and ARMDevices in my vocabulary.  I am sure, given
the mechanism of rounding towards the extremes of spam and non-spam,
that all-or-nothing probabilities are very helpful, but isn't this the
Achilles' heel of Bayesian filtering?

I mean, it is jolly good for recall measurements.  But once received, I
don't really care about a message's reclassification.  In real-world
terms, my concern is with new messages [2].  And I am starting to wonder
if recall is any meaure to go by for that.  If not, we may be wasting
80% of our database storage space... not to mention that search times
are longer than needed.

BTW:
I also see ways of improving on the statistical sensitivity of Bayesian,
so I'm not here to burn the concept :) but I was very surprised to find
this...


Hopefully helpful :)

 -Rick


[0] http://www.isthe.com/chongo/tech/comp/fnv/
[1]
http://people.seas.harvard.edu/~jones/cscie129/papers/stanford_info_paper/entropy_of_english_9.htm
[2] This message is covered though, due to its mention of a!wƒà¹


More information about the bogofilter mailing list