Adding a "fixmap" datastore: Fixed-size mmap() files

Sun Jul 25 16:14:46 CEST 2021

Hello,

>>> I am brooding on a generalisation that spreads data
>>> more evenly around the database entries, a bit like in a hologram.

This surely was vague.  I'm clearer now.

Code:
https://gitlab.com/arpa2/bogosort/-/compare/c0a72cbf948880e3227b02411758392860554380...fixedsize?from_project_id=28247093

Idea:
I started off not storing words, but using them to compute a hash and
storing word counts at those hashes.  There are bound to be clashes in
such a setup, where different words end up in the same slot.  I wanted
to investigate how well that behaved.

My second try, shown above, spreads a word over the database using a
scheme that is more unique to the word.  I basically aim for the store
size and an increment from the storage position to be relative prime, so
I know I will make the complete round before I'm back at the original
storage position.  I add +1 to the first and -1 to the second half.

On average, the offset is 0.  This means that the best recover is when
precisely the same factors +1 and -1 are used.  All others find lower
values and see the average pattern of other words as a 0 offset.

This raises several concerns.  For one, what used to be a strict 0 base
is now something that averages to 0, but in every concrete situation it
deviates from 0.  This means the numbers obtained are less accurate than
they were (of course, with so much information dropped) but they are not
necessarily useless.  That is what I'm exploring.

The word histogram shows an inclination towards the average, which is in
no way surpising.  What I am still wondering is if it is a problem, or
could be removed with the right mathematical approach.  I knew that I
would loose information, but an open question still is how much of it is
a real loss.

It looks like about 25% of the information is lost, which "feels right"
to my intuition.  The obvious cases are still there, even in 64 kB, so
it may be a matter of selecting properly (it looks like Bogofilter
facilitates that with the --min-dev option already).

Cheers,
 -Rick

Histogram for a fixmap with filesize 64 kB
score   count  pct  histogram
0.00     1569 32.68 ################################################
0.05        6  0.12 #
0.10       19  0.40 #
0.15       40  0.83 ##
0.20       55  1.15 ##
0.25       59  1.23 ##
0.30      128  2.67 ####
0.35      239  4.98 ########
0.40       97  2.02 ###
0.45      314  6.54 ##########
0.50       61  1.27 ##
0.55      142  2.96 #####
0.60       56  1.17 ##
0.65      430  8.96 ##############
0.70      105  2.19 ####
0.75      141  2.94 #####
0.80        6  0.12 #
0.85       74  1.54 ###
0.90       28  0.58 #
0.95     1232 25.66 ######################################
tot      4801
hapaxes:  ham     523 (10.89%), spam     661 (13.77%)
   pure:  ham    1567 (32.64%), spam    1224 (25.49%)

Histogram for a fixmap with filesize 1 MB (high word counts!!)
score   count  pct  histogram
0.00    14252 64.63 ################################################
0.05        1  0.00 #
0.10        4  0.02 #
0.15       13  0.06 #
0.20        5  0.02 #
0.25       29  0.13 #
0.30       45  0.20 #
0.35       79  0.36 #
0.40      302  1.37 ##
0.45       93  0.42 #
0.50      869  3.94 ###
0.55       34  0.15 #
0.60      130  0.59 #
0.65       64  0.29 #
0.70     1826  8.28 #######
0.75       46  0.21 #
0.80      178  0.81 #
0.85       51  0.23 #
0.90       68  0.31 #
0.95     3963 17.97 ##############
tot     22052
hapaxes:  ham   11976 (54.31%), spam    3258 (14.77%)
   pure:  ham   14252 (64.63%), spam    3949 (17.91%)

Histogram for a loss-free datastore
score   count  pct  histogram
0.00     3515 66.30 ################################################
0.05        1  0.02 #
0.10        1  0.02 #
0.15        7  0.13 #
0.20       10  0.19 #
0.25       12  0.23 #
0.30       16  0.30 #
0.35       27  0.51 #
0.40       38  0.72 #
0.45       19  0.36 #
0.50       82  1.55 ##
0.55       10  0.19 #
0.60       29  0.55 #
0.65      135  2.55 ##
0.70        7  0.13 #
0.75       22  0.41 #
0.80       63  1.19 #
0.85       14  0.26 #
0.90       28  0.53 #
0.95     1266 23.88 ##################
tot      5302
hapaxes:  ham    2593 (48.91%), spam     784 (14.79%)
   pure:  ham    3515 (66.30%), spam    1257 (23.71%)