Adding a "fixmap" datastore: Fixed-size mmap() files
Rick van Rein
rick at openfortress.nl
Sun Jul 25 16:14:46 CEST 2021
Hello,
>>> I am brooding on a generalisation that spreads data
>>> more evenly around the database entries, a bit like in a hologram.
This surely was vague. I'm clearer now.
Code:
https://gitlab.com/arpa2/bogosort/-/compare/c0a72cbf948880e3227b02411758392860554380...fixedsize?from_project_id=28247093
Idea:
I started off not storing words, but using them to compute a hash and
storing word counts at those hashes. There are bound to be clashes in
such a setup, where different words end up in the same slot. I wanted
to investigate how well that behaved.
My second try, shown above, spreads a word over the database using a
scheme that is more unique to the word. I basically aim for the store
size and an increment from the storage position to be relative prime, so
I know I will make the complete round before I'm back at the original
storage position. I add +1 to the first and -1 to the second half.
On average, the offset is 0. This means that the best recover is when
precisely the same factors +1 and -1 are used. All others find lower
values and see the average pattern of other words as a 0 offset.
This raises several concerns. For one, what used to be a strict 0 base
is now something that averages to 0, but in every concrete situation it
deviates from 0. This means the numbers obtained are less accurate than
they were (of course, with so much information dropped) but they are not
necessarily useless. That is what I'm exploring.
The word histogram shows an inclination towards the average, which is in
no way surpising. What I am still wondering is if it is a problem, or
could be removed with the right mathematical approach. I knew that I
would loose information, but an open question still is how much of it is
a real loss.
It looks like about 25% of the information is lost, which "feels right"
to my intuition. The obvious cases are still there, even in 64 kB, so
it may be a matter of selecting properly (it looks like Bogofilter
facilitates that with the --min-dev option already).
Cheers,
-Rick
Histogram for a fixmap with filesize 64 kB
score count pct histogram
0.00 1569 32.68 ################################################
0.05 6 0.12 #
0.10 19 0.40 #
0.15 40 0.83 ##
0.20 55 1.15 ##
0.25 59 1.23 ##
0.30 128 2.67 ####
0.35 239 4.98 ########
0.40 97 2.02 ###
0.45 314 6.54 ##########
0.50 61 1.27 ##
0.55 142 2.96 #####
0.60 56 1.17 ##
0.65 430 8.96 ##############
0.70 105 2.19 ####
0.75 141 2.94 #####
0.80 6 0.12 #
0.85 74 1.54 ###
0.90 28 0.58 #
0.95 1232 25.66 ######################################
tot 4801
hapaxes: ham 523 (10.89%), spam 661 (13.77%)
pure: ham 1567 (32.64%), spam 1224 (25.49%)
Histogram for a fixmap with filesize 1 MB (high word counts!!)
score count pct histogram
0.00 14252 64.63 ################################################
0.05 1 0.00 #
0.10 4 0.02 #
0.15 13 0.06 #
0.20 5 0.02 #
0.25 29 0.13 #
0.30 45 0.20 #
0.35 79 0.36 #
0.40 302 1.37 ##
0.45 93 0.42 #
0.50 869 3.94 ###
0.55 34 0.15 #
0.60 130 0.59 #
0.65 64 0.29 #
0.70 1826 8.28 #######
0.75 46 0.21 #
0.80 178 0.81 #
0.85 51 0.23 #
0.90 68 0.31 #
0.95 3963 17.97 ##############
tot 22052
hapaxes: ham 11976 (54.31%), spam 3258 (14.77%)
pure: ham 14252 (64.63%), spam 3949 (17.91%)
Histogram for a loss-free datastore
score count pct histogram
0.00 3515 66.30 ################################################
0.05 1 0.02 #
0.10 1 0.02 #
0.15 7 0.13 #
0.20 10 0.19 #
0.25 12 0.23 #
0.30 16 0.30 #
0.35 27 0.51 #
0.40 38 0.72 #
0.45 19 0.36 #
0.50 82 1.55 ##
0.55 10 0.19 #
0.60 29 0.55 #
0.65 135 2.55 ##
0.70 7 0.13 #
0.75 22 0.41 #
0.80 63 1.19 #
0.85 14 0.26 #
0.90 28 0.53 #
0.95 1266 23.88 ##################
tot 5302
hapaxes: ham 2593 (48.91%), spam 784 (14.79%)
pure: ham 3515 (66.30%), spam 1257 (23.71%)
More information about the bogofilter-dev
mailing list