Adding a "fixmap" datastore: Fixed-size mmap() files

Rick van Rein rick at openfortress.nl
Wed Jul 21 14:20:27 CEST 2021


Hello,

I'm a long-time user/fan of Bogofilter, but am interested in a somewhat
more general use case, namely sorting mail into topics or perhaps into
mail aliases.  To enable that, and other larger-than-personal uses, I am
playing with smaller backend stores.

I created an initial datastore that seems promising enough to mention.
The README, the code and my notes on the errors are at:

https://gitlab.com/arpa2/bogosort/-/blob/fixedsize/bogofilter/doc/README.fixmap

https://gitlab.com/arpa2/bogosort/-/blob/fixedsize/bogofilter/src/datastore_fixmap.c

https://gitlab.com/arpa2/bogosort/-/blob/fixedsize/bogofilter/doc/fixmap-errors.md

Briefly put, a 64 kB store is tight, but 1 MB looks very good.  The
t.wordhist test shows that especially nicely, though I fear I am not
always sure how to interpret the test output.


The tests, however, fail massively.  The reason is that the expected
output is not reproduced exactly: there are small variations in the
numeric data and in the histograms, and that is basically all that
remains with the 1 MB fixmap.  I'm not sure how to approach this; the
tests make sense but are overly tight for my "lossy database" approach.
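
To illustrate what a looser check could look like, here is a minimal
sketch that compares two t.wordhist-style histograms bucket by bucket
and accepts them when every percentage stays within a tolerance.  The
function names, the parsing and the 0.5 percentage-point tolerance are
assumptions made for illustration; none of this is part of the
bogofilter test suite.

```python
# Three buckets copied from the histograms in this post (score, count, pct).
REFERENCE = """\
0.00     3515 66.30
0.65      135  2.55
0.95     1266 23.88
tot      5302
"""
FIXMAP_1MB = """\
0.00     3429 65.90
0.65      148  2.84
0.95     1224 23.52
tot      5203
"""
FIXMAP_64KB = """\
0.00     2260 58.17
0.65      252  6.49
0.95      727 18.71
tot      3885
"""


def parse_hist(text):
    """Map each score bucket to its percentage; skip non-bucket rows."""
    pct = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # blank lines and the 'tot' row
        try:
            score, share = float(fields[0]), float(fields[2])
        except ValueError:
            continue  # header, 'hapaxes:' and 'pure:' rows
        pct[score] = share
    return pct


def max_drift(reference, candidate):
    """Largest difference in percentage points over the reference buckets."""
    ref, cand = parse_hist(reference), parse_hist(candidate)
    return max(abs(ref[s] - cand.get(s, 0.0)) for s in ref)


def within_tolerance(reference, candidate, tolerance=0.5):
    """Accept the candidate when no bucket drifts more than `tolerance`."""
    return max_drift(reference, candidate) <= tolerance


print(within_tolerance(REFERENCE, FIXMAP_1MB))
print(within_tolerance(REFERENCE, FIXMAP_64KB))
```

With these three buckets the 1 MB figures stay within the assumed
tolerance and the 64 kB figures do not, which matches the impression
above that 1 MB looks good and 64 kB is tight.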


For another version, I am brooding on a generalisation that spreads data
more evenly around the database entries, a bit like in a hologram.  This
should reduce the impact of word clashes, and evenly spread out their
distortion.  As a result, even a 64 kB fixmap should be quite good; it
has a capacity of a little over 8000 good/spam pairs, which is more than
a common vocabulary.  (And I'm guessing that headers and the like do
not upset that estimate too much.)
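
The capacity figure can be checked with a little arithmetic, and the
effect of word clashes can be shown with a toy store.  The sketch below
assumes 8 bytes per entry (one 32-bit good counter plus one 32-bit spam
counter), which gives 65536 / 8 = 8192 entries and so agrees with "a
little over 8000"; the real layout in datastore_fixmap.c may well
differ.  The class name ToyFixmap, the CRC-32 hash and the file name
are hypothetical, and the sketch does not attempt the hologram-like
spreading described above: clashing words simply share one entry.

```python
import mmap
import os
import struct
import tempfile
import zlib

ENTRY = struct.Struct("<II")  # assumed layout: good count, spam count


class ToyFixmap:
    """Fixed-size, lossy word-count table backed by an mmap()ed file."""

    def __init__(self, path, size):
        self.slots = size // ENTRY.size
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            os.ftruncate(fd, size)           # the file never grows past this
            self.map = mmap.mmap(fd, size)   # mmap keeps its own descriptor
        finally:
            os.close(fd)

    def _offset(self, word):
        # Words that hash to the same slot share counters: that is the loss.
        return (zlib.crc32(word.encode()) % self.slots) * ENTRY.size

    def add(self, word, good=0, spam=0):
        # No overflow handling; counts are assumed to fit in 32 bits.
        off = self._offset(word)
        g, s = ENTRY.unpack_from(self.map, off)
        ENTRY.pack_into(self.map, off, g + good, s + spam)

    def get(self, word):
        return ENTRY.unpack_from(self.map, self._offset(word))

    def close(self):
        self.map.close()


path = os.path.join(tempfile.mkdtemp(), "toy.fixmap")
store = ToyFixmap(path, 64 * 1024)
store.add("lottery", spam=3)
store.add("lottery", good=1)
print(store.slots, store.get("lottery"))
store.close()
```

Because every word maps to exactly one entry here, two clashing words
distort each other fully; the generalisation I am brooding on is meant
to spread that distortion more evenly.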


I'm interested in hearing feedback.  If you want, I can send test
output for the 64 kB store and some for the 1 MB store.  My summary of
it is linked above.


Cheers,
 -Rick


Histogram serving as reference data
score   count  pct  histogram
0.00     3515 66.30 ################################################
0.05        1  0.02 #
0.10        1  0.02 #
0.15        7  0.13 #
0.20       10  0.19 #
0.25       12  0.23 #
0.30       16  0.30 #
0.35       27  0.51 #
0.40       38  0.72 #
0.45       19  0.36 #
0.50       82  1.55 ##
0.55       10  0.19 #
0.60       29  0.55 #
0.65      135  2.55 ##
0.70        7  0.13 #
0.75       22  0.41 #
0.80       63  1.19 #
0.85       14  0.26 #
0.90       28  0.53 #
0.95     1266 23.88 ##################
tot      5302
hapaxes:  ham    2593 (48.91%), spam     784 (14.79%)
   pure:  ham    3515 (66.30%), spam    1257 (23.71%)


Histogram of a 1 MB fixmap
score   count  pct  histogram
0.00     3429 65.90 ################################################
0.05        1  0.02 #
0.10        1  0.02 #
0.15        6  0.12 #
0.20       10  0.19 #
0.25       14  0.27 #
0.30       15  0.29 #
0.35       27  0.52 #
0.40       39  0.75 #
0.45       18  0.35 #
0.50       88  1.69 ##
0.55       10  0.19 #
0.60       32  0.62 #
0.65      148  2.84 ###
0.70        6  0.12 #
0.75       22  0.42 #
0.80       69  1.33 #
0.85       14  0.27 #
0.90       30  0.58 #
0.95     1224 23.52 ##################
tot      5203
hapaxes:  ham    2498 (48.01%), spam     751 (14.43%)
   pure:  ham    3429 (65.90%), spam    1215 (23.35%)


Histogram of a 64 kB fixmap
score   count  pct  histogram
0.00     2260 58.17 ################################################
0.05        0  0.00
0.10       10  0.26 #
0.15       12  0.31 #
0.20       16  0.41 #
0.25       20  0.51 #
0.30       25  0.64 #
0.35       41  1.06 #
0.40       75  1.93 ##
0.45       22  0.57 #
0.50      128  3.29 ###
0.55       16  0.41 #
0.60       48  1.24 ##
0.65      252  6.49 ######
0.70       11  0.28 #
0.75       41  1.06 #
0.80      113  2.91 ###
0.85       33  0.85 #
0.90       35  0.90 #
0.95      727 18.71 ################
tot      3885
hapaxes:  ham    1342 (34.54%), spam     416 (10.71%)
   pure:  ham    2260 (58.17%), spam     722 (18.58%)

