Adding a "fixmap" datastore: Fixed-size mmap() files
Rick van Rein
rick at openfortress.nl
Wed Jul 21 14:20:27 CEST 2021
Hello,
I'm a long-time user/fan of Bogofilter, but am interested in a somewhat
more general use case, namely sorting mail into topics or perhaps into
mail aliases. To enable that, and other larger-than-personal uses, I am
playing with smaller backend stores.
I created an initial datastore that seems promising enough to mention.
The README, code and error notes are at:
https://gitlab.com/arpa2/bogosort/-/blob/fixedsize/bogofilter/doc/README.fixmap
https://gitlab.com/arpa2/bogosort/-/blob/fixedsize/bogofilter/src/datastore_fixmap.c
https://gitlab.com/arpa2/bogosort/-/blob/fixedsize/bogofilter/doc/fixmap-errors.md
Briefly put, a 64 kB store is tight, but 1 MB looks very good. The
t.wordhist test shows this especially well, though I fear I am not
always sure how to interpret the test output.
The tests, however, fail massively. The reason is that the reproduced
output shows small variations: different numeric data and different
histograms. That is basically all that still differs with the 1 MB
fixmap. I'm not sure how to approach this; the tests make sense but are
overly tight for my "lossy database" approach.
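One possible direction, sketched here only as an illustration and not as part of the existing test suite: compare histograms bucket by bucket with a tolerance rather than byte for byte. The function names and the 1.0 percentage-point threshold are hypothetical; the sample rows are three buckets copied from the histograms in this message.

```python
def parse_histogram(text):
    """Parse 'score count pct [bar]' rows into a {score: pct} dict."""
    rows = {}
    for line in text.strip().splitlines():
        fields = line.split()
        try:
            score, _count, pct = float(fields[0]), int(fields[1]), float(fields[2])
        except (IndexError, ValueError):
            # Header, 'tot' line, blank line or anything else non-numeric.
            continue
        rows[score] = pct
    return rows


def histograms_close(expected, actual, max_pct_diff=1.0):
    """True if every bucket differs by at most max_pct_diff percentage points.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    scores = set(expected) | set(actual)
    return all(
        abs(expected.get(s, 0.0) - actual.get(s, 0.0)) <= max_pct_diff
        for s in scores
    )


# Three buckets taken from the histograms quoted in this message.
REFERENCE = """
score count pct histogram
0.00 3515 66.30 ####
0.50 82 1.55 ##
0.95 1266 23.88 ####
tot 5302
"""
FIXMAP_1MB = """
0.00 3429 65.90
0.50 88 1.69
0.95 1224 23.52
"""
FIXMAP_64KB = """
0.00 2260 58.17
0.50 128 3.29
0.95 727 18.71
"""

if __name__ == "__main__":
    ref = parse_histogram(REFERENCE)
    print("1 MB within tolerance: ", histograms_close(ref, parse_histogram(FIXMAP_1MB)))
    print("64 kB within tolerance:", histograms_close(ref, parse_histogram(FIXMAP_64KB)))
```

With this threshold the 1 MB sample passes and the 64 kB sample does not. Whether a percentage-point bound is the right measure for these tests is an open question.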
For another version, I am mulling over a generalisation that spreads
data more evenly across the database entries, a bit like in a hologram.
This should reduce the impact of word clashes and spread their
distortion evenly. As a result, even a 64 kB fixmap should be quite
good; it has a capacity of a little over 8000 good/spam pairs, which is
more than a common vocabulary. (And I'm guessing that headers et al. are
not too upsetting in that respect.)
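To make the capacity figure and the notion of word clashes concrete, here is a minimal sketch of a fixed-size mmap() count table. It is not the code in datastore_fixmap.c. The entry layout (two 32-bit counters, 8 bytes per good/spam pair), the CRC32 hash and all names are assumptions chosen for illustration; the 8-byte entry is merely consistent with "a little over 8000" pairs in 64 kB, since 65536 / 8 = 8192.

```python
import mmap
import os
import struct
import tempfile
import zlib

# Assumed entry layout: 32-bit good count followed by 32-bit spam count.
ENTRY = struct.Struct("<II")


def capacity(store_bytes):
    """Number of good/spam pairs that fit in a store of the given size."""
    return store_bytes // ENTRY.size


class FixedCounts:
    """Lossy word-count table in a fixed-size memory-mapped file.

    Words are hashed straight to a slot. Two words that hash to the same
    slot share its counters; that is the "word clash" distortion.
    """

    def __init__(self, path, store_bytes):
        self.slots = capacity(store_bytes)
        with open(path, "wb") as f:
            f.truncate(store_bytes)          # zero-filled file of fixed size
        self.file = open(path, "r+b")
        self.map = mmap.mmap(self.file.fileno(), store_bytes)

    def _offset(self, word):
        # CRC32 is a stand-in hash, picked only because it is in the stdlib.
        return (zlib.crc32(word.encode("utf-8")) % self.slots) * ENTRY.size

    def add(self, word, good=0, spam=0):
        # Counter overflow past 32 bits is not handled in this sketch.
        off = self._offset(word)
        g, s = ENTRY.unpack_from(self.map, off)
        ENTRY.pack_into(self.map, off, g + good, s + spam)

    def get(self, word):
        return ENTRY.unpack_from(self.map, self._offset(word))

    def close(self):
        self.map.close()
        self.file.close()


def demo():
    """Register a few counts in a 64 kB store and read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        store = FixedCounts(os.path.join(tmp, "wordlist.fixmap"), 64 * 1024)
        store.add("meeting", good=2)
        store.add("lottery", spam=3)
        store.add("lottery", good=1)
        result = store.get("lottery")
        store.close()
    return result


if __name__ == "__main__":
    print("pairs in 64 kB:", capacity(64 * 1024))
    print("pairs in 1 MB: ", capacity(1024 * 1024))
    print("lottery (good, spam):", demo())
```

The file never grows, so the price of a full vocabulary is paid in accuracy rather than in disk space. Spreading each word over several slots, as in the hologram idea, would be a different scheme from this single-slot sketch.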
I'm interested in hearing feedback. If you want, I can send test data
output at 64 kB and some at 1 MB. My summary of that is linked above.
Cheers,
-Rick
Histogram serving as reference data
score count pct histogram
0.00 3515 66.30 ################################################
0.05 1 0.02 #
0.10 1 0.02 #
0.15 7 0.13 #
0.20 10 0.19 #
0.25 12 0.23 #
0.30 16 0.30 #
0.35 27 0.51 #
0.40 38 0.72 #
0.45 19 0.36 #
0.50 82 1.55 ##
0.55 10 0.19 #
0.60 29 0.55 #
0.65 135 2.55 ##
0.70 7 0.13 #
0.75 22 0.41 #
0.80 63 1.19 #
0.85 14 0.26 #
0.90 28 0.53 #
0.95 1266 23.88 ##################
tot 5302
hapaxes: ham 2593 (48.91%), spam 784 (14.79%)
pure: ham 3515 (66.30%), spam 1257 (23.71%)
Histogram of a 1 MB fixmap
score count pct histogram
0.00 3429 65.90 ################################################
0.05 1 0.02 #
0.10 1 0.02 #
0.15 6 0.12 #
0.20 10 0.19 #
0.25 14 0.27 #
0.30 15 0.29 #
0.35 27 0.52 #
0.40 39 0.75 #
0.45 18 0.35 #
0.50 88 1.69 ##
0.55 10 0.19 #
0.60 32 0.62 #
0.65 148 2.84 ###
0.70 6 0.12 #
0.75 22 0.42 #
0.80 69 1.33 #
0.85 14 0.27 #
0.90 30 0.58 #
0.95 1224 23.52 ##################
tot 5203
hapaxes: ham 2498 (48.01%), spam 751 (14.43%)
pure: ham 3429 (65.90%), spam 1215 (23.35%)
Histogram of a 64 kB fixmap
score count pct histogram
0.00 2260 58.17 ################################################
0.05 0 0.00
0.10 10 0.26 #
0.15 12 0.31 #
0.20 16 0.41 #
0.25 20 0.51 #
0.30 25 0.64 #
0.35 41 1.06 #
0.40 75 1.93 ##
0.45 22 0.57 #
0.50 128 3.29 ###
0.55 16 0.41 #
0.60 48 1.24 ##
0.65 252 6.49 ######
0.70 11 0.28 #
0.75 41 1.06 #
0.80 113 2.91 ###
0.85 33 0.85 #
0.90 35 0.90 #
0.95 727 18.71 ################
tot 3885
hapaxes: ham 1342 (34.54%), spam 416 (10.71%)
pure: ham 2260 (58.17%), spam 722 (18.58%)