Adding a "fixmap" datastore: Fixed-size mmap() files

Matthias Andree matthias.andree at gmx.de
Thu Jul 22 08:30:13 CEST 2021


On 21.07.21 at 14:20, Rick van Rein wrote:
> Hello,
>
> I'm a long-time user/fan of Bogofilter, but am interested in a somewhat
> more general use case, namely sorting mail into topics or perhaps into
> mail aliases.  To enable that, and other larger-than-personal uses, I am
> playing with smaller backend stores.
>
> I created an initial datastore that seems promising enough to mention.
> README, code and Errors are at:
>
> https://gitlab.com/arpa2/bogosort/-/blob/fixedsize/bogofilter/doc/README.fixmap
>
> https://gitlab.com/arpa2/bogosort/-/blob/fixedsize/bogofilter/src/datastore_fixmap.c
>
> https://gitlab.com/arpa2/bogosort/-/blob/fixedsize/bogofilter/doc/fixmap-errors.md
>
> Briefly put, a 64 kB store is tight, but 1 MB looks very good.
> Especially the t.wordhist test shows that nicely.  I am not always sure
> about the interpretation of the test output, I fear.
>
>
> The tests, however, fail massively.  The reason is that the reproduced
> output has small variations: different numeric data and different
> histograms; that is basically what remains in the 1 MB fixmap.  I'm not
> sure how to approach this; the tests make sense but are overly tight for
> my "lossy database" approach.

Rick,

the tests are written for a lossless database, and that is not going to
change.

We have tools, for instance contrib/bogominitrain.pl by Boris Piwinger,
to help with training for small databases.  Using such a database
without bogofilter's "-u" (auto-train) option should satisfy many needs
for small databases.

> For another version, I am brooding on a generalisation that spreads data
> more evenly around the database entries, a bit like in a hologram.  This
> should reduce the impact of word clashes, and evenly spread out their
> distortion.  As a result, even a 64 kB fixmap should be quite good; it
> has a capacity of a little over 8000 good/spam pairs, which is more than
> a common vocabulary.  (And I'm guessing that headers et al are not too
> upsetting in that respect.)

Sorry to say that this is too little information for me to grok your
idea.  Can you elaborate on this a bit?

Which "word clashes" are you thinking of, and what problems do you see?
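
To make the question concrete, here is a minimal sketch of one possible
reading of a "word clash".  It is an assumption, not taken from
datastore_fixmap.c: each token is hashed to one slot of a fixed-size
table, and a slot holds a 32-bit good count and a 32-bit spam count.
With 8 bytes per slot, a 64 kB store has 65536 / 8 = 8192 slots, which
would match "a little over 8000 good/spam pairs".  The names slot_t,
fnv1a, slot_index and record are hypothetical, the hash is picked only
for illustration, and the mmap() of the backing file is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical slot layout, NOT taken from datastore_fixmap.c:
 * one 32-bit good count plus one 32-bit spam count = 8 bytes. */
typedef struct {
    uint32_t good;
    uint32_t spam;
} slot_t;

#define STORE_BYTES (64u * 1024u)
#define NUM_SLOTS   (STORE_BYTES / sizeof(slot_t))   /* 65536 / 8 = 8192 */

/* FNV-1a string hash, chosen only for illustration. */
static uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

/* Map a token to its slot.  Distinct tokens can map to the same slot;
 * that is the "clash" assumed in this sketch. */
static size_t slot_index(const char *token)
{
    return fnv1a(token) % NUM_SLOTS;
}

/* Count one occurrence of a token in a spam or non-spam message.
 * Tokens sharing a slot are indistinguishable afterwards. */
static void record(slot_t *store, const char *token, int is_spam)
{
    assert(store != NULL);
    slot_t *s = &store[slot_index(token)];
    if (is_spam)
        s->spam++;
    else
        s->good++;
}
```

Under these assumptions, two different tokens that land on the same
slot have their counts summed, so the lookup for either one returns the
merged numbers.  Is that the kind of distortion you mean?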

Cheers,
Matthias

