DB backend support for lmdb?

Steffen Nurpmeso steffen at sdaoden.eu
Tue May 29 20:51:11 CEST 2018


Ahoi.

Matthias Andree wrote:
 |On 28.05.2018 at 23:57, Steffen Nurpmeso wrote:
 |>|Note that -u incurs writes so is prone to whatever consistency and
 |>|durability models the database uses - and it will hurt with LMDB too due
 |>|to its "one writer only", so you serialize those processes if you use
 |>|"-u" just as much as you would with -n or -s (unless you have lots of
 |>|unsures, when -u is useless).
 |> 
 |> Hmm, likely the test was not thorough enough.  Sorry.  Yes, of
 |> course, but likely nothing new happened because the same messages
 |> had been passed (except counter increments?); the sqlite version
 |> did change the size, and the DB one did not, that was definitely true.
 |
 |It depends on whether the write causes page overflows and new pages to
 |be created, or not.  Certainly using SQLite is overkill because we still
 |used it as a key/value store... but it was a user request at the time
 |and I have yet to see reports about other trouble than it not being the
 |fastest or writing the smallest databases.

Please, i have absolutely nothing against sqlite; i have used it
myself.  It may even be interesting to have the possibility to
look into the tables and access them with SQL statements, for
those who have an interest in that.
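
For illustration only, a minimal Python sketch of what "SQLite as a
key/value store" can look like: one table keyed by token with two
counters.  The table and column names (wordlist, token, spam, ham) and
the helper functions are hypothetical; this is not bogofilter's actual
schema, it only shows that such a store stays reachable with plain SQL.

```python
import sqlite3

def open_store(path=":memory:"):
    # Hypothetical schema: one row per token, two counters.
    con = sqlite3.connect(path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS wordlist ("
        "token TEXT PRIMARY KEY, "
        "spam INTEGER NOT NULL DEFAULT 0, "
        "ham INTEGER NOT NULL DEFAULT 0)"
    )
    return con

def register(con, tokens, is_spam):
    # Increment the matching counter, inserting the token first if it is new.
    col = "spam" if is_spam else "ham"
    with con:  # one transaction per registered message
        for tok in tokens:
            con.execute("INSERT OR IGNORE INTO wordlist (token) VALUES (?)", (tok,))
            con.execute(f"UPDATE wordlist SET {col} = {col} + 1 WHERE token = ?", (tok,))

def lookup(con, token):
    # Return (spam, ham) counts; unknown tokens count as (0, 0).
    row = con.execute(
        "SELECT spam, ham FROM wordlist WHERE token = ?", (token,)
    ).fetchone()
    return row if row is not None else (0, 0)

con = open_store()
register(con, ["offer", "hello"], is_spam=True)
register(con, ["hello"], is_spam=False)
```

Registering the same tokens again only bumps counters in existing
rows, which is the kind of write that need not allocate new pages.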

 |>|>|The database implementation needs some support logic in
 |>|>|bogofilter/configure.ac and bogofilter/src/Makefile.am
 |>|> 
 |>|> That surely is the very very hard part.  :)
 |>|
 |>|Not if you know autotools (autoconf/automake) a bit. O:-)
 |> 
 |> It seems to be known that i really dislike this!  Yes, that is
 |> true; of course this stuff is powerful, and depending on the
 |> project it may be just ok.  For me i dislike it and like projects
 |> which only test what they really need and possibly work fine with
 |> a simple "make".  Take bogofilter, for example.  Surely the m4/ is
 |> pretty small, but the configuration performs many tests which
 |> could be combined (for example, the integer typedefs), and then
 |> tests aclocal, automake and autoconf (why?), ..and then.. runs
 |> config.status --recheck and all the stuff is tested once again!
 |> Then compilation starts.  Yay.
 |
 |A --recheck should not happen on the tarballs, except if your clock
 |(system or file system timestamps) is very coarse or non-monotonic.
 |After "svn update" it will usually happen.

Maybe because i have it in git, and all those repos are reduced to
a "null" branch to save backup space.  If i need something i check
out the "master" and compile that.  And git does not restore file
times when the checkout happens.  Ok, so maybe also my fault.
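
A guess at the mechanism, as a small Python sketch: make-style tools
compare modification times, and a fresh checkout stamps files with the
time they were written, so a generated file can end up older than its
source.  The file names are placeholders and is_stale() is a
hypothetical helper; the real automake rebuild rules are more involved.

```python
import os
import tempfile

def is_stale(target, prerequisite):
    # make-style rule: rebuild when the target is missing or older
    # than its prerequisite.
    if not os.path.exists(target):
        return True
    return os.stat(target).st_mtime_ns < os.stat(prerequisite).st_mtime_ns

def demo():
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "configure.ac")
        gen = os.path.join(d, "configure")
        for p in (src, gen):
            with open(p, "w") as f:
                f.write("")
        # Tarball-like state: the generated file is newer than its source.
        os.utime(src, ns=(1_000_000_000, 1_000_000_000))
        os.utime(gen, ns=(2_000_000_000, 2_000_000_000))
        from_tarball = is_stale(gen, src)
        # Checkout-like state: the source ends up with the later timestamp.
        os.utime(src, ns=(3_000_000_000, 3_000_000_000))
        from_checkout = is_stale(gen, src)
        return from_tarball, from_checkout
```

In the second state the generated file looks outdated although its
content did not change, which would be enough to trigger a re-run.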

 |If it only hurts developers, that's a non-issue, and I paid attention to
 |write the configure cache out in strategic places.  You know that when I
 |started developing bogofilter c. 15 years ago, computers and disk drives
 |were a lot slower.
 |
 |./configure -C # ...  advised. :-)

Yes, i know the former.  The latter i did not, i will try it;
especially today, when many projects with submodules require multiple
configuration runs, i hope that helps.  I have added it to all
configure runs in my (extern.)code.arena makefile.  Thanks!

 |The autotools stuff is in place, works reliably, has a rich feature set,
 |and with recent automake implementations, "make check" tests run in
 |parallel.  Do that with SSD on a modern octocore computer and see it fly
 |in spite of a recursive Makefile structure, or perhaps with /tmp a
 |RAMDISK and then make check BF_TESTDIR=/tmp -- I wouldn't want to use an
 |early Raspberry Pi as development platform though.  Deployment is
 |another matter.

hmmhmm, yes, well.. ;)

 |>|It depends. SQLite uses the same extension, .db, and Berkeley DB has two
 |>|modes, one is the plain old (which is not ACID compliant) and has only
 |>|the one [wordlist].db file - I advise against using that unless you can
 |>|recreate the database anytime from saved spam/ham corpora.
 |> 
 |> Ah!  This i did not know, i have always worked with packages until
 |> just recently, after finding the space and performance issue
 |> i compiled on my own the first time.  Sorry, sorry; i have read
 |> the bogofilter manual once when i had thrown away the homebrew
 |> junk mail code from the MUA i maintain, in order to create
 |> a working environment for the new mail code -- in summer 2013.
 |
 |bogofilter -V tells you the database it is using, and there's doc/README.db.
 |
 |>|The other is the transactional mode (advertised as the Berkeley DB
 |>|Transactional Data Store) that writes additional files, for instance,
 |>|trivially:
 |>|
 |>|__db.001
 |>|__db.002
 |>|__db.003
 |>|lockfile-d
 |>|lockfile-p
 |>|log.0000000001
 |>|wordlist.db
 |> 
 |> Yes, that is what i knew, and this is what i get after
 |> recompilation with --enable-transactions.  Thanks for the
 |> information!
 |
 |And perhaps after running any registering stuff with
 |--db-transaction=yes once. And perhaps undoing the same registration.

It is a powerful tool, and i am using only 5 percent of it.
(That command line option does not appear in the manual page.)

 |> It seems to me there is quite a lot of context that i did not know
 |> about.  So i think implementing LMDB support will not be a quick
 |> shot if done right, i need to read a lot of documentation from
 |> LMDB and source code from Bogofilter.  So it may take a little bit
 |
 |LMDB should be less of a hassle - and on second thought, it might be
 |easier to start the implementation off after reading the Berkeley DB
 |*and* Kyotocabinet (or Tokyocabinet) implementations to see the simpler
 |ones.

Will do, soon.

 |> longer until i can provide a patch -- nonetheless, i am definitely
 |> very interested in LMDB support for bogofilter, if doable, because
 |> it is very small (the raw AlpineLinux code package is 90KB,
 |> whereas DB is 1.6MB; the cloned repo is 1.2MB, whereas the 5.3.28
 |> DB tar ball unpacked in git is 31MB), and the code is also open
 |> and openly maintained.  And Postfix supports LMDB as a replacement
 |> for DB out of the box, too.  All this is very desirable to me.
 |
 |Repo size of a support library isn't normally a relevant metric, but
 |this is a valid point, as is its license:
 |
 |   text          data     bss     dec     hex filename
 |  80510          1504       8   82022   14066 /usr/lib64/liblmdb.so

Runtime is much smaller here, too:

  #?0[steffen@essex nail.git]$ size /usr/lib/liblmdb.so
     text    data     bss     dec     hex filename
    69680    1344      80   71104   115c0 /usr/lib/liblmdb.so
  #?0[steffen@essex nail.git]$ size /usr/lib/libdb.so
     text    data     bss     dec     hex filename
  1549515   38744      64 1588323  183c63 /usr/lib/libdb.so
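
To put a number on that, a small Python sketch that parses the two
size(1) outputs above and compares the "dec" totals.  parse_size() is
a hypothetical helper written for this mail, and it assumes the
Berkeley-style column layout shown here.

```python
def parse_size(output):
    # Parse size(1) output: one header line, then one row per file.
    lines = [ln.split() for ln in output.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    # Map each filename (last column) to its named numeric columns.
    return {row[-1]: dict(zip(header[:-1], row[:-1])) for row in rows}

LMDB = """   text    data     bss     dec     hex filename
  69680    1344      80   71104   115c0 /usr/lib/liblmdb.so"""
DB = """   text    data     bss     dec     hex filename
1549515   38744      64 1588323  183c63 /usr/lib/libdb.so"""

lmdb_dec = int(parse_size(LMDB)["/usr/lib/liblmdb.so"]["dec"])
db_dec = int(parse_size(DB)["/usr/lib/libdb.so"]["dec"])
ratio = db_dec / lmdb_dec  # libdb.so total relative to liblmdb.so
```

On these two builds libdb.so comes out at roughly 22 times the size of
liblmdb.so.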

I am looking forward to this.
Ciao, and thanks for the information!

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)


