From 0bf9d06e8b090e2d9783d03074f3752ed708f6cf Mon Sep 17 00:00:00 2001
From: Leonid Yuriev
Date: Fri, 27 Nov 2020 16:31:12 +0300
Cc: Heiko Thiery
Cc: Thomas Petazzoni
Subject: [PATCH v5 0/1] cover letter for package/libmdbx: new package (library/database)

This patch adds libmdbx v0.9.2; a brief overview of libmdbx follows.

Please merge.

Regards,
Leonid.

--

libmdbx is an extremely fast, compact, powerful, embedded, transactional
key-value database with a permissive license. libmdbx has a specific set
of properties and capabilities, focused on creating unique lightweight
solutions.

Historically, libmdbx (MDBX) is a deeply revised and extended descendant
of the legendary LMDB (Lightning Memory-Mapped Database). libmdbx
inherits all the benefits of LMDB, but resolves some issues and adds a
set of improvements.

According to its developers, libmdbx currently surpasses LMDB in terms
of reliability, features and performance.

The most important differences between MDBX and LMDB:
=====================================================

1. More attention is paid to the quality of the code, to the
   "unbreakability" of the API, to testing and to automatic checks
   (i.e. sanitizers, etc). As a result:
   - more control during operation;
   - more parameter checking and internal auditing of database
     structures;
   - no compiler warnings;
   - no issues reported by ASAN, UBSAN, Valgrind, Coverity;
   - etc.

2. Keys can be more than twice as long as in LMDB.

3. Up to 20% faster than LMDB in CRUD benchmarks.

4. Automatic on-the-fly database size adjustment, both increment and
   reduction.

5. Automatic continuous zero-overhead database compactification.

6. The same database format for 32- and 64-bit builds.

7. LIFO policy for Garbage Collection recycling (thanks to the
   write-back disk cache, this can increase write performance by up to
   several times in the best case).

8. Range query estimation.

9. Utility for checking the integrity of the database structure, with
   some recovery capabilities.

For more information, please refer to:
 - https://github.com/erthink/libmdbx for source code and README.
 - https://erthink.github.io/libmdbx for API description.

--

MDBX is a Btree-based database management library modeled loosely on the
BerkeleyDB API, but much simplified. The entire database (aka
"environment") is exposed in a memory map, and all data fetches return
data directly from the mapped memory, so no malloc or memcpy calls occur
during data fetches. As such, the library is extremely simple because it
requires no page caching layer of its own, and it is extremely high
performance and memory-efficient. It is also fully transactional with
full ACID semantics, and when the memory map is read-only, the database
integrity cannot be corrupted by stray pointer writes from application
code.

The library is fully thread-aware and supports concurrent read/write
access from multiple processes and threads. Data pages use a
copy-on-write strategy so no active data pages are ever overwritten,
which also provides resistance to corruption and eliminates the need for
any special recovery procedures after a system crash. Writes are fully
serialized; only one write transaction may be active at a time, which
guarantees that writers can never deadlock. The database structure is
multi-versioned so readers run with no locks; writers cannot block
readers, and readers don't block writers.

Unlike other well-known database mechanisms which use either write-ahead
transaction logs or append-only data writes, MDBX requires no
maintenance during operation. Both write-ahead loggers and append-only
databases require periodic checkpointing and/or compaction of their log
or database files, otherwise they grow without bound. MDBX tracks
retired/freed pages within the database and re-uses them for new write
operations, so the database size does not grow without bound in normal
use.

The memory map can be used as a read-only or read-write map.
It is read-only by default as this provides total immunity to
corruption. Using read-write mode offers much higher write performance,
but adds the possibility for stray application writes through pointers
to silently corrupt the database.

Features
========

- Key-value data model, keys are always sorted.

- Fully ACID-compliant, through MVCC and CoW.

- Multiple key-value sub-databases within a single datafile.

- Range lookups, including range query estimation.

- Efficient support for short fixed-length keys, including native
  32/64-bit integers.

- Ultra-efficient support for multimaps. Multi-values are sorted,
  searchable and iterable. Keys are stored without duplication.

- Data is memory-mapped and accessible directly/zero-copy. Traversal of
  database records is extremely fast.

- Transactions for readers and writers; neither blocks the other.

- Writes are strongly serialized. No transaction conflicts or deadlocks.

- Readers are non-blocking, notwithstanding snapshot isolation.

- Nested write transactions.

- Reads scale linearly across CPUs.

- Continuous zero-overhead database compactification.

- Automatic on-the-fly database size adjustment.

- Customizable database page size.

- O(log N) cost of lookup, insert, update, and delete operations by
  virtue of B+ tree characteristics.

- Online hot backup.

- Append operation for efficient bulk insertion of pre-sorted data.

- No WAL or any other transaction journal. No crash recovery needed.
  No maintenance is required.

- No internal cache and/or memory management, all done by basic OS
  services.

Limitations
===========

- Page size: a power of 2, maximum 65536 bytes, default 4096 bytes.

- Key size: minimum 0, maximum ≈¼ pagesize (1300 bytes for default 4K
  pagesize, 21780 bytes for 64K pagesize).

- Value size: minimum 0, maximum 2146435072 (0x7FF00000) bytes for maps,
  ≈¼ pagesize for multimaps (1348 bytes for default 4K pagesize, 21828
  bytes for 64K pagesize).

- Write transaction size: up to 4194301 (0x3FFFFD) pages (16 GiB for
  default 4K pagesize, 256 GiB for 64K pagesize).

- Database size: up to 2147483648 pages (8 TiB for default 4K pagesize,
  128 TiB for 64K pagesize).

- Maximum sub-databases: 32765.

Gotchas
=======

- There cannot be more than one writer at a time, i.e. no more than one
  write transaction at a time.

- libmdbx is based on a B+ tree, so access to database pages is mostly
  random. Thus SSDs provide a significant performance boost over
  spinning disks for large databases.

- libmdbx uses shadow paging instead of WAL. Thus syncing data to disk
  might be a bottleneck for write-intensive workloads.

- libmdbx uses copy-on-write for snapshot isolation during updates, but
  read transactions prevent recycling of old retired/freed pages, since
  readers are still using them. Thus altering data during a parallel
  long-lived read operation will increase the process working set, may
  exhaust the entire free database space, can make the database grow
  quickly, and can result in performance degradation. Try to avoid
  long-running read transactions.

- libmdbx is extraordinarily fast and provides minimal overhead for data
  access, so you should reconsider using brute force techniques and
  double check your code. On the one hand, in the case of libmdbx, a
  simple linear search may be more profitable than complex indexes. On
  the other hand, if you do something suboptimally, you may notice the
  detriment only on sufficiently large data.

--

Leonid Yuriev (1):
  package/libmdbx: new package (library/database).

 DEVELOPERS                   |  3 +++
 package/Config.in            |  1 +
 package/libmdbx/Config.in    | 45 ++++++++++++++++++++++++++++++++++++
 package/libmdbx/libmdbx.hash |  5 ++++
 package/libmdbx/libmdbx.mk   | 33 ++++++++++++++++++++++++++
 5 files changed, 87 insertions(+)
 create mode 100644 package/libmdbx/Config.in
 create mode 100644 package/libmdbx/libmdbx.hash
 create mode 100644 package/libmdbx/libmdbx.mk

--
2.29.2
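
As a quick cross-check (not part of the patch itself), the size figures
quoted under Limitations are simply the page counts multiplied by the
page size. A minimal Python sketch of that arithmetic follows; the page
counts and page sizes come from the text above, while the helper name
`human` and the variable names are made up for this illustration only.

```python
# Cross-check of the size limits quoted in the Limitations section.
# Page counts and page sizes are taken from the cover letter text;
# `human`, MAX_TXN_PAGES, MAX_DB_PAGES and MAX_VALUE_BYTES are
# illustrative names, not libmdbx identifiers.

KIB, MIB, GIB, TIB = 2**10, 2**20, 2**30, 2**40


def human(nbytes):
    """Format a byte count using the largest binary unit that fits."""
    for unit, size in (("TiB", TIB), ("GiB", GIB), ("MiB", MIB), ("KiB", KIB)):
        if nbytes >= size:
            return f"{nbytes / size:.2f} {unit}"
    return f"{nbytes} B"


MAX_TXN_PAGES = 0x3FFFFD       # 4194301 pages per write transaction
MAX_DB_PAGES = 2147483648      # 2**31 pages per database
MAX_VALUE_BYTES = 0x7FF00000   # 2146435072 bytes per value (maps)

# Map each page size to (max write transaction bytes, max database bytes).
limits = {}
for page_size in (4096, 65536):
    limits[page_size] = (MAX_TXN_PAGES * page_size, MAX_DB_PAGES * page_size)
    txn_bytes, db_bytes = limits[page_size]
    print(f"page size {page_size}: "
          f"write txn <= {human(txn_bytes)}, database <= {human(db_bytes)}")
```

The write transaction limits come out three pages short of exactly
16 GiB and 256 GiB, which is why they only match the quoted figures
after rounding; the database limits are exact powers of two.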