mdbx: made README content less ugly.

Change-Id: I537ab63a2d8a1cd3b84d5865f689ee53a29d4ad4
Leonid Yuriev 2019-07-16 03:16:25 +03:00
parent 4adb1ab2d8
commit 7c7d5f4434

README.md

@@ -54,8 +54,8 @@ and free Continuous Integration service will be available.
 - [Main features](#main-features)
 - [Improvements over LMDB](#improvements-over-lmdb)
 - [Gotchas](#gotchas)
-- [Long-time read transactions problem](#long-time-read-transactions-problem)
-- [Data safety in async-write-mode](#data-safety-in-async-write-mode)
+- [Problem of long-time reading](#problem-of-long-time-reading)
+- [Durability in asynchronous writing mode](#durability-in-asynchronous-writing-mode)
 - [Performance comparison](#performance-comparison)
 - [Integral performance](#integral-performance)
 - [Read scalability](#read-scalability)
@@ -72,42 +72,31 @@ for performance under Linux and Windows.
 _libmdbx_ allows multiple processes to read and update several key-value
 tables concurrently, while being
 [ACID](https://en.wikipedia.org/wiki/ACID)-compliant, with minimal
-overhead and operation cost of Olog(N).
+overhead and Olog(N) operation cost.
-_libmdbx_ provides
-[serializability](https://en.wikipedia.org/wiki/Serializability) and
-consistency of data after crash. Read-write transactions don't block
-read-only transactions and are
-[serialized](https://en.wikipedia.org/wiki/Serializability) by
-[mutex](https://en.wikipedia.org/wiki/Mutual_exclusion).
-_libmdbx_
-[wait-free](https://en.wikipedia.org/wiki/Non-blocking_algorithm#Wait-freedom)
-provides parallel read transactions without atomic operations or
-synchronization primitives.
+_libmdbx_ enforce [serializability](https://en.wikipedia.org/wiki/Serializability) for writers by single [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion) and affords [wait-free](https://en.wikipedia.org/wiki/Non-blocking_algorithm#Wait-freedom) for parallel readers without atomic/interlocked operations, while writing and reading transactions do not block each other.
+_libmdbx_ can guarantee consistency after crash depending of operation mode.
 _libmdbx_ uses [B+Trees](https://en.wikipedia.org/wiki/B%2B_tree) and
-[mmap](https://en.wikipedia.org/wiki/Memory-mapped_file), doesn't use
-[WAL](https://en.wikipedia.org/wiki/Write-ahead_logging). This might
-have caveats for some workloads.
+[Memory-Mapping](https://en.wikipedia.org/wiki/Memory-mapped_file), doesn't use
+[WAL](https://en.wikipedia.org/wiki/Write-ahead_logging) which
+might be a caveat for some workloads.
 ### Comparison with other DBs
-Because _libmdbx_ is currently overhauled, I think it's better to just
-link [chapter of Comparison with other
-databases](https://github.com/coreos/bbolt#comparison-with-other-databases)
-here.
+For now please refer to [chapter of "BoltDB comparison with other
+databases"](https://github.com/coreos/bbolt#comparison-with-other-databases)
+which is also (mostly) applicable to MDBX.
 ### History
 The _libmdbx_ design is based on [Lightning Memory-Mapped
 Database](https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database).
-Initial development was going in
-[ReOpenLDAP](https://github.com/leo-yuriev/ReOpenLDAP) project, about a
-year later it received separate development effort and in autumn 2015
-was isolated to separate project, which was [presented at Highload++
+Initial development was going in [ReOpenLDAP](https://github.com/leo-yuriev/ReOpenLDAP) project.
+About a year later libmdbx was isolated to separate project, which was [presented at Highload++
 2015 conference](http://www.highload.ru/2015/abstracts/1831.html).
-Since early 2017 _libmdbx_ is used in [Fast PositiveTables](https://github.com/leo-yuriev/libfpta),
-by [Positive Technologies](https://www.ptsecurity.com).
+Since early 2017 _libmdbx_ is used in [Fast Positive Tables](https://github.com/leo-yuriev/libfpta),
+and development is funded by [Positive Technologies](https://www.ptsecurity.com).
 #### Acknowledgments
 Howard Chu (Symas Corporation) - the author of LMDB, from which
@@ -143,10 +132,10 @@ don't use [atomic
 operations](https://en.wikipedia.org/wiki/Linearizability#High-level_atomic_operations).
 Readers don't block each other and aren't blocked by writers. Read
 performance scales linearly with CPU core count.
-> Though "connect to DB" (start of first read transaction in thread) and
+> Nonetheless, "connect to DB" (start of first read transaction in thread) and
 > "disconnect from DB" (shutdown or thread termination) requires to
 > acquire a lock to register/unregister current thread from "readers
-> table"
+> table".
 5. Keys with multiple values are stored efficiently without key
 duplication, sorted by value, including integers (reasonable for
@@ -201,7 +190,7 @@ optimal query execution plan.
 6. Support for keys and values of zero length, including sorted
 duplicates.
-7. Ability to assign up to 3 markers to commiting transaction with
+7. Ability to assign up to 3 persistent 64-bit markers to commiting transaction with
 `mdbx_canary_put()` and then get them in read transaction by
 `mdbx_canary_get()`.
@@ -346,7 +335,7 @@ performance bottleneck in `MAPASYNC` mode.
 > storage then it's much more preferable to use `std::map`.
-4. LMDB has a problem of long-time readers which degrades performance
+4. _LMDB_ has a problem of long-time readers which degrades performance
 and bloats DB.
 > _libmdbx_ addresses that, details below.
@@ -357,56 +346,41 @@ of data.
 > Details below.
-#### Long-time read transactions problem
+#### Problem of long-time reading
 Garbage collection problem exists in all databases one way or another
 (e.g. VACUUM in PostgreSQL). But in _libmdbx_ and LMDB it's even more
-important because of high performance and deliberate simplification of
-internals with emphasis on performance.
+discernible because of high transaction rate and intentional internals
+simplification in favor of performance.
+Understanding the problem requires some explanation, but can be
+difficult for quick perception. So is is reasonable
+to simplify this as follows:
-* Altering data during long read operation may exhaust available space
-on persistent storage.
+* Massive altering of data during a parallel long read operation may
+exhaust the free DB space.
-* If available space is exhausted then any attempt to update data
-results in `MAP_FULL` error until long read operation ends.
+* If the available space is exhausted, any attempt to update the data
+* will cause a "MAP_FULL" error until a long read transaction is completed.
-* Main examples of long readers is hot backup and debugging of client
-application which actively uses read transactions.
+* A good example of long readers is a hot backup or debugging of
+a client application while retaining an active read transaction.
 * In _LMDB_ this results in degraded performance of all operations of
-syncing data to persistent storage.
+writing data to persistent storage.
-* _libmdbx_ has a mechanism which aborts such operations and `LIFO RECLAIM`
-mode which addresses performance degradation.
+* _libmdbx_ has the `OOM-KICK` mechanism which allow to abort such
+operations and the `LIFO RECLAIM` mode which addresses performance
+degradation.
-Read operations operate only over snapshot of DB which is consistent on
-the moment when read transaction started. This snapshot doesn't change
-throughout the transaction but this leads to inability to reclaim the
-pages until read transaction ends.
-In _LMDB_ this leads to a problem that memory pages, allocated for
-operations during long read, will be used for operations and won't be
-reclaimed until DB process terminates. In _LMDB_ they are used in
-[FIFO](https://en.wikipedia.org/wiki/FIFO_(computing_and_electronics))
-manner, which causes increased page count and less chance of cache hit
-during I/O. In other words: one long-time reader can impact performance
-of all database until it'll be reopened.
-_libmdbx_ addresses the problem, details below. Illustrations to this
-problem can be found in the
-[presentation](http://www.slideshare.net/leoyuriev/lmdb). There is also
-example of performance increase thanks to
-[BBWC](https://en.wikipedia.org/wiki/Disk_buffer#Write_acceleration)
-when `LIFO RECLAIM` enabled in _libmdbx_.
-#### Data safety in async-write mode
+#### Durability in asynchronous writing mode
-In `WRITEMAP+MAPSYNC` mode dirty pages are written to persistent storage
-by kernel. This means that in case of application crash OS kernel will
-write all dirty data to disk and nothing will be lost. But in case of
-hardware malfunction or OS kernel fatal error only some dirty data might
-be synced to disk, and there is high probability that pages with
-metadata saved, will point to non-saved, hence non-existent, data pages.
-In such situation, DB is completely corrupted and can't be repaired even
-if there was full sync before the crash via `mdbx_env_sync().
+In `WRITEMAP+MAPSYNC` mode updated (aka dirty) pages are written
+to persistent storage by the OS kernel. This means that if the
+application fails, the OS kernel will finish writing all updated
+data to disk and nothing will be lost.
+However, in the case of hardware malfunction or OS kernel fatal error,
+only some updated data can be written to disk and the database structure
+is likely to be destroyed.
+In such situation, DB is completely corrupted and can't be repaired.
 _libmdbx_ addresses this by fully reimplementing write path of data:
@@ -414,39 +388,38 @@ _libmdbx_ addresses this by fully reimplementing write path of data:
 instead their shadow copies are used and their updates are synced after
 data is flushed to disk.
-* During transaction commit _libmdbx_ marks synchronization points as
-steady or weak depending on how much synchronization needed between RAM
-and persistent storage, e.g. in `WRITEMAP+MAPSYNC` commited transactions
-are marked as weak, but during explicit data synchronization - as
-steady.
+* During transaction commit _libmdbx_ marks it as a steady or weak
+depending on synchronization status between RAM and persistent storage.
+For instance, in the `WRITEMAP+MAPSYNC` mode committed transactions
+are marked as weak by default, but as steady after explicit data flushes.
 * _libmdbx_ maintains three separate meta-pages instead of two. This
-allows to commit transaction with steady or weak synchronization point
-without losing two previous synchronization points (one of them can be
-steady, and second - weak). This allows to order weak and steady
-synchronization points in any order without losing consistency in case
-of system crash.
+allows to commit transaction as steady or weak without losing two
+previous commit points (one of them can be steady, and another
+weak). Thus, after a fatal system failure, it will be possible to
+rollback to the last steady commit point.
-* During DB open _libmdbx_ rollbacks to the last steady synchronization
-point, this guarantees database integrity.
+* During DB open _libmdbx_ rollbacks to the last steady commit point,
+this guarantees database integrity after a crash. However, if the
+database opening in read-only mode, such rollback cannot be performed
+which will cause returning the MDBX_WANNA_RECOVERY error.
-For data safety pages which form database snapshot with steady
-synchronization point must not be updated until next steady
-synchronization point. So last steady synchronization point creates
-"long-time read" effect. The only difference that in case of memory
-exhaustion the problem will be immediately addressed by flushing changes
-to persistent storage and forming new steady synchronization point.
+For data integrity a pages which form database snapshot with steady
+commit point, must not be updated until next steady commit point.
+Therefore the last steady commit point creates an effect analogues to "long-time read".
+The only difference that now in case of space exhaustion the problem
+will be immediately addressed by writing changes to disk and forming
+the new steady commit point.
-So in async-write mode _libmdbx_ will always use new pages until memory
-is exhausted or `mdbx_env_sync()` is invoked. Total disk usage will be
-almost the same as in sync-write mode.
+So in async-write mode _libmdbx_ will always use new pages until the
+free DB space will be exhausted or `mdbx_env_sync()` will be invoked,
+and the total write traffic to the disk will be the same as in sync-write mode.
-Current _libmdbx_ gives a choice of safe async-write mode (default) and
-`UTTERLY_NOSYNC` mode which may result in full DB corruption during
-system crash as with LMDB.
+Currently libmdbx gives a choice between a safe async-write mode (default) and
+`UTTERLY_NOSYNC` mode which may lead to DB corruption after a system crash, i.e. like the LMDB.
-Next version of _libmdbx_ will create steady synchronization points
-automatically in async-write mode.
+Next version of _libmdbx_ will be automatically create steady commit
+points in async-write mode upon completion transfer data to the disk.
 --------------------------------------------------------------------------------
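The steady/weak commit-point scheme described in the changed README text can be illustrated with a toy model. The sketch below is not libmdbx code and does not reflect its on-disk layout: `Meta`, `ToyEnv`, and the slot-selection rule are hypothetical names and assumptions made only to show why three meta-pages are enough to keep the last steady commit point alive while weak commits keep arriving.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Meta:
    txnid: int    # transaction that produced this commit point
    steady: bool  # True if the data was flushed before this meta was written


class ToyEnv:
    """Toy model of three rotating meta-pages (hypothetical, not libmdbx's layout)."""

    def __init__(self) -> None:
        # Start with one steady commit point and two empty slots.
        self.metas: List[Optional[Meta]] = [Meta(0, True), None, None]
        self.txnid = 0

    def last_steady(self) -> Meta:
        # The most recent commit point known to be fully on disk.
        return max((m for m in self.metas if m is not None and m.steady),
                   key=lambda m: m.txnid)

    def newest(self) -> Meta:
        # The most recent commit point of any kind.
        return max((m for m in self.metas if m is not None),
                   key=lambda m: m.txnid)

    def commit(self, steady: bool) -> Meta:
        # Assumed rule: never overwrite the last steady meta; among the other
        # two slots take an empty one first, otherwise the oldest one.
        self.txnid += 1
        keep = self.last_steady()
        candidates = [i for i, m in enumerate(self.metas) if m is not keep]
        slot = min(candidates,
                   key=lambda i: -1 if self.metas[i] is None else self.metas[i].txnid)
        self.metas[slot] = Meta(self.txnid, steady)
        return self.metas[slot]

    def recover(self) -> int:
        # Model of opening after a system crash: drop every commit point
        # newer than the last steady one and continue from there.
        keep = self.last_steady()
        self.metas = [m if m is not None and m.txnid <= keep.txnid else None
                      for m in self.metas]
        self.txnid = keep.txnid
        return keep.txnid


if __name__ == "__main__":
    env = ToyEnv()
    env.commit(steady=True)        # e.g. a commit followed by an explicit flush
    for _ in range(5):
        env.commit(steady=False)   # weak commits in async-write mode
    print("rolled back to txn", env.recover())
```

In this model the last steady meta acts like a long-running reader: the pages it references cannot be recycled until a newer steady commit point replaces it, which matches the "long-time read" analogy in the text above.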