mirror of
https://github.com/isar/libmdbx.git
synced 2025-01-04 17:14:12 +08:00
mdbx: made README content less ugly.
Change-Id: I537ab63a2d8a1cd3b84d5865f689ee53a29d4ad4
This commit is contained in:
parent
4adb1ab2d8
commit
7c7d5f4434
167
README.md
@ -54,8 +54,8 @@ and free Continuous Integration service will be available.
|
|||||||
- [Main features](#main-features)
|
- [Main features](#main-features)
|
||||||
- [Improvements over LMDB](#improvements-over-lmdb)
|
- [Improvements over LMDB](#improvements-over-lmdb)
|
||||||
- [Gotchas](#gotchas)
|
- [Gotchas](#gotchas)
|
||||||
- [Long-time read transactions problem](#long-time-read-transactions-problem)
|
- [Problem of long-time reading](#problem-of-long-time-reading)
|
||||||
- [Data safety in async-write-mode](#data-safety-in-async-write-mode)
|
- [Durability in asynchronous writing mode](#durability-in-asynchronous-writing-mode)
|
||||||
- [Performance comparison](#performance-comparison)
|
- [Performance comparison](#performance-comparison)
|
||||||
- [Integral performance](#integral-performance)
|
- [Integral performance](#integral-performance)
|
||||||
- [Read scalability](#read-scalability)
|
- [Read scalability](#read-scalability)
|
||||||
@ -72,42 +72,31 @@ for performance under Linux and Windows.
|
|||||||
_libmdbx_ allows multiple processes to read and update several key-value
|
_libmdbx_ allows multiple processes to read and update several key-value
|
||||||
tables concurrently, while being
|
tables concurrently, while being
|
||||||
[ACID](https://en.wikipedia.org/wiki/ACID)-compliant, with minimal
|
[ACID](https://en.wikipedia.org/wiki/ACID)-compliant, with minimal
|
||||||
overhead and operation cost of Olog(N).
|
overhead and O(log N) operation cost.
|
||||||
|
|
||||||
_libmdbx_ provides
|
_libmdbx_ enforces [serializability](https://en.wikipedia.org/wiki/Serializability) for writers with a single [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion) and provides [wait-free](https://en.wikipedia.org/wiki/Non-blocking_algorithm#Wait-freedom) parallel readers without atomic/interlocked operations, while write and read transactions do not block each other.
|
||||||
[serializability](https://en.wikipedia.org/wiki/Serializability) and
|
|
||||||
consistency of data after crash. Read-write transactions don't block
|
|
||||||
read-only transactions and are
|
|
||||||
[serialized](https://en.wikipedia.org/wiki/Serializability) by
|
|
||||||
[mutex](https://en.wikipedia.org/wiki/Mutual_exclusion).
|
|
||||||
|
|
||||||
_libmdbx_
|
_libmdbx_ can guarantee consistency after a crash, depending on the operation mode.
|
||||||
[wait-free](https://en.wikipedia.org/wiki/Non-blocking_algorithm#Wait-freedom)
|
|
||||||
provides parallel read transactions without atomic operations or
|
|
||||||
synchronization primitives.
|
|
||||||
|
|
||||||
_libmdbx_ uses [B+Trees](https://en.wikipedia.org/wiki/B%2B_tree) and
|
_libmdbx_ uses [B+Trees](https://en.wikipedia.org/wiki/B%2B_tree) and
|
||||||
[mmap](https://en.wikipedia.org/wiki/Memory-mapped_file), doesn't use
|
[Memory-Mapping](https://en.wikipedia.org/wiki/Memory-mapped_file), doesn't use
|
||||||
[WAL](https://en.wikipedia.org/wiki/Write-ahead_logging). This might
|
[WAL](https://en.wikipedia.org/wiki/Write-ahead_logging) which
|
||||||
have caveats for some workloads.
|
might be a caveat for some workloads.
|
||||||
|
|
||||||
### Comparison with other DBs
|
### Comparison with other DBs
|
||||||
Because _libmdbx_ is currently overhauled, I think it's better to just
|
For now please refer to [chapter of "BoltDB comparison with other
|
||||||
link [chapter of Comparison with other
|
databases"](https://github.com/coreos/bbolt#comparison-with-other-databases)
|
||||||
databases](https://github.com/coreos/bbolt#comparison-with-other-databases)
|
which is also (mostly) applicable to MDBX.
|
||||||
here.
|
|
||||||
|
|
||||||
### History
|
### History
|
||||||
The _libmdbx_ design is based on [Lightning Memory-Mapped
|
The _libmdbx_ design is based on [Lightning Memory-Mapped
|
||||||
Database](https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database).
|
Database](https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database).
|
||||||
Initial development was going in
|
Initial development took place in the [ReOpenLDAP](https://github.com/leo-yuriev/ReOpenLDAP) project.
|
||||||
[ReOpenLDAP](https://github.com/leo-yuriev/ReOpenLDAP) project, about a
|
About a year later libmdbx was split off into a separate project, which was [presented at Highload++
|
||||||
year later it received separate development effort and in autumn 2015
|
|
||||||
was isolated to separate project, which was [presented at Highload++
|
|
||||||
2015 conference](http://www.highload.ru/2015/abstracts/1831.html).
|
2015 conference](http://www.highload.ru/2015/abstracts/1831.html).
|
||||||
|
|
||||||
Since early 2017 _libmdbx_ is used in [Fast PositiveTables](https://github.com/leo-yuriev/libfpta),
|
Since early 2017 _libmdbx_ is used in [Fast Positive Tables](https://github.com/leo-yuriev/libfpta),
|
||||||
by [Positive Technologies](https://www.ptsecurity.com).
|
and development is funded by [Positive Technologies](https://www.ptsecurity.com).
|
||||||
|
|
||||||
#### Acknowledgments
|
#### Acknowledgments
|
||||||
Howard Chu (Symas Corporation) - the author of LMDB, from which
|
Howard Chu (Symas Corporation) - the author of LMDB, from which
|
||||||
@ -143,10 +132,10 @@ don't use [atomic
|
|||||||
operations](https://en.wikipedia.org/wiki/Linearizability#High-level_atomic_operations).
|
operations](https://en.wikipedia.org/wiki/Linearizability#High-level_atomic_operations).
|
||||||
Readers don't block each other and aren't blocked by writers. Read
|
Readers don't block each other and aren't blocked by writers. Read
|
||||||
performance scales linearly with CPU core count.
|
performance scales linearly with CPU core count.
|
||||||
> Though "connect to DB" (start of first read transaction in thread) and
|
> Nonetheless, "connect to DB" (start of first read transaction in thread) and
|
||||||
> "disconnect from DB" (shutdown or thread termination) requires to
|
> "disconnect from DB" (shutdown or thread termination) requires to
|
||||||
> acquire a lock to register/unregister current thread from "readers
|
> acquire a lock to register/unregister current thread from "readers
|
||||||
> table"
|
> table".
|
||||||
|
|
||||||
5. Keys with multiple values are stored efficiently without key
|
5. Keys with multiple values are stored efficiently without key
|
||||||
duplication, sorted by value, including integers (reasonable for
|
duplication, sorted by value, including integers (reasonable for
|
||||||
@ -201,7 +190,7 @@ optimal query execution plan.
|
|||||||
6. Support for keys and values of zero length, including sorted
|
6. Support for keys and values of zero length, including sorted
|
||||||
duplicates.
|
duplicates.
|
||||||
|
|
||||||
7. Ability to assign up to 3 markers to commiting transaction with
|
7. Ability to assign up to 3 persistent 64-bit markers to a committing transaction with
|
||||||
`mdbx_canary_put()` and then get them in read transaction by
|
`mdbx_canary_put()` and then get them in read transaction by
|
||||||
`mdbx_canary_get()`.
|
`mdbx_canary_get()`.
|
||||||
|
|
||||||
@ -346,7 +335,7 @@ performance bottleneck in `MAPASYNC` mode.
|
|||||||
> storage then it's much more preferable to use `std::map`.
|
> storage then it's much more preferable to use `std::map`.
|
||||||
|
|
||||||
|
|
||||||
4. LMDB has a problem of long-time readers which degrades performance
|
4. _LMDB_ has a problem of long-time readers which degrades performance
|
||||||
and bloats DB.
|
and bloats DB.
|
||||||
> _libmdbx_ addresses that, details below.
|
> _libmdbx_ addresses that, details below.
|
||||||
|
|
||||||
@ -357,56 +346,41 @@ of data.
|
|||||||
> Details below.
|
> Details below.
|
||||||
|
|
||||||
|
|
||||||
#### Long-time read transactions problem
|
#### Problem of long-time reading
|
||||||
Garbage collection problem exists in all databases one way or another
|
Garbage collection problem exists in all databases one way or another
|
||||||
(e.g. VACUUM in PostgreSQL). But in _libmdbx_ and LMDB it's even more
|
(e.g. VACUUM in PostgreSQL). But in _libmdbx_ and LMDB it's even more
|
||||||
important because of high performance and deliberate simplification of
|
discernible because of high transaction rate and intentional internals
|
||||||
internals with emphasis on performance.
|
simplification in favor of performance.
|
||||||
|
|
||||||
* Altering data during long read operation may exhaust available space
|
Understanding the problem requires some explanation, but can be
|
||||||
on persistent storage.
|
difficult to grasp quickly. So it is reasonable
|
||||||
|
to simplify this as follows:
|
||||||
|
|
||||||
* If available space is exhausted then any attempt to update data
|
* Massive altering of data during a parallel long read operation may
|
||||||
results in `MAP_FULL` error until long read operation ends.
|
exhaust the free DB space.
|
||||||
|
|
||||||
* Main examples of long readers is hot backup and debugging of client
|
* If the available space is exhausted, any attempt to update the data
|
||||||
application which actively uses read transactions.
|
will cause a `MAP_FULL` error until the long read transaction is completed.
|
||||||
|
|
||||||
|
* A good example of long readers is a hot backup or debugging of
|
||||||
|
a client application while retaining an active read transaction.
|
||||||
|
|
||||||
* In _LMDB_ this results in degraded performance of all operations of
|
* In _LMDB_ this results in degraded performance of all operations of
|
||||||
syncing data to persistent storage.
|
writing data to persistent storage.
|
||||||
|
|
||||||
* _libmdbx_ has a mechanism which aborts such operations and `LIFO RECLAIM`
|
* _libmdbx_ has the `OOM-KICK` mechanism which allows aborting such
|
||||||
mode which addresses performance degradation.
|
operations and the `LIFO RECLAIM` mode which addresses performance
|
||||||
|
degradation.
|
||||||
|
|
||||||
Read operations operate only over snapshot of DB which is consistent on
|
#### Durability in asynchronous writing mode
|
||||||
the moment when read transaction started. This snapshot doesn't change
|
In `WRITEMAP+MAPASYNC` mode, updated (aka dirty) pages are written
|
||||||
throughout the transaction but this leads to inability to reclaim the
|
to persistent storage by the OS kernel. This means that if the
|
||||||
pages until read transaction ends.
|
application fails, the OS kernel will finish writing all updated
|
||||||
|
data to disk and nothing will be lost.
|
||||||
In _LMDB_ this leads to a problem that memory pages, allocated for
|
However, in the case of hardware malfunction or OS kernel fatal error,
|
||||||
operations during long read, will be used for operations and won't be
|
only some updated data can be written to disk and the database structure
|
||||||
reclaimed until DB process terminates. In _LMDB_ they are used in
|
is likely to be destroyed.
|
||||||
[FIFO](https://en.wikipedia.org/wiki/FIFO_(computing_and_electronics))
|
In such situation, DB is completely corrupted and can't be repaired.
|
||||||
manner, which causes increased page count and less chance of cache hit
|
|
||||||
during I/O. In other words: one long-time reader can impact performance
|
|
||||||
of all database until it'll be reopened.
|
|
||||||
|
|
||||||
_libmdbx_ addresses the problem, details below. Illustrations to this
|
|
||||||
problem can be found in the
|
|
||||||
[presentation](http://www.slideshare.net/leoyuriev/lmdb). There is also
|
|
||||||
example of performance increase thanks to
|
|
||||||
[BBWC](https://en.wikipedia.org/wiki/Disk_buffer#Write_acceleration)
|
|
||||||
when `LIFO RECLAIM` enabled in _libmdbx_.
|
|
||||||
|
|
||||||
#### Data safety in async-write mode
|
|
||||||
In `WRITEMAP+MAPSYNC` mode dirty pages are written to persistent storage
|
|
||||||
by kernel. This means that in case of application crash OS kernel will
|
|
||||||
write all dirty data to disk and nothing will be lost. But in case of
|
|
||||||
hardware malfunction or OS kernel fatal error only some dirty data might
|
|
||||||
be synced to disk, and there is high probability that pages with
|
|
||||||
metadata saved, will point to non-saved, hence non-existent, data pages.
|
|
||||||
In such situation, DB is completely corrupted and can't be repaired even
|
|
||||||
if there was full sync before the crash via `mdbx_env_sync().
|
|
||||||
|
|
||||||
_libmdbx_ addresses this by fully reimplementing write path of data:
|
_libmdbx_ addresses this by fully reimplementing write path of data:
|
||||||
|
|
||||||
@ -414,39 +388,38 @@ _libmdbx_ addresses this by fully reimplementing write path of data:
|
|||||||
instead their shadow copies are used and their updates are synced after
|
instead their shadow copies are used and their updates are synced after
|
||||||
data is flushed to disk.
|
data is flushed to disk.
|
||||||
|
|
||||||
* During transaction commit _libmdbx_ marks synchronization points as
|
* During transaction commit _libmdbx_ marks it as a steady or weak
|
||||||
steady or weak depending on how much synchronization needed between RAM
|
depending on synchronization status between RAM and persistent storage.
|
||||||
and persistent storage, e.g. in `WRITEMAP+MAPSYNC` commited transactions
|
For instance, in the `WRITEMAP+MAPASYNC` mode committed transactions
|
||||||
are marked as weak, but during explicit data synchronization - as
|
are marked as weak by default, but as steady after explicit data flushes.
|
||||||
steady.
|
|
||||||
|
|
||||||
* _libmdbx_ maintains three separate meta-pages instead of two. This
|
* _libmdbx_ maintains three separate meta-pages instead of two. This
|
||||||
allows to commit transaction with steady or weak synchronization point
|
allows committing a transaction as steady or weak without losing two
|
||||||
without losing two previous synchronization points (one of them can be
|
previous commit points (one of them can be steady, and another
|
||||||
steady, and second - weak). This allows to order weak and steady
|
weak). Thus, after a fatal system failure, it will be possible to
|
||||||
synchronization points in any order without losing consistency in case
|
roll back to the last steady commit point.
|
||||||
of system crash.
|
|
||||||
|
|
||||||
* During DB open _libmdbx_ rollbacks to the last steady synchronization
|
* During DB open _libmdbx_ rolls back to the last steady commit point,
|
||||||
point, this guarantees database integrity.
|
this guarantees database integrity after a crash. However, if the
|
||||||
|
database is opened in read-only mode, such a rollback cannot be performed,
|
||||||
|
which causes the `MDBX_WANNA_RECOVERY` error to be returned.
|
||||||
|
|
||||||
For data safety pages which form database snapshot with steady
|
For data integrity, the pages which form a database snapshot with a steady
|
||||||
synchronization point must not be updated until next steady
|
commit point must not be updated until the next steady commit point.
|
||||||
synchronization point. So last steady synchronization point creates
|
Therefore the last steady commit point creates an effect analogous to "long-time read".
|
||||||
"long-time read" effect. The only difference that in case of memory
|
The only difference is that now, in case of space exhaustion, the problem
|
||||||
exhaustion the problem will be immediately addressed by flushing changes
|
will be immediately addressed by writing changes to disk and forming
|
||||||
to persistent storage and forming new steady synchronization point.
|
the new steady commit point.
|
||||||
|
|
||||||
So in async-write mode _libmdbx_ will always use new pages until memory
|
So in async-write mode _libmdbx_ will always use new pages until the
|
||||||
is exhausted or `mdbx_env_sync()` is invoked. Total disk usage will be
|
free DB space is exhausted or `mdbx_env_sync()` is invoked,
|
||||||
almost the same as in sync-write mode.
|
and the total write traffic to the disk will be the same as in sync-write mode.
|
||||||
|
|
||||||
Current _libmdbx_ gives a choice of safe async-write mode (default) and
|
Currently libmdbx gives a choice between a safe async-write mode (default) and
|
||||||
`UTTERLY_NOSYNC` mode which may result in full DB corruption during
|
`UTTERLY_NOSYNC` mode which may lead to DB corruption after a system crash, i.e. like LMDB.
|
||||||
system crash as with LMDB.
|
|
||||||
|
|
||||||
Next version of _libmdbx_ will create steady synchronization points
|
The next version of _libmdbx_ will automatically create steady commit
|
||||||
automatically in async-write mode.
|
points in async-write mode upon completion of data transfer to the disk.
|
||||||
|
|
||||||
--------------------------------------------------------------------------------
|
--------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
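The steady/weak commit-point scheme described in the README text above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `toy_meta`, `toy_commit()`, `toy_recover()` and the slot-selection policy are hypothetical names and logic invented for illustration, not the libmdbx API or its on-disk format. It shows only the idea that, with three meta-page slots, weak commits can keep rotating while the most recent steady commit point is preserved as the rollback target.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified meta-page; NOT the real libmdbx layout. */
typedef struct {
    uint64_t txnid; /* id of the transaction that wrote this slot */
    bool steady;    /* true if data was flushed before this commit point */
} toy_meta;

#define TOY_META_COUNT 3

/* Index of the newest steady slot, or -1 if there is none. */
static int toy_last_steady(const toy_meta m[TOY_META_COUNT]) {
    int best = -1;
    for (int i = 0; i < TOY_META_COUNT; i++)
        if (m[i].steady && (best < 0 || m[i].txnid > m[best].txnid))
            best = i;
    return best;
}

/* Assumed policy: overwrite the oldest slot that is not the newest
 * steady one, so the last steady commit point always survives. */
static int toy_pick_victim(const toy_meta m[TOY_META_COUNT]) {
    int keep = toy_last_steady(m);
    int victim = -1;
    for (int i = 0; i < TOY_META_COUNT; i++) {
        if (i == keep)
            continue;
        if (victim < 0 || m[i].txnid < m[victim].txnid)
            victim = i;
    }
    assert(victim >= 0); /* three slots, at most one protected */
    return victim;
}

/* Record a commit as steady or weak in the chosen slot. */
static void toy_commit(toy_meta m[TOY_META_COUNT], uint64_t txnid, bool steady) {
    int v = toy_pick_victim(m);
    m[v].txnid = txnid;
    m[v].steady = steady;
}

/* After a simulated system crash, the rollback target is the newest
 * steady commit point; 0 means no steady point exists. */
static uint64_t toy_recover(const toy_meta m[TOY_META_COUNT]) {
    int s = toy_last_steady(m);
    return s < 0 ? 0 : m[s].txnid;
}
```

In this model, starting from one steady commit and then making any number of weak commits leaves `toy_recover()` pointing at the original steady transaction; only a later steady commit moves the rollback target forward.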