mirror of
https://github.com/isar/libmdbx.git
synced 2025-01-01 23:54:12 +08:00
mdbx: made README content less ugly.
Change-Id: I537ab63a2d8a1cd3b84d5865f689ee53a29d4ad4
parent
4adb1ab2d8
commit
7c7d5f4434
167
README.md
@ -54,8 +54,8 @@ and free Continuous Integration service will be available.
|
||||
- [Main features](#main-features)
|
||||
- [Improvements over LMDB](#improvements-over-lmdb)
|
||||
- [Gotchas](#gotchas)
|
||||
- [Long-time read transactions problem](#long-time-read-transactions-problem)
|
||||
- [Data safety in async-write-mode](#data-safety-in-async-write-mode)
|
||||
- [Problem of long-time reading](#problem-of-long-time-reading)
|
||||
- [Durability in asynchronous writing mode](#durability-in-asynchronous-writing-mode)
|
||||
- [Performance comparison](#performance-comparison)
|
||||
- [Integral performance](#integral-performance)
|
||||
- [Read scalability](#read-scalability)
|
||||
@ -72,42 +72,31 @@ for performance under Linux and Windows.
|
||||
_libmdbx_ allows multiple processes to read and update several key-value
|
||||
tables concurrently, while being
|
||||
[ACID](https://en.wikipedia.org/wiki/ACID)-compliant, with minimal
|
||||
overhead and operation cost of O(log N).
|
||||
overhead and O(log N) operation cost.
|
||||
|
||||
_libmdbx_ provides
|
||||
[serializability](https://en.wikipedia.org/wiki/Serializability) and
|
||||
consistency of data after crash. Read-write transactions don't block
|
||||
read-only transactions and are
|
||||
[serialized](https://en.wikipedia.org/wiki/Serializability) by
|
||||
[mutex](https://en.wikipedia.org/wiki/Mutual_exclusion).
|
||||
_libmdbx_ enforces [serializability](https://en.wikipedia.org/wiki/Serializability) for writers with a single [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion) and provides [wait-free](https://en.wikipedia.org/wiki/Non-blocking_algorithm#Wait-freedom) parallel readers without atomic/interlocked operations, while writing and reading transactions do not block each other.
|
||||
|
||||
_libmdbx_
|
||||
[wait-free](https://en.wikipedia.org/wiki/Non-blocking_algorithm#Wait-freedom)
|
||||
provides parallel read transactions without atomic operations or
|
||||
synchronization primitives.
|
||||
_libmdbx_ can guarantee consistency after a crash, depending on the operation mode.
|
||||
|
||||
_libmdbx_ uses [B+Trees](https://en.wikipedia.org/wiki/B%2B_tree) and
|
||||
[mmap](https://en.wikipedia.org/wiki/Memory-mapped_file), doesn't use
|
||||
[WAL](https://en.wikipedia.org/wiki/Write-ahead_logging). This might
|
||||
have caveats for some workloads.
|
||||
[Memory-Mapping](https://en.wikipedia.org/wiki/Memory-mapped_file), and doesn't use
|
||||
[WAL](https://en.wikipedia.org/wiki/Write-ahead_logging), which
|
||||
might be a caveat for some workloads.
|
||||
|
||||
### Comparison with other DBs
|
||||
Because _libmdbx_ is currently overhauled, I think it's better to just
|
||||
link [chapter of Comparison with other
|
||||
databases](https://github.com/coreos/bbolt#comparison-with-other-databases)
|
||||
here.
|
||||
For now please refer to [chapter of "BoltDB comparison with other
|
||||
databases"](https://github.com/coreos/bbolt#comparison-with-other-databases)
|
||||
which is also (mostly) applicable to MDBX.
|
||||
|
||||
### History
|
||||
The _libmdbx_ design is based on [Lightning Memory-Mapped
|
||||
Database](https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database).
|
||||
Initial development was going in
|
||||
[ReOpenLDAP](https://github.com/leo-yuriev/ReOpenLDAP) project, about a
|
||||
year later it received separate development effort and in autumn 2015
|
||||
was isolated to separate project, which was [presented at Highload++
|
||||
Initial development took place within the [ReOpenLDAP](https://github.com/leo-yuriev/ReOpenLDAP) project.
|
||||
About a year later _libmdbx_ was split off into a separate project, which was [presented at Highload++
|
||||
2015 conference](http://www.highload.ru/2015/abstracts/1831.html).
|
||||
|
||||
Since early 2017 _libmdbx_ is used in [Fast PositiveTables](https://github.com/leo-yuriev/libfpta),
|
||||
by [Positive Technologies](https://www.ptsecurity.com).
|
||||
Since early 2017 _libmdbx_ has been used in [Fast Positive Tables](https://github.com/leo-yuriev/libfpta),
|
||||
and development is funded by [Positive Technologies](https://www.ptsecurity.com).
|
||||
|
||||
#### Acknowledgments
|
||||
Howard Chu (Symas Corporation) - the author of LMDB, from which
|
||||
@ -143,10 +132,10 @@ don't use [atomic
|
||||
operations](https://en.wikipedia.org/wiki/Linearizability#High-level_atomic_operations).
|
||||
Readers don't block each other and aren't blocked by writers. Read
|
||||
performance scales linearly with CPU core count.
|
||||
> Though "connect to DB" (start of first read transaction in thread) and
|
||||
> Nonetheless, "connect to DB" (start of first read transaction in thread) and
|
||||
> "disconnect from DB" (shutdown or thread termination) require
> acquiring a lock to register/unregister the current thread in the "readers
|
||||
> table"
|
||||
> table".
|
||||
|
||||
5. Keys with multiple values are stored efficiently without key
|
||||
duplication, sorted by value, including integers (reasonable for
|
||||
@ -201,7 +190,7 @@ optimal query execution plan.
|
||||
6. Support for keys and values of zero length, including sorted
|
||||
duplicates.
|
||||
|
||||
7. Ability to assign up to 3 markers to committing transaction with
|
||||
7. Ability to assign up to 3 persistent 64-bit markers to a committing transaction with
|
||||
`mdbx_canary_put()` and then get them in read transaction by
|
||||
`mdbx_canary_get()`.
|
||||
|
||||
@ -346,7 +335,7 @@ performance bottleneck in `MAPASYNC` mode.
|
||||
> storage then it's much more preferable to use `std::map`.
|
||||
|
||||
|
||||
4. LMDB has a problem of long-time readers which degrades performance
|
||||
4. _LMDB_ has a problem of long-time readers which degrades performance
|
||||
and bloats DB.
|
||||
> _libmdbx_ addresses that, details below.
|
||||
|
||||
@ -357,56 +346,41 @@ of data.
|
||||
> Details below.
|
||||
|
||||
|
||||
#### Long-time read transactions problem
|
||||
#### Problem of long-time reading
|
||||
Garbage collection problem exists in all databases one way or another
|
||||
(e.g. VACUUM in PostgreSQL). But in _libmdbx_ and LMDB it's even more
|
||||
important because of high performance and deliberate simplification of
|
||||
internals with emphasis on performance.
|
||||
discernible because of the high transaction rate and the intentional
simplification of internals in favor of performance.
|
||||
|
||||
* Altering data during long read operation may exhaust available space
|
||||
on persistent storage.
|
||||
A full explanation of the problem takes some effort to follow,
so it is reasonable to simplify it as follows:
|
||||
|
||||
* If available space is exhausted then any attempt to update data
|
||||
results in `MAP_FULL` error until long read operation ends.
|
||||
* Massive altering of data during a parallel long read operation may
|
||||
exhaust the free DB space.
|
||||
|
||||
* Main examples of long readers is hot backup and debugging of client
|
||||
application which actively uses read transactions.
|
||||
* If the available space is exhausted, any attempt to update the data
|
||||
will cause a `MAP_FULL` error until the long read transaction is completed.
|
||||
|
||||
* A good example of long readers is a hot backup or debugging of
|
||||
a client application while retaining an active read transaction.
|
||||
|
||||
* In _LMDB_ this results in degraded performance of all operations of
|
||||
syncing data to persistent storage.
|
||||
writing data to persistent storage.
|
||||
|
||||
* _libmdbx_ has a mechanism which aborts such operations and `LIFO RECLAIM`
|
||||
mode which addresses performance degradation.
|
||||
* _libmdbx_ has the `OOM-KICK` mechanism, which allows aborting such
|
||||
operations and the `LIFO RECLAIM` mode which addresses performance
|
||||
degradation.
|
||||
|
||||
Read operations work only on a snapshot of the DB that is consistent as of
the moment the read transaction started. This snapshot doesn't change
throughout the transaction, but this makes it impossible to reclaim the
pages until the read transaction ends.
|
||||
|
||||
In _LMDB_ this means that memory pages allocated for operations during
a long read keep being used and won't be reclaimed until the DB process
terminates. _LMDB_ uses them in a
[FIFO](https://en.wikipedia.org/wiki/FIFO_(computing_and_electronics))
manner, which increases the page count and lowers the chance of a cache hit
during I/O. In other words, one long-time reader can degrade the performance
of the whole database until it is reopened.
|
||||
|
||||
_libmdbx_ addresses the problem, details below. Illustrations of this
problem can be found in the
[presentation](http://www.slideshare.net/leoyuriev/lmdb). There is also
an example of the performance increase thanks to
[BBWC](https://en.wikipedia.org/wiki/Disk_buffer#Write_acceleration)
when `LIFO RECLAIM` is enabled in _libmdbx_.
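The reclamation rule described above can be sketched as a toy model in C. This is only an illustration under stated assumptions, not _libmdbx_ or _LMDB_ code: the `reader_table`, `retired_batch`, `oldest_snapshot()` and `reclaimable_pages()` names are hypothetical, and the model assumes that pages replaced by write transaction `T` are still needed by every reader whose snapshot is older than `T`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model (hypothetical names, not the library's internals). */
#define MAX_READERS 8

typedef struct {
  uint64_t snapshot_txnid[MAX_READERS]; /* 0 means the slot is free */
} reader_table;

typedef struct {
  uint64_t retired_by; /* id of the write transaction that replaced the pages */
  size_t page_count;   /* number of pages waiting for reuse */
} retired_batch;

/* Oldest snapshot still held by a reader, or `last_committed` if none. */
static uint64_t oldest_snapshot(const reader_table *rt,
                                uint64_t last_committed) {
  uint64_t oldest = last_committed;
  for (size_t i = 0; i < MAX_READERS; ++i)
    if (rt->snapshot_txnid[i] != 0 && rt->snapshot_txnid[i] < oldest)
      oldest = rt->snapshot_txnid[i];
  return oldest;
}

/* Counts pages that no active reader can still see and that a writer
 * may therefore reuse. A long-lived reader keeps this number low. */
static size_t reclaimable_pages(const retired_batch *batches, size_t n,
                                const reader_table *rt,
                                uint64_t last_committed) {
  assert(batches != NULL || n == 0);
  const uint64_t oldest = oldest_snapshot(rt, last_committed);
  size_t total = 0;
  for (size_t i = 0; i < n; ++i)
    if (batches[i].retired_by <= oldest)
      total += batches[i].page_count;
  return total;
}
```

In this model, with batches retired by transactions 5, 6 and 7 and a single reader holding snapshot 5, only the first batch can be reused; once that reader finishes, all three become reclaimable.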
|
||||
|
||||
#### Data safety in async-write mode
|
||||
In `WRITEMAP+MAPASYNC` mode dirty pages are written to persistent storage
|
||||
by kernel. This means that in case of application crash OS kernel will
|
||||
write all dirty data to disk and nothing will be lost. But in case of
|
||||
hardware malfunction or OS kernel fatal error only some dirty data might
|
||||
be synced to disk, and there is high probability that pages with
|
||||
metadata saved, will point to non-saved, hence non-existent, data pages.
|
||||
In such situation, DB is completely corrupted and can't be repaired even
|
||||
if there was full sync before the crash via `mdbx_env_sync()`.
|
||||
#### Durability in asynchronous writing mode
|
||||
In `WRITEMAP+MAPASYNC` mode updated (aka dirty) pages are written
|
||||
to persistent storage by the OS kernel. This means that if the
|
||||
application fails, the OS kernel will finish writing all updated
|
||||
data to disk and nothing will be lost.
|
||||
However, in the case of hardware malfunction or OS kernel fatal error,
|
||||
only some of the updated data may be written to disk, and the database structure
|
||||
is likely to be destroyed.
|
||||
In such a situation, the DB is completely corrupted and can't be repaired.
|
||||
|
||||
_libmdbx_ addresses this by fully reimplementing the write path of data:
|
||||
|
||||
@ -414,39 +388,38 @@ _libmdbx_ addresses this by fully reimplementing write path of data:
|
||||
instead their shadow copies are used and their updates are synced after
|
||||
data is flushed to disk.
|
||||
|
||||
* During transaction commit _libmdbx_ marks synchronization points as
|
||||
steady or weak depending on how much synchronization needed between RAM
|
||||
and persistent storage, e.g. in `WRITEMAP+MAPASYNC` committed transactions
|
||||
are marked as weak, but during explicit data synchronization - as
|
||||
steady.
|
||||
* During transaction commit _libmdbx_ marks it as steady or weak
depending on the synchronization status between RAM and persistent storage.
For instance, in the `WRITEMAP+MAPASYNC` mode committed transactions
are marked as weak by default, but as steady after explicit data flushes.
|
||||
|
||||
* _libmdbx_ maintains three separate meta-pages instead of two. This
|
||||
allows to commit transaction with steady or weak synchronization point
|
||||
without losing two previous synchronization points (one of them can be
|
||||
steady, and second - weak). This allows to order weak and steady
|
||||
synchronization points in any order without losing consistency in case
|
||||
of system crash.
|
||||
allows committing a transaction as steady or weak without losing the two
previous commit points (one of them can be steady and the other
weak). Thus, after a fatal system failure, it will be possible to
roll back to the last steady commit point.
|
||||
|
||||
* During DB open _libmdbx_ rollbacks to the last steady synchronization
|
||||
point, this guarantees database integrity.
|
||||
* During DB open _libmdbx_ rolls back to the last steady commit point,
which guarantees database integrity after a crash. However, if the
database is opened in read-only mode, such a rollback cannot be performed,
and the `MDBX_WANNA_RECOVERY` error is returned.
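The steady/weak bookkeeping in the list above can be illustrated with a small model in C. It is a sketch under assumptions rather than the actual _libmdbx_ implementation: the `meta_page`, `meta_ring`, `latest_meta()` and `meta_commit()` names are hypothetical, and the slot-selection policy (never overwrite the newest commit point or the newest steady one) is an assumed reading of the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model (hypothetical names): three meta-page slots, each holding
 * one commit point that is either steady (flushed) or weak. */
typedef struct {
  uint64_t txnid; /* 0 means the slot was never written */
  bool steady;
} meta_page;

typedef struct {
  meta_page slot[3];
} meta_ring;

/* Index of the newest commit point, optionally restricted to steady
 * ones; returns -1 when no matching slot exists. */
static int latest_meta(const meta_ring *r, bool steady_only) {
  int best = -1;
  for (int i = 0; i < 3; ++i) {
    const meta_page *m = &r->slot[i];
    if (m->txnid == 0 || (steady_only && !m->steady))
      continue;
    if (best < 0 || m->txnid > r->slot[best].txnid)
      best = i;
  }
  return best;
}

/* Records a new commit point. The newest slot and the newest steady slot
 * are preserved, so with three slots a victim always exists. */
static int meta_commit(meta_ring *r, uint64_t txnid, bool steady) {
  assert(txnid > 0);
  const int keep_newest = latest_meta(r, false);
  const int keep_steady = latest_meta(r, true);
  int victim = -1;
  for (int i = 0; i < 3; ++i) {
    if (i == keep_newest || i == keep_steady)
      continue;
    if (victim < 0 || r->slot[i].txnid < r->slot[victim].txnid)
      victim = i;
  }
  r->slot[victim].txnid = txnid;
  r->slot[victim].steady = steady;
  return victim;
}
```

After one steady commit followed by any number of weak ones, `latest_meta(&ring, true)` still finds the steady commit point, which is what a rollback on open would target in this model.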
|
||||
|
||||
For data safety pages which form database snapshot with steady
|
||||
synchronization point must not be updated until next steady
|
||||
synchronization point. So last steady synchronization point creates
|
||||
"long-time read" effect. The only difference that in case of memory
|
||||
exhaustion the problem will be immediately addressed by flushing changes
|
||||
to persistent storage and forming new steady synchronization point.
|
||||
For data integrity, the pages that form the database snapshot of a steady
commit point must not be updated until the next steady commit point.
Therefore the last steady commit point creates an effect analogous to a "long-time read".
The only difference is that now, in case of space exhaustion, the problem
will be addressed immediately by writing changes to disk and forming
a new steady commit point.
|
||||
|
||||
So in async-write mode _libmdbx_ will always use new pages until memory
|
||||
is exhausted or `mdbx_env_sync()` is invoked. Total disk usage will be
|
||||
almost the same as in sync-write mode.
|
||||
So in async-write mode _libmdbx_ will always use new pages until the
free DB space is exhausted or `mdbx_env_sync()` is invoked,
and the total write traffic to the disk will be the same as in sync-write mode.
|
||||
|
||||
Current _libmdbx_ gives a choice of safe async-write mode (default) and
|
||||
`UTTERLY_NOSYNC` mode which may result in full DB corruption during
|
||||
system crash as with LMDB.
|
||||
Currently _libmdbx_ gives a choice between a safe async-write mode (default) and
`UTTERLY_NOSYNC` mode, which may lead to DB corruption after a system crash, as with LMDB.
|
||||
|
||||
Next version of _libmdbx_ will create steady synchronization points
|
||||
automatically in async-write mode.
|
||||
The next version of _libmdbx_ will automatically create steady commit
points in async-write mode upon completion of data transfer to the disk.
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
|