mirror of
				https://github.com/isar/libmdbx.git
				synced 2025-10-31 03:29:01 +08:00 
			
		
		
		
	mdbx: fix english README.
		
							
								
								
									
										294
									
								
								README.md
									
									
									
									
									
								
							
							
						
						
									
							| @@ -9,9 +9,9 @@ libmdbx | |||||||
|  |  | ||||||
| ### Project Status | ### Project Status | ||||||
|  |  | ||||||
| **MDBX is under _active development_**, database format and API aren't stable  | **MDBX is under _active development_**, database format and API aren't stable | ||||||
| at least until 2018Q2. New version won't be backwards compatible. Main focus of the rework is to provide | at least until 2018Q2. New version won't be backwards compatible. | ||||||
| clear and robust API and new features. | Main focus of the rework is to provide clear and robust API and new features. | ||||||
|  |  | ||||||
| ## Contents | ## Contents | ||||||
|  |  | ||||||
| @@ -19,8 +19,8 @@ clear and robust API and new features. | |||||||
|   - [Comparison with other DBs](#comparison-with-other-dbs) |   - [Comparison with other DBs](#comparison-with-other-dbs) | ||||||
|   - [History & Acknowledgements](#history) |   - [History & Acknowledgements](#history) | ||||||
| - [Main features](#main-features) | - [Main features](#main-features) | ||||||
| - [Perfomance comparison](#perfomance-comparison) | - [Performance comparison](#performance-comparison) | ||||||
|   - [Integral perfomance](#integral-perfomance) |   - [Integral performance](#integral-performance) | ||||||
|   - [Read scalability](#read-scalability) |   - [Read scalability](#read-scalability) | ||||||
|   - [Sync-write mode](#sync-write-mode) |   - [Sync-write mode](#sync-write-mode) | ||||||
|   - [Lazy-write mode](#lazy-write-mode) |   - [Lazy-write mode](#lazy-write-mode) | ||||||
| @@ -34,17 +34,17 @@ clear and robust API and new features. | |||||||
|  |  | ||||||
| ## Overview | ## Overview | ||||||
|  |  | ||||||
| _libmdbx_ is an embedded lightweight key-value database engine oriented for perfomance. | _libmdbx_ is an embedded lightweight key-value database engine oriented for performance. | ||||||
|  |  | ||||||
| _libmdbx_ allows multiple processes to read and update several key-value tables concurrently,  | _libmdbx_ allows multiple processes to read and update several key-value tables concurrently, | ||||||
| while being [ACID](https://en.wikipedia.org/wiki/ACID)-compliant, with minimal overhead and operation cost of O(log N). | while being [ACID](https://en.wikipedia.org/wiki/ACID)-compliant, with minimal overhead and operation cost of O(log N). | ||||||
|  |  | ||||||
| _libmdbx_ provides | _libmdbx_ provides | ||||||
| [serializability](https://en.wikipedia.org/wiki/Serializability) and consistency of data after crash. | [serializability](https://en.wikipedia.org/wiki/Serializability) and consistency of data after crash. | ||||||
| Read-write transactions don't block read-only transactions and are  | Read-write transactions don't block read-only transactions and are | ||||||
| [serialized](https://en.wikipedia.org/wiki/Serializability) by [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion). | [serialized](https://en.wikipedia.org/wiki/Serializability) by [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion). | ||||||
|  |  | ||||||
| _libmdbx_ [wait-free](https://en.wikipedia.org/wiki/Non-blocking_algorithm#Wait-freedom) provides parallel read transactions  | _libmdbx_ [wait-free](https://en.wikipedia.org/wiki/Non-blocking_algorithm#Wait-freedom) provides parallel read transactions | ||||||
| without atomic operations or synchronization primitives. | without atomic operations or synchronization primitives. | ||||||
|  |  | ||||||
| _libmdbx_ uses [B+Trees](https://en.wikipedia.org/wiki/B%2B_tree) and [mmap](https://en.wikipedia.org/wiki/Memory-mapped_file), | _libmdbx_ uses [B+Trees](https://en.wikipedia.org/wiki/B%2B_tree) and [mmap](https://en.wikipedia.org/wiki/Memory-mapped_file), | ||||||
| @@ -52,18 +52,18 @@ doesn't use [WAL](https://en.wikipedia.org/wiki/Write-ahead_logging). This might | |||||||
|  |  | ||||||
| ### Comparison with other DBs | ### Comparison with other DBs | ||||||
|  |  | ||||||
| Because  _libmdbx_ is currently overhauled, I think it's better to just link  | Because  _libmdbx_ is currently overhauled, I think it's better to just link | ||||||
| [chapter of Comparison with other databases](https://github.com/coreos/bbolt#comparison-with-other-databases) here. | [chapter of Comparison with other databases](https://github.com/coreos/bbolt#comparison-with-other-databases) here. | ||||||
|  |  | ||||||
| ### History | ### History | ||||||
|  |  | ||||||
| _libmdbx_ design is based on [Lightning Memory-Mapped Database](https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database).  | _libmdbx_ design is based on [Lightning Memory-Mapped Database](https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database). | ||||||
| Initial development was going in [ReOpenLDAP](https://github.com/leo-yuriev/ReOpenLDAP) project, about a year later it  | Initial development was going in [ReOpenLDAP](https://github.com/leo-yuriev/ReOpenLDAP) project, about a year later it | ||||||
| received separate development effort and in autumn 2015 was isolated to separate project, which was  | received separate development effort and in autumn 2015 was isolated to separate project, which was | ||||||
| [presented at Highload++ 2015 conference](http://www.highload.ru/2015/abstracts/1831.html). | [presented at Highload++ 2015 conference](http://www.highload.ru/2015/abstracts/1831.html). | ||||||
|  |  | ||||||
| Since early 2017 _libmdbx_ is used in [Fast Positive Tables](https://github.com/leo-yuriev/libfpta),  | Since early 2017 _libmdbx_ is used in [Fast Positive Tables](https://github.com/leo-yuriev/libfpta), | ||||||
| by [Positive Technologies](https://www.ptsecurity.ru). | by [Positive Technologies](https://www.ptsecurity.com). | ||||||
|  |  | ||||||
| #### Acknowledgements | #### Acknowledgements | ||||||
|  |  | ||||||
| @@ -75,56 +75,55 @@ which was used for begin development of LMDB. | |||||||
|  |  | ||||||
|  |  | ||||||
| Main features | Main features | ||||||
| ================= | ============= | ||||||
|  |  | ||||||
| _libmdbx_ inherits all key features and characteristics from  | _libmdbx_ inherits all key features and characteristics from | ||||||
| [LMDB](https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database): | [LMDB](https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database): | ||||||
|  |  | ||||||
| 1. Data is stored in ordered map, keys are always sorted, range lookups are supported. | 1. Data is stored in ordered map, keys are always sorted, range lookups are supported. | ||||||
|  |  | ||||||
| 2. Data is [mmaped](https://en.wikipedia.org/wiki/Memory-mapped_file) to memory of each worker DB process, read transactions are zero-copy  | 2. Data is [mmaped](https://en.wikipedia.org/wiki/Memory-mapped_file) to memory of each worker DB process, read transactions are zero-copy. | ||||||
|  |  | ||||||
| 3. Transactions are [ACID](https://en.wikipedia.org/wiki/ACID)-compliant, thanks to  | 3. Transactions are [ACID](https://en.wikipedia.org/wiki/ACID)-compliant, thanks to | ||||||
|    [MVCC](https://en.wikipedia.org/wiki/Multiversion_concurrency_control) and [CoW](https://en.wikipedia.org/wiki/Copy-on-write). |    [MVCC](https://en.wikipedia.org/wiki/Multiversion_concurrency_control) and [CoW](https://en.wikipedia.org/wiki/Copy-on-write). | ||||||
|    Writes are strongly serialized and aren't blocked by reads, transactions can't conflict with each other.  |    Writes are strongly serialized and aren't blocked by reads, transactions can't conflict with each other. | ||||||
|    Reads are guaranteed to get only committed data  |    Reads are guaranteed to get only committed data | ||||||
|    ([relaxing serializability](https://en.wikipedia.org/wiki/Serializability#Relaxing_serializability)). |    ([relaxing serializability](https://en.wikipedia.org/wiki/Serializability#Relaxing_serializability)). | ||||||
|  |  | ||||||
| 4. Reads and queries are [non-blocking](https://en.wikipedia.org/wiki/Non-blocking_algorithm),  | 4. Reads and queries are [non-blocking](https://en.wikipedia.org/wiki/Non-blocking_algorithm), | ||||||
|    don't use [atomic operations](https://en.wikipedia.org/wiki/Linearizability#High-level_atomic_operations).  |    don't use [atomic operations](https://en.wikipedia.org/wiki/Linearizability#High-level_atomic_operations). | ||||||
|    Readers don't block each other and aren't blocked by writers. Read perfomance scales linearly with CPU core count. |    Readers don't block each other and aren't blocked by writers. Read performance scales linearly with CPU core count. | ||||||
|    > Though "connect to DB" (start of first read transaction in thread) and "disconnect from DB" (shutdown or thread  |    > Though "connect to DB" (start of first read transaction in thread) and "disconnect from DB" (shutdown or thread | ||||||
|    > termination) requires to acquire a lock to register/unregister current thread from "readers table" |    > termination) requires to acquire a lock to register/unregister current thread from "readers table" | ||||||
|  |  | ||||||
| 5. Keys with multiple values are stored efficiently without key duplication, sorted by value, including intereger  | 5. Keys with multiple values are stored efficiently without key duplication, sorted by value, including integers | ||||||
|    (for secondary indexes). |    (reasonable for secondary indexes). | ||||||
|  |  | ||||||
| 6. Efficient operation on short fixed length keys, including integer ones. | 6. Efficient operation on short fixed length keys, including integer ones. | ||||||
|  |  | ||||||
| 7. [WAF](https://en.wikipedia.org/wiki/Write_amplification) (Write Amplification Factor) and RAF (Read Amplification Factor)  | 7. [WAF](https://en.wikipedia.org/wiki/Write_amplification) (Write Amplification Factor) and RAF (Read Amplification Factor) | ||||||
|    are O(log N). |    are O(log N). | ||||||
|  |  | ||||||
| 8. No [WAL](https://en.wikipedia.org/wiki/Write-ahead_logging) and transaction journal.  | 8. No [WAL](https://en.wikipedia.org/wiki/Write-ahead_logging) and transaction journal. | ||||||
|    In case of a crash no recovery needed. No need for regular maintenance. Backups can be made on the fly on working DB  |    In case of a crash no recovery needed. No need for regular maintenance. Backups can be made on the fly on working DB | ||||||
|    without freezing writers. |    without freezing writers. | ||||||
|  |  | ||||||
| 9. No custom memory management, all done with standard OS syscalls. | 9. No custom memory management, all done with standard OS syscalls. | ||||||
|  |  | ||||||
|  |  | ||||||
| Perfomance comparison | Performance comparison | ||||||
| ===================== | ===================== | ||||||
|  |  | ||||||
| All benchmark were done by multiple test runs on Lenovo Carbon-2 laptop, i7-4600U 2.1 ГГц, 8 Гб ОЗУ, SSD | All benchmarks were done by [IOArena](https://github.com/pmwkaa/ioarena) | ||||||
| SAMSUNG MZNTD512HAGL-000L1 (DXT23L0Q) 512 Gb. | and multiple [scripts](https://github.com/pmwkaa/ioarena/tree/HL%2B%2B2015) | ||||||
|  | runs on Lenovo Carbon-2 laptop, i7-4600U 2.1 GHz, 8 Gb RAM, | ||||||
| Benchmark: [_IOArena_](https://github.com/pmwkaa/ioarena)   | SSD SAMSUNG MZNTD512HAGL-000L1 (DXT23L0Q) 512 Gb. | ||||||
| [test scripts](https://github.com/pmwkaa/ioarena/tree/HL%2B%2B2015). |  | ||||||
|  |  | ||||||
| -------------------------------------------------------------------------------- | -------------------------------------------------------------------------------- | ||||||
|  |  | ||||||
| ### Integral perfomance | ### Integral performance | ||||||
|  |  | ||||||
| Here showed sum of perfomance metrics in 3 benchmarks: | Here showed sum of performance metrics in 3 benchmarks: | ||||||
|  |  | ||||||
|    - Read/Search on 4 CPU cores machine; |    - Read/Search on 4 CPU cores machine; | ||||||
|  |  | ||||||
| @@ -132,14 +131,14 @@ Here showed sum of perfomance metrics in 3 benchmarks: | |||||||
|      in sync-write mode (fdatasync is called after each transaction); |      in sync-write mode (fdatasync is called after each transaction); | ||||||
|  |  | ||||||
|    - Transactions with [CRUD](https://en.wikipedia.org/wiki/CRUD) operations |    - Transactions with [CRUD](https://en.wikipedia.org/wiki/CRUD) operations | ||||||
|      in lazy-write mode (moment to sync data to persistent storage is decided by OS); |      in lazy-write mode (moment to sync data to persistent storage is decided by OS). | ||||||
|  |  | ||||||
| *Reasons why asynchronous mode isn't benchmarked here:* | *Reasons why asynchronous mode isn't benchmarked here:* | ||||||
|  |  | ||||||
|   1. It doesn't make sense as it has to be done with DB engines, oriented for keeping data in memory e.g.  |   1. It doesn't make sense as it has to be done with DB engines, oriented for keeping data in memory e.g. | ||||||
|      [Tarantool](https://tarantool.io/), [Redis](https://redis.io/), etc. |      [Tarantool](https://tarantool.io/), [Redis](https://redis.io/), etc. | ||||||
|  |  | ||||||
|   2. perfomance gap is too high to compare in any meaningful way. |   2. Performance gap is too high to compare in any meaningful way. | ||||||
|  |  | ||||||
|  |  | ||||||
|  |  | ||||||
| @@ -147,7 +146,7 @@ Here showed sum of perfomance metrics in 3 benchmarks: | |||||||
|  |  | ||||||
| ### Read Scalability | ### Read Scalability | ||||||
|  |  | ||||||
| Summary perfomance with concurrent read/search queries in 1-2-4-8 threads on 4 CPU cores machine. | Summary performance with concurrent read/search queries in 1-2-4-8 threads on 4 CPU cores machine. | ||||||
|  |  | ||||||
|  |  | ||||||
|  |  | ||||||
| @@ -155,14 +154,14 @@ Summary perfomance with concurrent read/search queries in 1-2-4-8 threads on 4 C | |||||||
|  |  | ||||||
| ### Sync-write mode | ### Sync-write mode | ||||||
|  |  | ||||||
|  - Linear scale on left and dark rectangles mean arithmetic mean transactions per second |  - Linear scale on left and dark rectangles mean arithmetic mean transactions per second; | ||||||
|  |  | ||||||
|  - Logarithmic scale on right is in seconds and yellow intervals mean execution time of transactions.  |  - Logarithmic scale on right is in seconds and yellow intervals mean execution time of transactions. | ||||||
|    Each interval shows minimal and maximum execution time, cross marks standart deviation. |    Each interval shows minimal and maximum execution time, cross marks standard deviation. | ||||||
|  |  | ||||||
| **10,000 transactions in sync-write mode**. In case of a crash all data is consistent and state is right after last successful transaction. [fdatasync](https://linux.die.net/man/2/fdatasync) syscall is used after each write transaction in this mode. | **10,000 transactions in sync-write mode**. In case of a crash all data is consistent and state is right after last successful transaction. [fdatasync](https://linux.die.net/man/2/fdatasync) syscall is used after each write transaction in this mode. | ||||||
|  |  | ||||||
| In the benchmark each transaction contains combined CRUD operations (2 inserts, 1 read, 1 update, 1 delete).  | In the benchmark each transaction contains combined CRUD operations (2 inserts, 1 read, 1 update, 1 delete). | ||||||
| Benchmark starts on empty database and after full run the database contains 10,000 small key-value records. | Benchmark starts on empty database and after full run the database contains 10,000 small key-value records. | ||||||
|  |  | ||||||
|  |  | ||||||
| @@ -171,17 +170,17 @@ Benchmark starts on empty database and after full run the database contains 10,0 | |||||||
|  |  | ||||||
| ### Lazy-write mode | ### Lazy-write mode | ||||||
|  |  | ||||||
|  - Linear scale on left and dark rectangles mean arithmetic mean of thousands transactions per second |  - Linear scale on left and dark rectangles mean arithmetic mean of thousands transactions per second; | ||||||
|  |  | ||||||
|  - Logarithmic scale on right in seconds and yellow intervals mean execution time of transactions. Each interval shows minimal and maximum execution time, cross marks standart deviation.   |  - Logarithmic scale on right in seconds and yellow intervals mean execution time of transactions. Each interval shows minimal and maximum execution time, cross marks standard deviation. | ||||||
|  |  | ||||||
| **100,000 transactions in lazy-write mode**. | **100,000 transactions in lazy-write mode**. | ||||||
| In case of a crash all data is consistent and state is right after one of last transactions, but transactions after it  | In case of a crash all data is consistent and state is right after one of last transactions, but transactions after it | ||||||
| will be lost. Other DB engines use [WAL](https://en.wikipedia.org/wiki/Write-ahead_logging) or transaction journal for that,  | will be lost. Other DB engines use [WAL](https://en.wikipedia.org/wiki/Write-ahead_logging) or transaction journal for that, | ||||||
| which in turn depends on order of operations in journaled filesystem. _libmdbx_ doesn't use WAL and hands I/O operations  | which in turn depends on order of operations in journaled filesystem. _libmdbx_ doesn't use WAL and hands I/O operations | ||||||
| to filesystem and OS kernel (mmap). | to filesystem and OS kernel (mmap). | ||||||
|  |  | ||||||
| In the benchmark each transaction contains combined CRUD operations (2 inserts, 1 read, 1 update, 1 delete).  | In the benchmark each transaction contains combined CRUD operations (2 inserts, 1 read, 1 update, 1 delete). | ||||||
| Benchmark starts on empty database and after full run the database contains 100,000 small key-value records. | Benchmark starts on empty database and after full run the database contains 100,000 small key-value records. | ||||||
|  |  | ||||||
|  |  | ||||||
| @@ -191,13 +190,13 @@ Benchmark starts on empty database and after full run the database contains 100, | |||||||
|  |  | ||||||
| ### Async-write mode | ### Async-write mode | ||||||
|  |  | ||||||
|  - Linear scale on left and dark rectangles mean arithmetic mean of thousands transactions per second |  - Linear scale on left and dark rectangles mean arithmetic mean of thousands transactions per second; | ||||||
|  |  | ||||||
|  - Logarithmic scale on right in seconds and yellow intervals mean execution time of transactions. Each interval shows minimal and maximum execution time, cross marks standart deviation.   |  - Logarithmic scale on right in seconds and yellow intervals mean execution time of transactions. Each interval shows minimal and maximum execution time, cross marks standard deviation. | ||||||
|  |  | ||||||
| **1,000,000 transactions in async-write mode**. In case of a crash all data will be consistent and state will be right after one of last transactions, but lost transaction count is much higher than in lazy-write mode. All DB engines in this mode do as little writes as possible on persistent storage. _libmdbx_ uses [msync(MS_ASYNC)](https://linux.die.net/man/2/msync) in this mode. | **1,000,000 transactions in async-write mode**. In case of a crash all data will be consistent and state will be right after one of last transactions, but lost transaction count is much higher than in lazy-write mode. All DB engines in this mode do as little writes as possible on persistent storage. _libmdbx_ uses [msync(MS_ASYNC)](https://linux.die.net/man/2/msync) in this mode. | ||||||
|  |  | ||||||
| In the benchmark each transaction contains combined CRUD operations (2 inserts, 1 read, 1 update, 1 delete).  | In the benchmark each transaction contains combined CRUD operations (2 inserts, 1 read, 1 update, 1 delete). | ||||||
| Benchmark starts on empty database and after full run the database contains 10,000 small key-value records. | Benchmark starts on empty database and after full run the database contains 10,000 small key-value records. | ||||||
|  |  | ||||||
|  |  | ||||||
| @@ -208,12 +207,12 @@ Benchmark starts on empty database and after full run the database contains 10,0 | |||||||
|  |  | ||||||
| Summary of used resources during lazy-write mode benchmarks: | Summary of used resources during lazy-write mode benchmarks: | ||||||
|  |  | ||||||
|  - read and write IOPS |  - Read and write IOPS; | ||||||
|  |  | ||||||
|  - sum of user CPU time and sys CPU time |  - Sum of user CPU time and sys CPU time; | ||||||
|  |  | ||||||
|  - used space on persistent storage after the test and closed DB, but not waiting for the end of all internal  |  - Used space on persistent storage after the test and closed DB, but not waiting for the end of all internal | ||||||
|    housekeeping operations (LSM compactification, etc) |    housekeeping operations (LSM compactification, etc). | ||||||
|  |  | ||||||
| _ForestDB_ is excluded because the benchmark showed its consumption of each resource (CPU, IOPS) to be much higher than that of other engines, which prevents a meaningful comparison with them. | _ForestDB_ is excluded because the benchmark showed its consumption of each resource (CPU, IOPS) to be much higher than that of other engines, which prevents a meaningful comparison with them. | ||||||
|  |  | ||||||
| @@ -226,106 +225,105 @@ scanning data directory. | |||||||
|  |  | ||||||
| ## Gotchas | ## Gotchas | ||||||
|  |  | ||||||
| 1.  | 1. At one moment there can be only one writer. But this allows to serialize writes and eliminate any possibility | ||||||
|    At one moment there can be only one writer. But this allows to serialize writes and eliminate any possibility  |  | ||||||
|    of conflict or logical errors during transaction rollback. |    of conflict or logical errors during transaction rollback. | ||||||
|  |  | ||||||
| 2. No [WAL](https://en.wikipedia.org/wiki/Write-ahead_logging) means relatively  | 2. No [WAL](https://en.wikipedia.org/wiki/Write-ahead_logging) means relatively | ||||||
|    big [WAF](https://en.wikipedia.org/wiki/Write_amplification) (Write Amplification Factor).  |    big [WAF](https://en.wikipedia.org/wiki/Write_amplification) (Write Amplification Factor). | ||||||
|    Because of this syncing data to disk might be quite resource intensive and be main perfomance bottleneck  |    Because of this syncing data to disk might be quite resource intensive and be main performance bottleneck | ||||||
|    during intensive write workload. |    during intensive write workload. | ||||||
|    > As compromise _libmdbx_ allows several modes of lazy and/or periodic syncing, including `MAPASYNC` mode, which modificates |    > As compromise _libmdbx_ allows several modes of lazy and/or periodic syncing, including `MAPASYNC` mode, which modificate | ||||||
|    > data in memory and asynchronously syncs data to disc, moment to sync is picked by OS. |    > data in memory and asynchronously syncs data to disc, moment to sync is picked by OS. | ||||||
|    > |    > | ||||||
|    > Although this should be used with care, synchronous transactions in a DB with transaction journal will require 2 IOPS |    > Although this should be used with care, synchronous transactions in a DB with transaction journal will require 2 IOPS | ||||||
|    > minimum (probably 3-4 in practice) because of filesystem overhead, overhead depends on filesystem, not on record |    > minimum (probably 3-4 in practice) because of filesystem overhead, overhead depends on filesystem, not on record | ||||||
|    > count or record size. In _libmdbx_ IOPS count will grow logarithmically depending on record count in DB (height of B+ tree) |    > count or record size. In _libmdbx_ IOPS count will grow logarithmically depending on record count in DB (height of B+ tree) | ||||||
|    > and will require at least 2 IOPS per transaction too.  |    > and will require at least 2 IOPS per transaction too. | ||||||
|  |  | ||||||
| 3. [CoW](https://en.wikipedia.org/wiki/Copy-on-write) | 3. [CoW](https://en.wikipedia.org/wiki/Copy-on-write) | ||||||
|    for [MVCC](https://en.wikipedia.org/wiki/Multiversion_concurrency_control) is done on memory page level with [B+ |    for [MVCC](https://en.wikipedia.org/wiki/Multiversion_concurrency_control) is done on memory page level with [B+ | ||||||
|  trees](https://ru.wikipedia.org/wiki/B-%D0%B4%D0%B5%D1%80%D0%B5%D0%B2%D0%BE). |  trees](https://ru.wikipedia.org/wiki/B-%D0%B4%D0%B5%D1%80%D0%B5%D0%B2%D0%BE). | ||||||
|    Therefore altering data requires to copy about Olog(N) memory pages, which uses [memory bandwidth](https://en.wikipedia.org/wiki/Memory_bandwidth) and is main perfomance bottleneck in `MAPASYNC` mode. |    Therefore altering data requires to copy about Olog(N) memory pages, which uses [memory bandwidth](https://en.wikipedia.org/wiki/Memory_bandwidth) and is main performance bottleneck in `MAPASYNC` mode. | ||||||
|    > This is unavoidable, but isn't that bad. Syncing data to disk requires much more similiar operations which will  |    > This is unavoidable, but isn't that bad. Syncing data to disk requires much more similar operations which will | ||||||
|    > be done by OS, therefore this is noticeable only if data sync to persistent storage is fully disabled.  |    > be done by OS, therefore this is noticeable only if data sync to persistent storage is fully disabled. | ||||||
|    > _libmdbx_ allows to safely save data to persistent storage with minimal perfomance overhead. If there is no need  |    > _libmdbx_ allows to safely save data to persistent storage with minimal performance overhead. If there is no need | ||||||
|    > to save data to persistent storage then it's much more preferrable to use `std::map`. |    > to save data to persistent storage then it's much more preferable to use `std::map`. | ||||||
|  |  | ||||||
|  |  | ||||||
| 4. LMDB has a problem of long-time readers which degrades perfomance and bloats DB | 4. LMDB has a problem of long-time readers which degrades performance and bloats DB | ||||||
|    > _libmdbx_ addresses that, details below. |    > _libmdbx_ addresses that, details below. | ||||||
|  |  | ||||||
| 5. _LMDB_ is susceptible to DB corruption in `WRITEMAP+MAPASYNC` mode. | 5. _LMDB_ is susceptible to DB corruption in `WRITEMAP+MAPASYNC` mode. | ||||||
|    _libmdbx_ in `WRITEMAP+MAPASYNC` guarantees DB integrity and consistency of data. |    _libmdbx_ in `WRITEMAP+MAPASYNC` guarantees DB integrity and consistency of data. | ||||||
|    > Additionaly there is an alternative: `UTTERLY_NOSYNC` mode. Details below |    > Additionally there is an alternative: `UTTERLY_NOSYNC` mode. Details below. | ||||||
|  |  | ||||||
|  |  | ||||||
| #### Long-time read transactions problem | #### Long-time read transactions problem | ||||||
|  |  | ||||||
| Garbage collection problem exists in all databases one way or another (e.g. VACUUM in PostgreSQL).  | Garbage collection problem exists in all databases one way or another (e.g. VACUUM in PostgreSQL). | ||||||
| But in _libmdbx_ and LMDB it's even more important because of high perfomance and deliberate | But in _libmdbx_ and LMDB it's even more important because of high performance and deliberate | ||||||
| simplification of internals with emphasis on perfomance. | simplification of internals with emphasis on performance. | ||||||
|  |  | ||||||
| * Altering data during long read operation may exhaust available space on persistent storage | * Altering data during long read operation may exhaust available space on persistent storage. | ||||||
|  |  | ||||||
| * If available space is exhausted then any attempt to update data  | * If available space is exhausted then any attempt to update data | ||||||
|   results in `MAP_FULL` error until long read operation ends |   results in `MAP_FULL` error until long read operation ends. | ||||||
|  |  | ||||||
| * Main examples of long readers is hot backup  | * Main examples of long readers is hot backup | ||||||
|   and debugging of client application which actively uses read transactions |   and debugging of client application which actively uses read transactions. | ||||||
|  |  | ||||||
| * In _LMDB_ this results in degraded perfomace of all operations  | * In _LMDB_ this results in degraded performance of all operations | ||||||
|   of syncing data to persistent storage. |   of syncing data to persistent storage. | ||||||
|  |  | ||||||
| * _libmdbx_ has a mechanism which aborts such operations and `LIFO RECLAIM` | * _libmdbx_ has a mechanism which aborts such operations and `LIFO RECLAIM` | ||||||
|   mode which addresses perfomance degradation. |   mode which addresses performance degradation. | ||||||
|  |  | ||||||
| Read operations operate only over snapshot of DB which is consistent on the moment when read transaction started.  | Read operations operate only over snapshot of DB which is consistent on the moment when read transaction started. | ||||||
| This snapshot doesn't change throughout the transaction but this leads to inability to reclaim the pages until  | This snapshot doesn't change throughout the transaction but this leads to inability to reclaim the pages until | ||||||
| read transaction ends. | read transaction ends. | ||||||
|  |  | ||||||
| In _LMDB_ this leads to a problem that memory pages, allocated for operations during long read, will be used for operations | In _LMDB_ this leads to a problem that memory pages, allocated for operations during long read, will be used for operations | ||||||
| and won't be reclaimed until DB process terminates. In _LMDB_ they are used in  | and won't be reclaimed until DB process terminates. In _LMDB_ they are used in | ||||||
| [FIFO](https://en.wikipedia.org/wiki/FIFO_(computing_and_electronics)) manner, which causes increased page count  | [FIFO](https://en.wikipedia.org/wiki/FIFO_(computing_and_electronics)) manner, which causes increased page count | ||||||
| and less chance of cache hit during I/O. In other words: one long-time reader can impact perfomance of all database  | and less chance of cache hit during I/O. In other words: one long-time reader can impact performance of all database | ||||||
| until it'll be reopened. | until it'll be reopened. | ||||||
|  |  | ||||||
| _libmdbx_ addresses the problem, details below. Illustrations to this problem can be found in the  | _libmdbx_ addresses the problem, details below. Illustrations to this problem can be found in the | ||||||
| [presentation](http://www.slideshare.net/leoyuriev/lmdb). There is also example of perfomance increase thanks to  | [presentation](http://www.slideshare.net/leoyuriev/lmdb). There is also example of performance increase thanks to | ||||||
| [BBWC](https://en.wikipedia.org/wiki/Disk_buffer#Write_acceleration) when `LIFO RECLAIM` enabled in _libmdbx_. | [BBWC](https://en.wikipedia.org/wiki/Disk_buffer#Write_acceleration) when `LIFO RECLAIM` enabled in _libmdbx_. | ||||||
|  |  | ||||||
| #### Data safety in async-write mode | #### Data safety in async-write mode | ||||||
|  |  | ||||||
| In `WRITEMAP+MAPASYNC` mode dirty pages are written to persistent storage by kernel. This means that in case of application  | In `WRITEMAP+MAPASYNC` mode dirty pages are written to persistent storage by kernel. This means that in case of application | ||||||
| crash OS kernel will write all dirty data to disk and nothing will be lost. But in case of hardware malfunction or OS kernel  | crash OS kernel will write all dirty data to disk and nothing will be lost. But in case of hardware malfunction or OS kernel | ||||||
| fatal error only some dirty data might be synced to disk, and there is high probability that pages with metadata saved,  | fatal error only some dirty data might be synced to disk, and there is high probability that pages with metadata saved, | ||||||
| will point to non-saved, hence non-existent, data pages. In such situation DB is completely corrupted and can't be  | will point to non-saved, hence non-existent, data pages. In such situation DB is completely corrupted and can't be | ||||||
| repaired even if there was full sync before the crash via `mdbx_env_sync(). | repaired even if there was full sync before the crash via `mdbx_env_sync(). | ||||||
|  |  | ||||||
| _libmdbx_ addresses this by fully reimplementing write path of data: | _libmdbx_ addresses this by fully reimplementing write path of data: | ||||||
|  |  | ||||||
| * In `WRITEMAP+MAPSYNC` mode meta-data pages aren't updated in place, instead their shadow copies are used and their updates  | * In `WRITEMAP+MAPSYNC` mode meta-data pages aren't updated in place, instead their shadow copies are used and their updates | ||||||
|   are synced after data is flushed to disk. |   are synced after data is flushed to disk. | ||||||
|  |  | ||||||
| * During transaction commit _libmdbx_ marks synchronization points as steady or weak depending on how much synchronization  | * During transaction commit _libmdbx_ marks synchronization points as steady or weak depending on how much synchronization | ||||||
|   needed between RAM and persistent storage, e.g. in `WRITEMAP+MAPSYNC` commited transactions are marked as weak,  |   needed between RAM and persistent storage, e.g. in `WRITEMAP+MAPSYNC` commited transactions are marked as weak, | ||||||
|   but during explicit data synchronization - as steady. |   but during explicit data synchronization - as steady. | ||||||
|  |  | ||||||
| * _libmdbx_ maintains three separate meta-pages instead of two. This allows to commit transaction with steady or  | * _libmdbx_ maintains three separate meta-pages instead of two. This allows to commit transaction with steady or | ||||||
| weak synchronization point without losing two previous synchronization points (one of them can be steady, and second - weak).  | weak synchronization point without losing two previous synchronization points (one of them can be steady, and second - weak). | ||||||
| This allows to order weak and steady synchronization points in any order without losing consistency in case of system crash.  | This allows to order weak and steady synchronization points in any order without losing consistency in case of system crash. | ||||||
|  |  | ||||||
| * During DB open _libmdbx_ rollbacks to the last steady synchronization point, this guarantees database integrity. | * During DB open _libmdbx_ rollbacks to the last steady synchronization point, this guarantees database integrity. | ||||||
|  |  | ||||||
| For data safety pages which form database snapshot with steady synchronization point must not be updated until next steady  | For data safety pages which form database snapshot with steady synchronization point must not be updated until next steady | ||||||
| synchronization point. So last steady synchronization point creates "long-time read" effect. The only difference that in case  | synchronization point. So last steady synchronization point creates "long-time read" effect. The only difference that in case | ||||||
| of memory exhaustion the problem will be immediatly addressed by flushing changes to persistent storage and forming new steady  | of memory exhaustion the problem will be immediately addressed by flushing changes to persistent storage and forming new steady | ||||||
| synchronization point. | synchronization point. | ||||||
|  |  | ||||||
| So in async-write mode _libmdbx_ will always use new pages until memory is exhausted or `mdbx_env_sync()`is invoked. Total  | So in async-write mode _libmdbx_ will always use new pages until memory is exhausted or `mdbx_env_sync()`is invoked. Total | ||||||
| disk usage will be almost the same as in sync-write mode. | disk usage will be almost the same as in sync-write mode. | ||||||
|  |  | ||||||
| Current _libmdbx_ gives a choice of safe async-write mode (default) and `UTTERLY_NOSYNC` mode which may result in full DB  | Current _libmdbx_ gives a choice of safe async-write mode (default) and `UTTERLY_NOSYNC` mode which may result in full DB | ||||||
| corruption during system crash as with LMDB. | corruption during system crash as with LMDB. | ||||||
|  |  | ||||||
| Next version of _libmdbx_ will create steady synchronization points automatically in async-write mode. | Next version of _libmdbx_ will create steady synchronization points automatically in async-write mode. | ||||||
| @@ -337,13 +335,13 @@ Improvements over LMDB | |||||||
|  |  | ||||||
| 1. `LIFO RECLAIM` mode: | 1. `LIFO RECLAIM` mode: | ||||||
|  |  | ||||||
| 	The newest pages are picked for reuse instead of the oldest.  | 	The newest pages are picked for reuse instead of the oldest. | ||||||
| 	This allows to minimize reclaim loop and make it execution time independent from total page count. | 	This allows to minimize reclaim loop and make it execution time independent from total page count. | ||||||
|  |  | ||||||
| 	This results in OS kernel cache mechanisms working with maximum efficiency. | 	This results in OS kernel cache mechanisms working with maximum efficiency. | ||||||
| 	In case of using disc controllers or storages with  | 	In case of using disc controllers or storages with | ||||||
| 	[BBWC](https://en.wikipedia.org/wiki/Disk_buffer#Write_acceleration) this may greatly improve | 	[BBWC](https://en.wikipedia.org/wiki/Disk_buffer#Write_acceleration) this may greatly improve | ||||||
| 	write perfomance. | 	write performance. | ||||||
|  |  | ||||||
| 2. `OOM-KICK` callback. | 2. `OOM-KICK` callback. | ||||||
|  |  | ||||||
| @@ -358,88 +356,88 @@ Improvements over LMDB | |||||||
|  |  | ||||||
| 	* abort or restart offending read transaction if it's running in sibling thread; | 	* abort or restart offending read transaction if it's running in sibling thread; | ||||||
|  |  | ||||||
| 	* abort current write transaction with returning error code | 	* abort current write transaction with returning error code. | ||||||
|  |  | ||||||
| 3. Guarantee of DB integrity in `WRITEMAP+MAPSYNC` mode: | 3. Guarantee of DB integrity in `WRITEMAP+MAPSYNC` mode: | ||||||
|  |   > Current _libmdbx_ gives a choice of safe async-write mode (default) | ||||||
| Current _libmdbx_ gives a choice of safe async-write mode (default) and `UTTERLY_NOSYNC` mode which may result in full  |   > and `UTTERLY_NOSYNC` mode which may result in full | ||||||
| DB corruption during system crash as with LMDB. For details see  |   > DB corruption during system crash as with LMDB. For details see | ||||||
| [Data safety in async-write mode](#data-safety-in-async-write-mode) |   > [Data safety in async-write mode](#data-safety-in-async-write-mode). | ||||||
|  |  | ||||||
| 4. Automatic creation of synchronization points (flush changes to persistent storage) | 4. Automatic creation of synchronization points (flush changes to persistent storage) | ||||||
|    when changes reach set threshold (threshold can be set by `mdbx_env_set_syncbytes()`). |    when changes reach set threshold (threshold can be set by `mdbx_env_set_syncbytes()`). | ||||||
|  |  | ||||||
| 5. Ability to get how far current readonly snapshot is from latest version of the DB by `mdbx_txn_straggler()` | 5. Ability to get how far current read-only snapshot is from latest version of the DB by `mdbx_txn_straggler()`. | ||||||
|  |  | ||||||
| 6. mdbx_chk tool for DB checking and `mdbx_env_pgwalk()` for pagewalking all pages in DB | 6. `mdbx_chk` tool for DB checking and `mdbx_env_pgwalk()` for page-walking all pages in DB. | ||||||
|  |  | ||||||
| 7. Control over debugging and receiveing of debugging messages via `mdbx_setup_debug()` | 7. Control over debugging and receiving of debugging messages via `mdbx_setup_debug()`. | ||||||
|  |  | ||||||
| 8. Ability to assign up to 3 markers to commiting transaction with `mdbx_canary_put()` and then get them in read transaction  | 8. Ability to assign up to 3 markers to commiting transaction with `mdbx_canary_put()` and then get them in read transaction | ||||||
|    by `mdbx_canary_get()` |    by `mdbx_canary_get()`. | ||||||
|  |  | ||||||
| 9. Check if there is a row with data after current cursor position via `mdbx_cursor_eof()` | 9. Check if there is a row with data after current cursor position via `mdbx_cursor_eof()`. | ||||||
|  |  | ||||||
| 10. Ability to explicitly request update of current record without creating new record. Implemented as `MDBX_CURRENT` flag  | 10. Ability to explicitly request update of current record without creating new record. Implemented as `MDBX_CURRENT` flag | ||||||
|     for `mdbx_put()` |     for `mdbx_put()`. | ||||||
|  |  | ||||||
| 11. Ability to update or delete record and get previous value via `mdbx_replace()` Also can update specific multi-value. | 11. Ability to update or delete record and get previous value via `mdbx_replace()` Also can update specific multi-value. | ||||||
|  |  | ||||||
| 12. Support for keys and values of zero length, including sorted duplicates | 12. Support for keys and values of zero length, including sorted duplicates. | ||||||
|  |  | ||||||
| 13. Fixed `mdbx_cursor_count()`, which returns correct count of duplicated for all table types and any cursor position | 13. Fixed `mdbx_cursor_count()`, which returns correct count of duplicated for all table types and any cursor position. | ||||||
|  |  | ||||||
| 14. Ability to open DB in exclusive mode via `mdbx_env_open_ex()`, e.g. for integrity check | 14. Ability to open DB in exclusive mode via `mdbx_env_open_ex()`, e.g. for integrity check. | ||||||
|  |  | ||||||
| 15. Ability to close DB in "dirty" state (without data flush and creation of steady synchronization point)  | 15. Ability to close DB in "dirty" state (without data flush and creation of steady synchronization point) | ||||||
|     via `mdbx_env_close_ex()` |     via `mdbx_env_close_ex()`. | ||||||
|  |  | ||||||
| 16. Ability to get addition info, including number of the oldest snapshot of DB, which is used by one of the readers.  | 16. Ability to get addition info, including number of the oldest snapshot of DB, which is used by one of the readers. | ||||||
|     Implemented via `mdbx_env_info()` |     Implemented via `mdbx_env_info()`. | ||||||
|  |  | ||||||
| 17. `mdbx_del()` doesn't ignore additional argument (specifier) `data` | 17. `mdbx_del()` doesn't ignore additional argument (specifier) `data` | ||||||
|      for tables without duplicates (without flag `MDBX_DUPSORT`), if `data` is not zero then always uses it to verify  |      for tables without duplicates (without flag `MDBX_DUPSORT`), if `data` is not zero then always uses it to verify | ||||||
|      record, which is being deleted |      record, which is being deleted. | ||||||
|  |  | ||||||
| 18. Ability to open dbi-table with simultaneous setup of comparators for keys and values, via `mdbx_dbi_open_ex()` | 18. Ability to open dbi-table with simultaneous setup of comparators for keys and values, via `mdbx_dbi_open_ex()`. | ||||||
|  |  | ||||||
| 19. Ability to find out if key or value are in dirty page. This may be useful to make a decision to avoid | 19. Ability to find out if key or value are in dirty page. This may be useful to make a decision to avoid | ||||||
|     excessive CoW before updates. Implemented via `mdbx_is_dirty()` |     excessive CoW before updates. Implemented via `mdbx_is_dirty()`. | ||||||
|  |  | ||||||
| 20. Correct update of current recordi in `MDBX_CURRENT` mode of `mdbx_cursor_put()`, including sorted duplicated. | 20. Correct update of current record in `MDBX_CURRENT` mode of `mdbx_cursor_put()`, including sorted duplicated. | ||||||
|  |  | ||||||
| 21. All cursors in all read and write transactions can be reused by `mdbx_cursor_renew()` and MUST be freed explicitly. | 21. All cursors in all read and write transactions can be reused by `mdbx_cursor_renew()` and MUST be freed explicitly. | ||||||
|   > ## Caution |   > ## Caution, please pay attention! | ||||||
|   >  |   > | ||||||
|   > This is the only change of API, which changes semantics of cursor management  |   > This is the only change of API, which changes semantics of cursor management | ||||||
|   > and can lead to memory leaks on misuse. This is a needed change as it eliminates ambiguity  |   > and can lead to memory leaks on misuse. This is a needed change as it eliminates ambiguity | ||||||
|   > which helps to avoid such errors as: |   > which helps to avoid such errors as: | ||||||
|   >  - use-after-free; |   >  - use-after-free; | ||||||
|   >  - double-free; |   >  - double-free; | ||||||
|   >  - memory corruption and segfaults. |   >  - memory corruption and segfaults. | ||||||
|  |  | ||||||
| 22. Additional error code `MDBX_EMULTIVAL`, which is returned by `mdbx_put()` and | 22. Additional error code `MDBX_EMULTIVAL`, which is returned by `mdbx_put()` and | ||||||
|     `mdbx_replace()` in case os ambigous update or delete. |     `mdbx_replace()` in case is ambiguous update or delete. | ||||||
|  |  | ||||||
| 23. Ability to get value by key and duplicates count by `mdbx_get_ex()` | 23. Ability to get value by key and duplicates count by `mdbx_get_ex()` | ||||||
|  |  | ||||||
| 24. Functions `mdbx_cursor_on_first() and mdbx_cursor_on_last(), which allows to know if cursor is currently on first or  | 24. Functions `mdbx_cursor_on_first() and mdbx_cursor_on_last(), which allows to know if cursor is currently on first or | ||||||
|     last position respectevely |     last position respectively. | ||||||
|  |  | ||||||
| 25. If read transaction is aborted via `mdbx_txn_abort()` or `mdbx_txn_reset()` then DBI-handles, which were opened in it,  | 25. If read transaction is aborted via `mdbx_txn_abort()` or `mdbx_txn_reset()` then DBI-handles, which were opened in it, | ||||||
|     aren't closed or deleted. This allows to avoid several types of hard-to-debug errors. |     aren't closed or deleted. This allows to avoid several types of hard-to-debug errors. | ||||||
|  |  | ||||||
| 26. Sequence generation via `mdbx_dbi_sequence()`. | 26. Sequence generation via `mdbx_dbi_sequence()`. | ||||||
|  |  | ||||||
| 27. Advanced dynamic control over DB size, including ability to choose page size via `mdbx_env_set_geometry()`,  | 27. Advanced dynamic control over DB size, including ability to choose page size via `mdbx_env_set_geometry()`, | ||||||
|     including on Windows |     including on Windows. | ||||||
|  |  | ||||||
| 28. Three meta-pages instead two, this allows to guarantee consistently update weak synchronisation points without risking to  | 28. Three meta-pages instead two, this allows to guarantee consistently update weak sync-points without risking to | ||||||
|     corrupt last steady synchronisation point. |     corrupt last steady sync-point. | ||||||
|  |  | ||||||
| 29. Automatic reclaim of freed pages to specific reserved space in the end of database file. This lowers amount of pages,  | 29. Automatic reclaim of freed pages to specific reserved space in the end of database file. This lowers amount of pages, | ||||||
|     loaded to memory, used in update/flush loop. In fact _llibmdbx_ constantly perfoms compactification of data,  |     loaded to memory, used in update/flush loop. In fact _libmdbx_ constantly performs compactification of data, | ||||||
|     but doesn't use addition resources for that. Space reclaim of DB and setup of database geometry parameters also decreases  |     but doesn't use addition resources for that. Space reclaim of DB and setup of database geometry parameters also decreases | ||||||
|     size of the database on disk, including on Windows. |     size of the database on disk, including on Windows. | ||||||
|  |  | ||||||
| -------------------------------------------------------------------------------- | -------------------------------------------------------------------------------- | ||||||
|   | |||||||
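The "Data safety in async-write mode" text in the diff above describes three meta-pages, steady and weak synchronization points, and a rollback to the last steady point when the DB is opened. The following is a minimal sketch of that idea as a toy in-memory model. It is not libmdbx code: the names `meta_set`, `meta_commit` and `meta_recover`, and the "overwrite the oldest unprotected slot" rule, are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model; this is NOT libmdbx's real meta-page layout. */
enum sync_kind { SYNC_NONE = 0, SYNC_WEAK, SYNC_STEADY };

struct meta_page {
  uint64_t txnid;      /* 0 means the slot was never written */
  enum sync_kind kind; /* how strongly this point was synced */
};

struct meta_set {
  struct meta_page slot[3]; /* three meta-pages instead of two */
};

/* Index of the most recent steady sync point, or -1 if there is none. */
int newest_steady(const struct meta_set *m) {
  int best = -1;
  for (int i = 0; i < 3; i++) {
    if (m->slot[i].kind == SYNC_STEADY &&
        (best < 0 || m->slot[i].txnid > m->slot[best].txnid))
      best = i;
  }
  return best;
}

/* Record a commit. The newest steady slot is never overwritten;
 * of the two remaining slots, the one with the lower txnid is reused. */
void meta_commit(struct meta_set *m, uint64_t txnid, enum sync_kind kind) {
  assert(txnid > 0 && kind != SYNC_NONE);
  int keep = newest_steady(m);
  int victim = -1;
  for (int i = 0; i < 3; i++) {
    if (i == keep)
      continue;
    if (victim < 0 || m->slot[i].txnid < m->slot[victim].txnid)
      victim = i;
  }
  m->slot[victim].txnid = txnid;
  m->slot[victim].kind = kind;
}

/* Model of "DB open": roll back to the last steady sync point.
 * Returns its txnid, or 0 if no steady point was ever written. */
uint64_t meta_recover(const struct meta_set *m) {
  int i = newest_steady(m);
  return i < 0 ? 0 : m->slot[i].txnid;
}
```

In this model, one steady commit followed by any number of weak commits still recovers to the steady one, because the weak commits only rotate through the two other slots. A later steady commit then becomes the new recovery target, and the slot of the previous steady point becomes reusable.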