libmdbx/docs/_starting.md
Leonid Yuriev c8a0951566 mdbx: rework API and Docs around Handle-Slow-Readers (no algorithmic changes).
Change-Id: I5b76a8400ce6f5f241f8e4a7f53d746fe39f8e1e
2020-09-29 21:15:25 +03:00

243 lines
12 KiB
Markdown

Getting started {#starting}
===============
> This section is based on Bert Hubert's intro "LMDB Semantics", with
> edits reflecting the improvements and enhancements were made in MDBX.
> See Bert Hubert's [original](https://github.com/ahupowerdns/ahutils/blob/master/lmdb-semantics.md).
Everything starts with an environment, created by \ref mdbx_env_create().
Once created, this environment must also be opened with \ref mdbx_env_open(),
and after use be closed by \ref mdbx_env_close(). At that a non-zero value
of the last argument "mode" supposes MDBX will create database and directory
if ones does not exist. In this case the non-zero "mode" argument specifies
the file mode bits be applied when a new files are created by `open()` function.
Within that directory, a lock file (aka LCK-file) and a storage file (aka
DXB-file) will be generated. If you don't want to use a directory, you can
pass the \ref MDBX_NOSUBDIR option, in which case the path you provided is used
directly as the DXB-file, and another file with a "-lck" suffix added
will be used for the LCK-file.
Once the environment is open, a transaction can be created within it using
\ref mdbx_txn_begin(). Transactions may be read-write or read-only, and read-write
transactions may be nested. A transaction must only be used by one thread at
a time. Transactions are always required, even for read-only access. The
transaction provides a consistent view of the data.
Once a transaction has been created, a database (i.e. key-value space inside
the environment) can be opened within it using \ref mdbx_dbi_open(). If only one
database will ever be used in the environment, a `NULL` can be passed as the
database name. For named databases, the \ref MDBX_CREATE flag must be used to
create the database if it doesn't already exist. Also, \ref mdbx_env_set_maxdbs()
must be called after \ref mdbx_env_create() and before \ref mdbx_env_open() to set
the maximum number of named databases you want to support.
\note A single transaction can open multiple databases. Generally databases
should only be opened once, by the first transaction in the process.
Within a transaction, \ref mdbx_get() and \ref mdbx_put() can store single key-value
pairs if that is all you need to do (but see \ref Cursors below if you want to do
more).
A key-value pair is expressed as two \ref MDBX_val structures. This struct that is
exactly similar to POSIX's `struct iovec` and has two fields, `iov_len` and
`iov_base`. The data is a `void` pointer to an array of `iov_len` bytes.
\note The notable difference between MDBX and LMDB is that MDBX support zero
length keys.
Because MDBX is very efficient (and usually zero-copy), the data returned in
an \ref MDBX_val structure may be memory-mapped straight from disk. In other words
look but do not touch (or `free()` for that matter). Once a transaction is
closed, the values can no longer be used, so make a copy if you need to keep
them after that.
## Cursors {#Cursors}
To do more powerful things, we must use a cursor.
Within the transaction, a cursor can be created with \ref mdbx_cursor_open().
With this cursor we can store/retrieve/delete (multiple) values using
\ref mdbx_cursor_get(), \ref mdbx_cursor_put() and \ref mdbx_cursor_del().
The \ref mdbx_cursor_get() positions itself depending on the cursor operation
requested, and for some operations, on the supplied key. For example, to list
all key-value pairs in a database, use operation \ref MDBX_FIRST for the first
call to \ref mdbx_cursor_get(), and \ref MDBX_NEXT on subsequent calls, until
the end is hit.
To retrieve all keys starting from a specified key value, use \ref MDBX_SET. For
more cursor operations, see the \ref c_api reference.
When using \ref mdbx_cursor_put()\ref , either the function will position the cursor
for you based on the key, or you can use operation \ref MDBX_CURRENT to use the
current position of the cursor. \note Note that key must then match the current
position's key.
## Summarizing the opening
So we have a cursor in a transaction which opened a database in an
environment which is opened from a filesystem after it was separately
created.
Or, we create an environment, open it from a filesystem, create a transaction
within it, open a database within that transaction, and create a cursor
within all of the above.
Got it?
## Threads and processes
Do not have open an database twice in the same process at the same time, MDBX
will track and prevent this. Instead, share the MDBX environment that has
opened the file across all threads. The reason for this is:
- When the "Open file description" locks (aka OFD-locks) are not available,
MDBX uses POSIX locks on files, and these locks have issues if one process
opens a file multiple times.
- If a single process opens the same environment multiple times, closing it
once will remove all the locks held on it, and the other instances will be
vulnerable to corruption from other processes.
+ For compatibility with LMDB which allows multi-opening, MDBX can be
configured at runtime by \ref mdbx_setup_debug() with \ref MDBX_DBG_LEGACY_MULTIOPEN` option
prior to calling other MDBX functions. In this way MDBX will track
databases opening, detect multi-opening cases and then recover POSIX file
locks as necessary. However, lock recovery can cause unexpected pauses,
such as when another process opened the database in exclusive mode before
the lock was restored - we have to wait until such a process releases the
database, and so on.
Do not use opened MDBX environment(s) after `fork()` in a child process(es),
MDBX will check and prevent this at critical points. Instead, ensure there is
no open MDBX-instance(s) during fork(), or at least close it immediately after
`fork()` in the child process and reopen if required - for instance by using
`pthread_atfork()`. The reason for this is:
- For competitive consistent reading, MDBX assigns a slot in the shared
table for each process that interacts with the database. This slot is
populated with process attributes, including the PID.
- After `fork()`, in order to remain connected to a database, the child
process must have its own such "slot", which can't be assigned in any
simple and robust way another than the regular.
- A write transaction from a parent process cannot continue in a child
process for obvious reasons.
- Moreover, in a multithreaded process at the fork() moment any number of
threads could run in critical and/or intermediate sections of MDBX code
with interaction and/or racing conditions with threads from other
process(es). For instance: shrinking a database or copying it to a pipe,
opening or closing environment, beginning or finishing a transaction,
and so on.
= Therefore, any solution other than simply close database (and reopen if
necessary) in a child process would be both extreme complicated and so
fragile.
Do not start more than one transaction for a one thread. If you think about
this, it's really strange to do something with two data snapshots at once,
which may be different. MDBX checks and preventing this by returning
corresponding error code (\ref MDBX_TXN_OVERLAPPING, \ref MDBX_BAD_RSLOT, \ref MDBX_BUSY)
unless you using \ref MDBX_NOTLS option on the environment. Nonetheless, with the
\ref MDBX_NOTLS option, you must know exactly what you are doing, otherwise you
will get deadlocks or reading an alien data.
Also note that a transaction is tied to one thread by default using Thread
Local Storage. If you want to pass read-only transactions across threads,
you can use the \ref MDBX_NOTLS option on the environment. Nevertheless, a write
transaction entirely should only be used in one thread from start to finish.
MDBX checks this in a reasonable manner and return the \ref MDBX_THREAD_MISMATCH
error in rules violation.
## Transactions, rollbacks etc
To actually get anything done, a transaction must be committed using
\ref mdbx_txn_commit(). Alternatively, all of a transaction's operations
can be discarded using \ref mdbx_txn_abort().
\attention An important difference between MDBX and LMDB is that MDBX required
that any opened cursors can be reused and must be freed explicitly, regardless
ones was opened in a read-only or write transaction. The REASON for this is
eliminates ambiguity which helps to avoid errors such as: use-after-free,
double-free, i.e. memory corruption and segfaults.
For read-only transactions, obviously there is nothing to commit to storage.
\attention An another notable difference between MDBX and LMDB is that MDBX make
handles opened for existing databases immediately available for other
transactions, regardless this transaction will be aborted or reset. The
REASON for this is to avoiding the requirement for multiple opening a same
handles in concurrent read transactions, and tracking of such open but hidden
handles until the completion of read transactions which opened them.
In addition, as long as a transaction is open, a consistent view of the
database is kept alive, which requires storage. A read-only transaction that
no longer requires this consistent view should be terminated (committed or
aborted) when the view is no longer needed (but see below for an
optimization).
There can be multiple simultaneously active read-only transactions but only
one that can write. Once a single read-write transaction is opened, all
further attempts to begin one will block until the first one is committed or
aborted. This has no effect on read-only transactions, however, and they may
continue to be opened at any time.
## Duplicate keys aka Multi-values
\ref mdbx_get() and \ref mdbx_put() respectively have no and only some support or
multiple key-value pairs with identical keys. If there are multiple values
for a key, \ref mdbx_get() will only return the first value.
When multiple values for one key are required, pass the \ref MDBX_DUPSORT flag to
\ref mdbx_dbi_open(). In an \ref MDBX_DUPSORT database, by default \ref mdbx_put() will
not replace the value for a key if the key existed already. Instead it will add
the new value to the key. In addition, \ref mdbx_del() will pay attention to the
value field too, allowing for specific values of a key to be deleted.
Finally, additional cursor operations become available for traversing through
and retrieving duplicate values.
## Some optimization
If you frequently begin and abort read-only transactions, as an optimization,
it is possible to only reset and renew a transaction.
\ref mdbx_txn_reset() releases any old copies of data kept around for a read-only
transaction. To reuse this reset transaction, call \ref mdbx_txn_renew() on it.
Any cursors in this transaction can also be renewed using \ref mdbx_cursor_renew()
or freed by \ref mdbx_cursor_close().
To permanently free a transaction, reset or not, use \ref mdbx_txn_abort().
## Cleaning up
Any created cursors must be closed using \ref mdbx_cursor_close(). It is advisable
to repeat:
\note An important difference between MDBX and LMDB is that MDBX required that
any opened cursors can be reused and must be freed explicitly, regardless
ones was opened in a read-only or write transaction. The REASON for this is
eliminates ambiguity which helps to avoid errors such as: use-after-free,
double-free, i.e. memory corruption and segfaults.
It is very rarely necessary to close a database handle, and in general they
should just be left open. When you close a handle, it immediately becomes
unavailable for all transactions in the environment. Therefore, you should
avoid closing the handle while at least one transaction is using it.
## Now read up on the full API!
The full \ref c_api documentation lists further details below, like how to:
- Configure database size and automatic size management: \ref mdbx_env_set_geometry().
- Drop and clean a database: \ref mdbx_drop().
- Detect and report errors: \ref c_err.
- Optimize (bulk) loading speed: \ref MDBX_MULTIPLE, \ref MDBX_APPEND.
- Reduce (temporarily) robustness to gain even more speed: \ref sync_modes.
- Gather statistics about the database: \ref c_statinfo.
- Sstimate size of range query result: \ref c_rqest.
- Double performance by LIFO reclaiming on storages with write-back: \ref MDBX_LIFORECLAIM.
- Use sequences and canary markers: \ref mdbx_dbi_sequence(), \ref MDBX_canary.
- Use Handle-Slow-Readers callback to resolve a database full/overflow issues
due to long-lived read transactions: \ref mdbx_env_set_hsr().
- Use exclusive mode: \ref MDBX_EXCLUSIVE.
- Define custom sort orders (but this is recommended to be avoided).