PlasmaKV is a library for accessing key/value databases stored in
PlasmaFS filesystems. The library is accompanied by a command-line
plasma_kv, which we use in the following to demontrate
what PlasmaKV can do (see
Cmd_plasma_kv for documentation).
A database maps unique keys to values, i.e. it is just a simple index like NDBM. The keys have a limited size which must be specified when the database is created. The values can have any size.
Example: Create a new database /demo with keys up to 128 bytes length:
$ plasma_kv create -db /demo -max-key-length 128
This actually created three files:
$ plasma ls /demo.* -rw-r--r-- gerd gerd 0 2011-10-27 00:36 demo.data -rw-r--r-- gerd gerd 0 2011-10-27 00:27 demo.del -rw-r--r-- gerd gerd 65536 2011-10-27 00:36 demo.idx
The file with suffix
.data contains the key/value pairs in
sequential order. In the file with
.del as suffix pairs are
accumulated that have been marked as deleted. The
.idx suffix is
used for the index mapping keys to locations in the
.data file. The
index is a B+ tree, and thus implicitly sorts the keys
Insert an entry:
$ plasma_kv insert -db /demo my_key <my_value
Now the key "my_key" maps to a value which is taken as the contents of
the file "my_value". This
insert function can also work over lists of
Look the entry up:
$ plasma_kv lookup -db /demo my_key
There is also a way to delete keys, and to vacuum the database.
The Ocaml API is provided by the module
Pkv_api. See there for
Now, why would you want to use PlasmaKV? It has a number of extremely interesting features:
.delsuffix. The space consumed by the corresponding value is not immediately reclaimed. This means that deletes are fast, and the whole design does not have to take space reclamation into account (simplicity). The downside is that it is required to run a vacuum job every now and then to reclaim space. The vaccum job counts as writer, so other writes are not possible at this time (but reads are).
The typical application
As PlasmaFS supports transactional file operations, the implementation
of the database library is stunningly simple, and the core of the
library takes less than 2000 lines of code. From the programmer's
view it looks like as if we do not take any concurrent accesses, any
consistency or atomicity problem into account. We just append new
entries to the
.data file, and update the
.idx file in-place. The
whole "trick" is to wrap these evident operations into snapshot
transactions. So, what's that?
PlasmaFS supports transactions like SQL, and this means one can run file
operations (like lookup, read, write, rename) between the special
commit_transaction directives to get
transactional behavior. The effects of the transacted operations
is hidden from other users until
commit_transaction has run
successfully. We do not want to go too deeply into the implementation
of this, but this makes already clear why we can just modify the files
without having to take other users into account. First at
commit_transaction time the other users can see the changes, and
the other users see all changes at once (a commit is atomic).
Details of the transactional scheme
A particular detail of the machinery is quite interesting, though. When a data block of a file is modified, PlasmaFS does not allow it to modify the block in-place (i.e. to change the block directly). The primary reason is that this causes unmanageable consistency problems. Instead, PlasmaFS allocates a replacement block somewhere, and stores a copy of the original block there after applying the requested changes. When this is done inside a transaction, these replacement blocks just accumulate, and at commit time, the block list is modified, and the so-far hidden replacement blocks become official. The original blocks are available over the whole time of the transaction, so when the file is read-accessed at the same time, the readers just access the original, unmodified blocks. To make the story complete, the original blocks are even specially protected after the commit when there are still readers. The readers keep their old view this way.
Now, this does not yet explain completely how it comes that readers have a consistent view. The mentioned mechanism works only for the case when readers have already accessed a block - it is guaranteed that once a block is read, the readers see the same state for the rest of their transaction. However, the readers would see committed changes of blocks the readers have not yet accessed but will access. This could cause inconsistencies. The solution is called "snapshot transaction" in PlasmaFS. This mode increases the isolation level, and the readers are guaranteed to see the state from the beginning of a transaction throughout the remaining transaction, even if writers commit in parallel. Snapshots are not implemented in the server, but just in the client - it is sufficient to download the complete block list to get the right level of protection from the server. Remember that the namenode does not know whether a block is actually accessed or not. It only knows which blocks it has made known to the accessing client. Because of this it is sufficient to download the block information for a file range to snaphot-protect this range. This is now just done for the complete block list, as we want to take a snapshot of the whole database.
All the complicated management of the transactions is implemented in the PlasmaFS namenode server. The PlasmaFS client just uses the provided file operations in a tricky way to get the right isolation level. For the database code everything becomes absolutely simplistic: By just enclosing the file operations into a snapshot transaction, all the fascinating concurreny, consistency, and atomicity properties are automatically ensured without additional coding.
Bypassing the namenode
In a larger PlasmaFS cluster, the namenode could become a bottleneck: all file transactions need the namenode at some place. For a high-performance database this is a problem, because the speed of the database should not be limited by overloading the namenode.
At least for reading the database, the implemented transactional scheme has an interesting side-effect. The namenode is, after opening the files, only needed for loading the parts of the block list that are not yet known to the client. However, we already mentioned that we load the whole block list at database opening time (for taking snapshots). The consequence is that the namenode is no longer accessed after that! All read accesses have now only to load data blocks and can do this directly from the data nodes.
The library provides a function for checking whether the database
has been updated since the snapshot was taken. This function, of
course, needs the namenode, but it is really the only one, and
you normally call it only once in a while.