Plasma GitLab Archive


Plasma Key/Value Databases

PlasmaKV is a library for accessing key/value databases stored in PlasmaFS filesystems. The library is accompanied by a command-line utility, plasma_kv, which we use in the following to demonstrate what PlasmaKV can do (see Cmd_plasma_kv for documentation).

A database maps unique keys to values, i.e. it is just a simple index like NDBM. The keys have a limited size which must be specified when the database is created. The values can have any size.

Using plasma_kv

Example: Create a new database /demo with keys of up to 128 bytes in length:

$ plasma_kv create -db /demo -max-key-length 128

This actually created three files:

$ plasma ls /demo.*
-rw-r--r-- gerd gerd         0 2011-10-27 00:36 demo.data 
-rw-r--r-- gerd gerd         0 2011-10-27 00:27 demo.del  
-rw-r--r-- gerd gerd     65536 2011-10-27 00:36 demo.idx  

The file with the .data suffix contains the key/value pairs in sequential order. The file with the .del suffix accumulates pairs that have been marked as deleted. The .idx suffix is used for the index mapping keys to locations in the .data file. The index is a B+ tree, and thus implicitly sorts the keys lexicographically.
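To make the roles of the three files concrete, here is a deliberately simplified in-memory OCaml sketch (the names and record layout are made up for illustration; the real .idx file is a B+ tree on disk, not a hash table, and this is not Pkv_api code):

```ocaml
(* Toy model of the PlasmaKV file layout; NOT the real on-disk format.
   The .data file is an append-only log of values, and the .idx file
   maps each key to an (offset, length) position in that log. *)
type db = {
  data : Buffer.t;                      (* models demo.data *)
  idx : (string, int * int) Hashtbl.t;  (* models demo.idx: key -> (off, len) *)
  max_key_length : int;
}

let create ~max_key_length =
  { data = Buffer.create 64; idx = Hashtbl.create 16; max_key_length }

let insert db key value =
  if String.length key > db.max_key_length then
    invalid_arg "key too long";
  let off = Buffer.length db.data in
  Buffer.add_string db.data value;      (* append-only, like .data *)
  Hashtbl.replace db.idx key (off, String.length value)

let lookup db key =
  match Hashtbl.find_opt db.idx key with
  | None -> None
  | Some (off, len) -> Some (Buffer.sub db.data off len)
```

Note that an update of an existing key just appends the new value and repoints the index entry; the old value stays in the log, which is exactly why a separate reclamation mechanism (the .del file and vacuuming, below) is needed.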

Insert an entry:

$ plasma_kv insert -db /demo my_key <my_value

Now the key "my_key" maps to a value which is taken as the contents of the file "my_value". This insert function can also work over lists of keys (options -files and -keys-and-files).

Look the entry up:

$ plasma_kv lookup -db /demo my_key

There is also a way to delete keys, and to vacuum the database.


The OCaml API is provided by the module Pkv_api. See there for details.


Now, why would you want to use PlasmaKV? It has a number of extremely interesting features:

  • Library, not server: Instead of contacting a database server, the programmer just uses the library and accesses the shared filesystem from any machine of the cluster. Effectively, there is a lot of network interaction happening behind the scenes, but the programmer does not see any of this. It looks as if a local file were accessed.
  • Huge databases are supported: Theoretically, the database can become so large that it fills up all PlasmaFS space on all connected disks. The format, at least, defines no limit. The consumed memory grows a bit because the complete block list is read into RAM, but it is highly compressed. Databases in the multi-terabyte range should be absolutely no problem.
  • The data is replicated: All PlasmaFS files support replication, i.e. every data block is stored on several different computers.
  • Outages: If a computer storing a data block crashes, PlasmaFS automatically uses a replica. Also, the allocation strategy implies a striping effect, which means that the load of the crashed computer is randomly distributed over the remaining computers storing data blocks.
  • Consistency: The databases are usually accessed and modified in transactional mode. This means that a change is only made permanent when all parts of the change have been written to disk (comparable to the consistency guarantees a SQL database gives). Also, a change is never partially visible to readers (atomicity): it is either completely invisible or completely visible.
  • Concurrency: Of course, the databases can be read by many readers in parallel, no matter whether the readers run on the same or different machines. In addition, there can be one writer at a time. The writer does not lock out readers and vice versa, i.e. it is possible to update the database while it is read-accessed, without interruption.
  • Asynchronous lookups: A few access methods, including lookups, are also available in an asynchronous variant for event-driven programming.
There are also some points that have a good and a bad side:

  • Deletes: When a key is deleted, it is just marked as deleted in the index, and an entry is added to the file with the .del suffix. The space consumed by the corresponding value is not immediately reclaimed. This means that deletes are fast, and the whole design does not have to take space reclamation into account (simplicity). The downside is that a vacuum job must be run every now and then to reclaim space. The vacuum job counts as a writer, so other writes are not possible during that time (but reads are).
  • Locking: A writer acquires an inode lock from PlasmaFS, which prevents other users from writing to the database. This type of lock is very simple, though, and one cannot wait until it is released. The only possible method is to busy-wait until the lock goes away.
  • Transactions: The transactional mode makes the databases fully ACID-compliant. However, there is a price to pay: First, opening the database may take some time (because the block lists need to be downloaded from the namenode). Second, additional RAM is consumed, both in the namenode (where the data blocks included in the transaction need to be protected) and in the accessing client (for the block lists). Because of this, there is also a non-transactional mode.
  • Vacuum: The vacuum operation actually reads all entries, filters out the deleted ones, and writes the remaining ones into a new file. This is expensive. In the future there might be an alternative, though, because PlasmaFS allows cutting holes into files, so larger unused regions could also be reclaimed by deleting the corresponding blocks from the block list.
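The delete-and-vacuum trade-off can be sketched as a tiny self-contained OCaml model (an illustration only, not PlasmaKV code; the real library works on PlasmaFS files and a B+ tree index):

```ocaml
(* Simplified model of the delete/vacuum scheme: deleting a key only
   records a mark (cheap); vacuum rewrites the surviving entries into
   a fresh log, which is the expensive part. *)
type db = {
  mutable entries : (string * string) list;  (* models the .data log *)
  mutable deleted : string list;             (* models the .del file *)
}

let delete db key =
  (* no data is moved or reclaimed here *)
  db.deleted <- key :: db.deleted

let lookup db key =
  if List.mem key db.deleted then None
  else List.assoc_opt key db.entries

let vacuum db =
  (* read all entries, drop the deleted ones, keep the rest *)
  db.entries <-
    List.filter (fun (k, _) -> not (List.mem k db.deleted)) db.entries;
  db.deleted <- []
```

In the real library the vacuum pass writes the surviving entries into a new file and rebuilds the index; the model above only captures the filtering step and why it must scan everything.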


The typical application

  • needs only a simple database scheme, and one or two key/value databases are sufficient to represent the data
  • is totally dominated by read accesses
  • reads the database on several machines (for load balancing)
  • updates the database in batch jobs which are only run on one machine (especially no online (immediate) updates)


As PlasmaFS supports transactional file operations, the implementation of the database library is stunningly simple: the core of the library takes less than 2000 lines of code. From the programmer's view it looks as if we did not take concurrent accesses or any consistency or atomicity problems into account. We just append new entries to the .data file and update the .idx file in place. The whole "trick" is to wrap these evident operations into snapshot transactions. So, what is that?

PlasmaFS supports transactions like SQL, which means one can run file operations (like lookup, read, write, rename) between the special start_transaction and commit_transaction directives to get transactional behavior. The effects of the transacted operations are hidden from other users until commit_transaction has run successfully. We do not want to go too deeply into the implementation of this, but it already makes clear why we can just modify the files without having to take other users into account. Only at commit_transaction time do the other users see the changes, and they see all changes at once (a commit is atomic).

Details of the transactional scheme

A particular detail of the machinery is quite interesting, though. When a data block of a file is modified, PlasmaFS does not allow the block to be modified in place (i.e. changed directly). The primary reason is that this would cause unmanageable consistency problems. Instead, PlasmaFS allocates a replacement block somewhere and stores a copy of the original block there, after applying the requested changes. When this is done inside a transaction, these replacement blocks just accumulate; at commit time the block list is modified, and the so-far hidden replacement blocks become official. The original blocks remain available for the whole duration of the transaction, so when the file is read-accessed at the same time, the readers just access the original, unmodified blocks. To make the story complete, the original blocks are even specially protected after the commit when there are still readers. This way the readers keep their old view.
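The replacement-block mechanism is essentially copy-on-write. A rough OCaml sketch of the idea, with a plain in-memory array standing in for a file's block list (none of these names come from PlasmaFS):

```ocaml
(* Copy-on-write sketch of the replacement-block idea: a writer
   prepares private replacement blocks, and the commit is a single
   atomic swap of the block list. Readers holding the old block list
   keep seeing the original, unmodified blocks. *)
type blocks = string array   (* "block list": block number -> contents *)

let store = ref [| "blk0"; "blk1"; "blk2" |]

(* A reader snapshots the file simply by grabbing the current block
   list; the original blocks stay reachable through it. *)
let snapshot () : blocks = !store

(* A writer copies the block list, applies its change to a replacement
   block, and publishes everything at once at commit time. *)
let commit_write i contents =
  let replacement = Array.copy !store in  (* hidden until commit *)
  replacement.(i) <- contents;
  store := replacement                    (* the commit: one atomic swap *)
```

A reader that took a snapshot before a commit continues to see the old contents of every block, while any snapshot taken after the commit sees the new block list, which mirrors how readers of a PlasmaKV database keep their old view across a writer's commit.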

Now, this does not yet completely explain how readers get a consistent view. The mentioned mechanism works only for blocks the readers have already accessed: it is guaranteed that once a block is read, the readers see the same state for the rest of their transaction. However, readers would see committed changes to blocks they have not yet accessed but will access. This could cause inconsistencies. The solution is called a "snapshot transaction" in PlasmaFS. This mode increases the isolation level, and readers are guaranteed to see the state from the beginning of the transaction throughout the remaining transaction, even if writers commit in parallel. Snapshots are not implemented in the server but just in the client: it is sufficient to download the complete block list to get the right level of protection from the server. Remember that the namenode does not know whether a block is actually accessed or not; it only knows which blocks it has made known to the accessing client. Because of this, it is sufficient to download the block information for a file range to snapshot-protect this range. This is simply done for the complete block list, as we want to take a snapshot of the whole database.

All the complicated management of the transactions is implemented in the PlasmaFS namenode server. The PlasmaFS client just uses the provided file operations in a clever way to get the right isolation level. For the database code, everything becomes very simple: by just enclosing the file operations in a snapshot transaction, all the fascinating concurrency, consistency, and atomicity properties are automatically ensured without additional coding.

Bypassing the namenode

In a larger PlasmaFS cluster, the namenode could become a bottleneck: all file transactions need the namenode at some point. For a high-performance database this is a problem, because the speed of the database should not be limited by an overloaded namenode.

At least for reading the database, the implemented transactional scheme has an interesting side effect. After opening the files, the namenode is only needed for loading the parts of the block list that are not yet known to the client. However, we already mentioned that we load the whole block list at database opening time (for taking snapshots). The consequence is that the namenode is no longer accessed after that! All read accesses now only need to load data blocks, and can do this directly from the data nodes.

The library provides a function for checking whether the database has been updated since the snapshot was taken. This function, of course, needs the namenode, but it is really the only one, and you normally call it only once in a while.

This web site is published by Informatikbüro Gerd Stolpmann
Powered by Caml