PlasmaFS is a user-space file system that can be characterized as follows:
Theory of operation
This diagram shows a sample PlasmaFS cluster with four namenodes and four datanodes. The namenodes are backed by PostgreSQL database servers (the "pg" data stores). The datanodes are backed by "block stores", i.e. large files storing the blocks of the datanodes.
In general, the intelligence is in the namenodes and in the clients. The datanode servers are dumb, i.e. only respond to simple block read and write requests without knowing any context.
The client has to contact only one of the namenodes - the so-called coordinator is elected among the namenodes at cluster startup. This special namenode is the central synchronization instance. Here, the transactions "live", i.e. all temporary state resides in the coordinator that is required to manage a transaction. When a transaction modifies metadata, the data is replicated at commit time to the other namenodes. The client has nothing to do with this part as metadata replication is hidden from it.
The namenode data is stored in PostgreSQL databases. The replication from the coordinator to the other namenodes uses the two-phase commit protocol, providing consistency of the database contents at all times.
The data blocks are stored in the datanodes. The client has to manage replication and failover of data blocks: When the datablocks are replicated N times, it is the task of the client to write the datablocks to all datanode servers, including the servers holding the replicas. Conversely, when the client wants to read a datablock, it is the task of the client to choose one of the servers holding a replica, and to contact it. For both tasks, the coordinator helps as much as it can. For writes, it tells the client where it has to store the replicas. For reads, it provides the client with a list of replicas, and the information whether the datanodes are responsive.
The actual data stream may be organized differently, though. Especially, it is possible that the client only writes the first replica of a data block to the datanode directly, and directs this datanode to copy this block to the other datanodes (instead of writing it directly). This "chain topology" is sometimes preferable over the "star topology" regarding writes when the bandwidth between client and PlasmaFS cluster is lower than within the cluster.
For writing data blocks, PlasmaFS implements a special scheme that keeps the client off from overwriting the wrong data blocks. It is only allowed to write to those data blocks the coordinator grants access to within a transaction. This is achieved by handing out access tickets to the client. The client must show these tickets to the datanode server to get the permission to write to a data block. A ticket is only valid for a certain block in a certain transaction. (A similar scheme could be installed for securing read accesses; for various reasons this is not yet done, though the APIs are already designed for this.)
Because files are written in transactions, it is not allowed to overwrite blocks. It is only allowed to replace a block by a newer version. The problem is simply that a transaction can be rolled back - either by user request or by operational problems. Because of this, the old version of the datablock must stay in place until the transaction is committed. It is the task of the client to simulate block overwrites by reading the old version of the block, doing the data modification in the block, and finally writing out the replacement block to a different location in the datastore. (The client library supports this, so for user code this problem is transparent.)
Implementation
PlasmaFS is implemented in the programming language O'Caml. This means it is written in a safe language which is nevertheless very fast (compiled to machine code), and available for all major platforms. As PlasmaFS also uses multiprocessing to achieve parallelism, it is for now restricted to Unix-type systems (Linux, BSD, MacOSX, Solaris, ...).
All PlasmaFS operations are generally implemented in an asynchronous way. This primarily means that it breaks out of the strict "request-response" scheme that is often assumed by network servers, and in which the next request can only be processed when the previous one has been finished. In contrast to this, one can send PlasmaFS several requests in sequence, and PlasmaFS starts processing the following requests when they arrive, and independently of whether the previous requests have been completed. The responses to requests can be sent back to the client in any order.
The only restriction PlasmaFS employs is that single transactions follow the strictly synchronous "request-response" scheme (for clear semantics - imagine you would send a transaction commit before the preceding operations have finished - this restriction may be weakened in the future). As one can run several transactions at once on the same network connection, this is no big restriction, though.
For getting asynchronous execution, PlasmaFS is written on top of a user-space cooperative microthreading library (the Equeue library in Ocamlnet).
PlasmaFS is lock-free. For cooperative microthreading it is not
required to protect shared data with locks - they are implicitly
protected because the microthreads have full control about the
execution flow. Also, in the cases where the semantics implies
restrictions of what parallelly executed transaction can do
("transaction isolation"), PlasmaFS prefers to reject operations that
would violate the isolation guarantees rather than to wait until the
other transaction releases locks. In such cases the error code
ECONFLICT
is returned to the client, and it is up to the client to
resolve the problem. This ensures that the PlasmaFS namenode servers
are always responsive and never "lock up".
PlasmaFS uses PostgreSQL to store metadata. This ensures that the metadata is stored in a very reliable way - this is a mature database system that alone already guarantees ACID semantics. For example, it is not possible that the metadata gets corrupted because the disk fills up. Furthermore, PostgreSQL supports two-phase commit. PlasmaFS uses two-phase commit to distribute metadata writes to all namenodes. This guarantees that all namenodes are always in a consistent state, and that in case of a namenode failure, the exact state can be recovered that was seen by the last transaction commit. It is also a prerequisite to allow that namenode slaves can respond to read-only metadata requests (although this is not yet implemented).
PlasmaFS preallocates the data blocks on the datanodes. This allows it
that PlasmaFS can plan in advance on which datanodes it stores which
blocks, and that there is a very high likelihood that the plan can be
executed. It cannot happen that a datanode disk fills up because of
another process unrelated to PlasmaFS.