PlasmaFS - A distributed file system for large files

PlasmaFS is a user-space file system that can be characterized as follows:

  • Distributed: The file system runs on an unlimited number of nodes. The nodes are either namenodes or datanodes. Namenodes store meta information, datanodes store data blocks.
  • Replication: All data can be stored in multiple replicas. This is not only true for the data blocks, but also for the metadata (i.e. any number of namenodes is possible).
  • Transactional semantics: Both data and metadata accesses are done in a transactional manner. The file system guarantees ACID semantics (atomic operations, consistency of stored data, isolation of parallel accesses, durability of stored data).
  • Complete set of file operations: Unlike many other distributed file systems, PlasmaFS supports all file operations, including random reads and writes.
  • Focus on large blocks: PlasmaFS uses a uniform block size on all datanodes. The block size can be configured by the user; a reasonable range is 64K to 64M.
  • Fast recovery: When a datanode or namenode fails, the state of the overall system can be quickly recovered. Datanode failures are handled automatically and are transparent to the user. Namenode failures require a restart of the namenode daemons, so that a different node can be elected as coordinator. No other actions are required - the metadata is available in a consistent way on all namenodes, and no data copying needs to take place in case of a failure. Of course, for both types of node failure the administrator still has to add replacement nodes; PlasmaFS supports this task as much as it can.
  • Extensibility: Additional datanodes can be added to a PlasmaFS cluster without any service interruption. For adding namenodes, it is required to copy the metadata database to the new node, and to restart all namenode servers.
PlasmaFS file operations can be submitted in the following ways:

  • Native PlasmaFS RPC API: The full feature set is available in this SunRPC-based API. The primary client library is written in O'Caml and supports synchronous and asynchronous file operations. As SunRPC is an Internet standard, it is easy to develop client libraries in any programming language for which SunRPC bindings are available. (This has not yet been done, though.) The native PlasmaFS API is a "must" for getting the highest performance and fully reliable access (a usage sketch follows below this list).
  • Command-line PlasmaFS client utility: Typical file operations can be done with this client utility which uses the O'Caml client library.
  • NFS bridge: PlasmaFS includes an NFS version 3 bridge - the bridge is an NFS server and forwards all file accesses to the PlasmaFS system. This allows users to mount PlasmaFS volumes directly. The NFS bridge is not as performant as the native interface (because you cannot bundle a set of file operations into a single transaction - for NFS every block access is a transaction of its own), but is very convenient because all programs can immediately access PlasmaFS files.
In the future it is also planned to provide a WebDAV bridge.
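
The following sketch gives an impression of how the native API is used from O'Caml: several file operations are bundled into one transaction that is submitted to the coordinator. The module Fs and its functions are hypothetical stand-ins chosen to illustrate the style of use, not the real client library.

    (* Sketch only: the module Fs is a hypothetical stand-in for the real
       O'Caml client library; names and signatures are illustrative. *)
    module Fs = struct
      type cluster
      type trans
      let open_cluster (_namenodes : string list) : cluster = failwith "stub"
      let start_trans (_ : cluster) : trans = failwith "stub"
      let create_file (_ : trans) (_path : string) : unit = failwith "stub"
      let write (_ : trans) (_path : string) (_pos : int64) (_data : string)
                : unit = failwith "stub"
      let commit (_ : trans) : unit = failwith "stub"
    end

    let () =
      (* Contact any namenode; the library talks to the elected coordinator. *)
      let cluster = Fs.open_cluster [ "namenode1"; "namenode2" ] in
      (* Several file operations bundled into a single transaction: *)
      let t = Fs.start_trans cluster in
      Fs.create_file t "/data/example";
      Fs.write t "/data/example" 0L "hello, plasma";
      Fs.commit t
      (* At commit time the metadata changes become visible and are
         replicated to all namenodes. *)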

Theory of operation

This diagram shows a sample PlasmaFS cluster with four namenodes and four datanodes. The namenodes are backed by PostgreSQL database servers (the "pg" data stores). The datanodes are backed by "block stores", i.e. large files storing the blocks of the datanodes.

In general, the intelligence is in the namenodes and in the clients. The datanode servers are dumb, i.e. they only respond to simple block read and write requests without knowing any context.

The client has to contact only one of the namenodes: the so-called coordinator, which is elected among the namenodes at cluster startup. This special namenode is the central synchronization instance. This is where the transactions "live", i.e. all temporary state required to manage a transaction resides in the coordinator. When a transaction modifies metadata, the changes are replicated at commit time to the other namenodes. The client is not involved in this part, as metadata replication is hidden from it.

The namenode data is stored in PostgreSQL databases. The replication from the coordinator to the other namenodes uses the two-phase commit protocol, providing consistency of the database contents at all times.

The data blocks are stored on the datanodes. The client has to manage replication and failover of data blocks: when a data block is replicated N times, it is the task of the client to write it to all N datanode servers holding the replicas. Conversely, when the client wants to read a data block, it has to choose one of the servers holding a replica and contact it. For both tasks the coordinator helps as much as it can: for writes, it tells the client where to store the replicas; for reads, it provides the client with the list of replicas, together with the information whether the datanodes are responsive.
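
The read path can be sketched as follows: the coordinator is asked which blocks make up a byte range and which datanodes hold responsive replicas; the client then picks a replica and fetches the block itself. All types and functions here are hypothetical stand-ins, not the real PlasmaFS API.

    (* Hypothetical types describing what the coordinator reports. *)
    type replica = { datanode : string; alive : bool }
    type extent  = { block : int64; replicas : replica list }

    (* Hypothetical: ask the coordinator for the block list of a byte range. *)
    let get_blocklist (_path : string) (_pos : int64) (_len : int)
                      : extent list = failwith "stub"

    (* Hypothetical: fetch one block from one datanode. *)
    let read_block (_datanode : string) (_block : int64) : string =
      failwith "stub"

    let read_range path pos len =
      get_blocklist path pos len
      |> List.map
           (fun ext ->
              (* The client chooses the replica; the coordinator only says
                 which datanodes are currently responsive. *)
              match List.find_opt (fun r -> r.alive) ext.replicas with
              | Some r -> read_block r.datanode ext.block
              | None -> failwith "no responsive replica")
      |> String.concat ""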

The actual data stream may be organized differently, though. In particular, it is possible that the client writes only the first replica of a data block to a datanode directly, and directs this datanode to copy the block to the other datanodes (instead of writing them itself). This "chain topology" is sometimes preferable to the "star topology" for writes, namely when the bandwidth between the client and the PlasmaFS cluster is lower than the bandwidth within the cluster.

For writing data blocks, PlasmaFS implements a special scheme that prevents the client from overwriting the wrong data blocks. Within a transaction, the client is only allowed to write to the data blocks the coordinator has granted access to. This is achieved by handing out access tickets to the client. The client must present these tickets to the datanode server to get permission to write to a data block. A ticket is only valid for a certain block in a certain transaction. (A similar scheme could be installed for securing read accesses; for various reasons this is not yet done, though the APIs are already designed for it.)
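
The ticket check can be pictured as follows; the field names are hypothetical and the real wire format differs.

    (* Hypothetical ticket representation. *)
    type ticket = { trans_id : int64; block : int64; secret : string }

    (* Datanode-side check before a block write is accepted: the ticket must
       match the block and the transaction it was issued for. *)
    let write_allowed ~(presented : ticket) ~(granted : ticket) : bool =
      presented.trans_id = granted.trans_id
      && presented.block  = granted.block
      && presented.secret = granted.secret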

Because files are written in transactions, it is not allowed to overwrite blocks in place; a block can only be replaced by a newer version. The reason is simply that a transaction can be rolled back, either at the user's request or because of operational problems. Because of this, the old version of a data block must stay in place until the transaction is committed. It is the task of the client to simulate block overwrites by reading the old version of the block, modifying the data in the block, and finally writing the replacement block to a different location in the data store. (The client library supports this, so for user code this is transparent.)
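
The following sketch shows how a client library can simulate such an overwrite: read the old version, patch it in memory, and write the result as a replacement block at a new location, so that the old version survives until commit. All functions are hypothetical stand-ins for the real client library.

    let read_block (_block : int64) : bytes = failwith "stub"
    let alloc_replacement_block (_trans : int64) : int64 = failwith "stub"
    let write_block (_trans : int64) (_block : int64) (_data : bytes) : unit =
      failwith "stub"
    let record_replacement (_trans : int64) ~old_block:(_ : int64)
                           ~new_block:(_ : int64) : unit = failwith "stub"

    (* Overwrite part of [old_block]: the old block stays untouched until the
       transaction commits, so a rollback can always fall back to it. *)
    let overwrite_in_block trans old_block ~off ~data =
      let buf = read_block old_block in                 (* 1. old version     *)
      Bytes.blit_string data 0 buf off (String.length data);
                                                        (* 2. patch in memory *)
      let new_block = alloc_replacement_block trans in  (* 3. new location    *)
      write_block trans new_block buf;                  (* 4. replacement     *)
      record_replacement trans ~old_block ~new_block    (* 5. switch at commit *)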

Implementation

PlasmaFS is implemented in the programming language O'Caml. This means it is written in a safe language which is nevertheless very fast (compiled to machine code) and available for all major platforms. As PlasmaFS also uses multiprocessing to achieve parallelism, it is for now restricted to Unix-type systems (Linux, BSD, Mac OS X, Solaris, ...).

All PlasmaFS operations are generally implemented in an asynchronous way. This primarily means that PlasmaFS breaks out of the strict "request-response" scheme often assumed by network servers, in which the next request can only be processed when the previous one has finished. In contrast, a client can send PlasmaFS several requests in sequence, and PlasmaFS starts processing them as they arrive, independently of whether the previous requests have been completed. The responses can be sent back to the client in any order.

The only restriction PlasmaFS imposes is that a single transaction follows the strictly synchronous "request-response" scheme (for clear semantics - imagine sending a transaction commit before the preceding operations have finished; this restriction may be weakened in the future). As one can run several transactions at once on the same network connection, this is not a big restriction in practice.
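
Conceptually, the pipelining looks like this: requests are issued without waiting for earlier responses, and an event loop (driven by Equeue) dispatches the responses to callbacks as they arrive. The function async_call is a hypothetical stand-in for an asynchronous call of the client library.

    (* Hypothetical asynchronous call: registers a callback instead of
       blocking until the response has arrived. *)
    let async_call (_request : string) ~on_reply:(_ : string -> unit)
                   : unit = failwith "stub"

    let run_event_loop () : unit = failwith "stub"  (* driven by Equeue *)

    let () =
      (* Two independent transactions share one connection. Within each
         transaction the calls follow the request/response scheme, but the
         two transactions overlap freely, and their responses may come back
         in any order. *)
      async_call "T1: begin"
        ~on_reply:(fun _ -> async_call "T1: lookup /a" ~on_reply:ignore);
      async_call "T2: begin"
        ~on_reply:(fun _ -> async_call "T2: lookup /b" ~on_reply:ignore);
      run_event_loop ()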

To achieve asynchronous execution, PlasmaFS is written on top of a user-space cooperative microthreading library (the Equeue library in Ocamlnet).

PlasmaFS is lock-free. With cooperative microthreading it is not required to protect shared data with locks - the data is implicitly protected because the microthreads have full control over the execution flow. Also, in the cases where the semantics imply restrictions on what parallel transactions can do ("transaction isolation"), PlasmaFS prefers to reject operations that would violate the isolation guarantees rather than to wait until the other transaction releases its locks. In such cases the error code ECONFLICT is returned to the client, and it is up to the client to resolve the problem. This ensures that the PlasmaFS namenode servers are always responsive and never "lock up".
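
A typical client strategy is to simply run the transaction again. The following sketch shows such a retry wrapper; the exception name Econflict and the run_transaction argument are hypothetical, and the real client library reports the error code in its own way.

    exception Econflict

    (* Run a transaction body; if the namenode rejects it with ECONFLICT,
       back off briefly and run the whole transaction again from scratch. *)
    let rec with_retries ?(attempts = 5) (run_transaction : unit -> 'a) : 'a =
      try run_transaction () with
      | Econflict when attempts > 1 ->
          Unix.sleepf 0.05;
          with_retries ~attempts:(attempts - 1) run_transaction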

PlasmaFS uses PostgreSQL to store metadata. This ensures that the metadata is stored in a very reliable way: PostgreSQL is a mature database system that already guarantees ACID semantics on its own. For example, it is not possible that the metadata gets corrupted because a disk fills up. Furthermore, PostgreSQL supports two-phase commit, which PlasmaFS uses to distribute metadata writes to all namenodes. This guarantees that all namenodes are always in a consistent state, and that in case of a namenode failure exactly the state seen at the last transaction commit can be recovered. It is also a prerequisite for letting namenode slaves respond to read-only metadata requests (although this is not yet implemented).
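
The following sketch illustrates how two-phase commit distributes a metadata commit over the namenode databases. The db type and the exec function are hypothetical stand-ins for a PostgreSQL client binding; the SQL statements are PostgreSQL's standard two-phase commit commands.

    (* Sketch of the idea, not PlasmaFS's actual code: the coordinator first
       prepares the same metadata transaction on every namenode database and
       only then commits it everywhere. *)
    type db
    let exec (_ : db) (_sql : string) : unit = failwith "stub"

    let replicate_commit (namenode_dbs : db list) (gid : string) : unit =
      (* Phase 1: every namenode durably prepares the transaction; if any
         PREPARE fails, the transaction is rolled back on all nodes. *)
      List.iter
        (fun db -> exec db (Printf.sprintf "PREPARE TRANSACTION '%s'" gid))
        namenode_dbs;
      (* Phase 2: once all namenodes have prepared, the commit can no longer
         be lost, even if a namenode crashes and restarts in between. *)
      List.iter
        (fun db -> exec db (Printf.sprintf "COMMIT PREPARED '%s'" gid))
        namenode_dbs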

PlasmaFS preallocates the data blocks on the datanodes. This allows PlasmaFS to plan in advance on which datanodes it stores which blocks, with a very high likelihood that the plan can actually be executed. It cannot happen that a datanode disk fills up because of another process unrelated to PlasmaFS.
