Plasma GitLab Archive
Projects Blog Knowledge

Plasmafs_deployment



Deploying PlasmaFS

This document assumes that PlasmaFS is already built and installed. It explains how to deploy it on the various nodes that are part of a PlasmaFS cluster.

Planning

Minimum

A PlasmaFS cluster needs at least one namenode and one datanode. For getting started it is possible to put both kinds of servers on the same machine.

The smallest reasonable blocksize is 64K. There is already a measurable improvement of speed for blocksizes of 1M (better ratio of metadata operations per amount of accessed data). Blocksizes beyond 64M may be broken in the current code base.

Big blocksizes also mean big buffers. For a blocksize of 64M an application consumes already 6.4G of memory for buffers if it "only" runs 10 processes where each process access 10 PlasmaFS files at the same time.

If NFS write performce is important, one should not configure blocksizes bigger than the NFS clients can support on the wire. For example, Linux supports up to 1M in standard kernels. It is better to stay below this value.

If you are undecided, choose 1M as blocksize. Good compromise.

Namenode considerations

The number of namenodes can reasonably be increased to two or three, but it is unwise to go beyond this number. The more namenodes are included in the system the more complicated the commit operation becomes. At a certain point, there is no additional safety from adding more namenodes.

The namenode database is best put on RAID 1 or RAID 10 arrays. Although there are replicas, there is right now no way to repair a damaged namenode db without stopping the whole PlasmaFS system.

Datanode considerations

The number of datanodes is theoretically unlimited. For now it is unwise to add more datanodes than can be connected to the same switch, because there is no notion of network distance in PlasmaFS yet.

There is the limitation that there can be only one datanode identity per machine. Because of this it makes sense to create one big single data volume from the available disks (using RAID 0, or JBOD as provided by volume managers like LVM), and to put the block data onto this volume.

It is advisable to use an extent-based filesystem for this volume, such as XFS. However, this is no requirement - any filesystem with Unix semantics will do.

Raw partitions are not supported, and probably will never be.

Network

PlasmaFS has been developed for Gigabit networks. One should prefer switches with a high bandwidth (there is often a limit of the total bandwidth a switch can handle before data packets are dropped). Ideally, all ports of the switch can simultaneously be run at full speed (e.g. 32 Gbit/s for a 16 port switch). SOHO switches often do not deliver this!

As of now, there is no assumption that the nodes are in the same network segment (no broadcasts). Future versions of PlasmaFS might add optional multicasting-based features, and routers must then be configured for routing multicast traffic.

It is assumed that hostnames resolve to IP addresses. It is allowed that the hostname resolves to a loopback IP address on the local machine. PlasmaFS never tries to perform reverse lookups.

PlasmaFS avoids DNS lookups as much as possible. However, when PlasmaFS clients start up, DNS lookups are unavoidable. A well performing DNS infrastructure is advisable. Actually, any alternate directory system can also be used, because PlasmaFS only uses the system resolver for looking up names. Even /etc/hosts is ok.

Starting point

The machine where PlasmaFS was built and installed is called the operator node in the following text. From the operator node the software is deployed. It is easy to use one of the namenodes or datanodes as operator node.

We assume all daemons on all machines are running as the same user. This user should not be root! It is also required that the nodes can be reached via ssh from the operator node, and that ssh logins are possible without password.

On the namenodes there must be PostgreSQL running (version 8.2 or better). It must be configured as follows:

  • Network logins are not required. Namenodes only access PostgreSQL databases on the same machine. For these local-only logins, it is not required to set passwords. A pg_hba.conf line of
     local all all ident 
    will do the trick (often there by default). This permits logins if the hostname in connection configs is omitted.
  • You have to add the user to PostgreSQL PlasmaFS is running as. Become the user postgres, and run:
     createuser <username> 
    This will ask a few questions. The user must be able to create databases. Other privileges are not required.
  • Test the user setup by opening a local PostgreSQL connection with
     psql template1 </dev/null
    If it does not ask for a password, and if it does not emit an error, everything is fine.
  • Some PostgreSQL installations set the number of prepared transactions to zero by default (this is about two-phase commit). This is too low. For each PlasmaFS instance accessing a PostgreSQL server there must be a buffer for prepared transactions (i.e. normally at least one). You can change that by setting the parameter max_prepared_transaction in postgresql.conf to a positive value. Restart PostgreSQL.
Further remarks:
  • There is no need to install Ocaml on the cluster nodes. Ocaml executables are self-containing except that they dynamically link system libraries.
  • Of course, the required system libraries must also exist on the cluster nodes. This is especially libpcre and the PostgreSQL client library libpq.
The remaining instructions on this page use the scripts in the clusterconfig directory. This is installed together with the other software. It is advisable not to change this version of clusterconfig directly, but to copy it to somewhere else, and to run the scripts on the copy.

Overview over the deployment

The clusterconfig directory contains a few scripts, and a directory instances. In instances there is a subdirectory template with some files. So this looks like

  • clusterconfig: The scripts available here are run on the operator node to control the whole cluster
  • clusterconfig/instances: This directory contains all the instances that are controlled from here, and templates for creating new instances. Actually, any instance can also be used as template - when being instantiated, the template is simply copied.
  • clusterconfig/instances/template: This is the default template. It contains recommended starting points for config files.
A deployment works now by creating a new instance (i.e. by copying the template), and then adapting the instance configuration by editing the copied config files directly. The instance can then be distributed to the other nodes of the cluster, and there are scripts for initializing, starting, and stopping PlasmaFS services.

Creating an instance

Create a new instance with name inst with

./new_inst.sh <inst> <prefix> <blocksize>

The prefix is the absolute path on the cluster nodes where the PlasmaFS software is to be installed. The deploy_inst.sh script (see below) will create a directory hierarchy:

  • <prefix>/bin: for binaries
  • <prefix>/etc: config files and rc scripts
  • <prefix>/log: for log files
  • <prefix>/data: data files (datanodes only)
The deploy_inst.sh script will create these directories only if they do not exist yet, either as directories or symlinks to directories.

The blocksize can be given in bytes, or as a number with suffix "K" or "M".

Optionally, new_inst.sh may be directed to use an alternate template:

./new_inst.sh -template <templ> <inst> <prefix> <blocksize>

templ can be another instance that is to be copied instead of template.

Configuring an instance

After new_inst.sh there is a new directory <inst> in instances. Go there and edit the files:

  • namenode.hosts: Put here the hostnames of the namenodes
  • datanode.hosts: Put here the hostnames of the datanodes
  • nfsnode.hosts: Put here the hostnames of the nodes that will run NFS bridges (may remain empty)
You may also edit the other configuration files, but this is not required.

Deploying an instance

With

./deploy_inst.sh <inst>

one can install the files under prefix on the cluster nodes. The installation procedure uses ssh to copy the files.

The option -only-config restricts the copy to configuration files and scripts, but omits executables. This is useful when the configuration of a running cluster needs to be changed (when running, the executable files cannot be overwritten).

Initializing the namenodes

This step creates the PostgreSQL databases on the namenodes:

./initdb_inst.sh <inst>

If the databases already exist, this step will fail. Use the switch -drop to delete databases that were created in error.

The databases have the name plasma_<inst>.

After initializing the namenodes, it is possible to start the namenode server with

./rc_nn.sh start <inst>

This must be done before continuing with the initialization of the datanodes.

Initializing the datanodes

This step creates the data area for the block files on the datanodes:

./initdn_inst.sh <inst> <size> all

Here, size is the size of the data area in bytes. size can be followed by "K", "M", or "G". If size is not a multiple of the blocksize, it is rounded down to the next lower multiple.

The keyword "all" means that all datanodes are initialized with the same size. Alternately, one can also initialize the nodes differently, e.g.

./initdn_inst.sh inst1 100G m1 m2 m3
./initdn_inst.sh inst1 200G m4 m5 m6

This would initialize the hosts m1, m2, and m3 with a data area of 100G, and m4, m5, and m6 with a data area of 200G.

The initdn_inst.sh script also starts the datanode server on the hosts, and registers the datanodes with the namenode.

Starting and stopping the cluster

You may have noticed that during initialization the cluster nodes were started in this order:

  1. namenodes, collectively
  2. datanodes, one after the other
This is not the usual order! Normally, the datanodes are started first, and the namenodes are started after that. This is because the namenodes try to contact the datanodes when the namenodes are started, and set the state of the datanode identity to "enabled and connected" when the datanode is alive. At initialization time, however, there is no such discovery of datanodes, as the datanodes are added manually to the namenode servers.

There are scripts rc_dn.sh, rc_nn.sh, and rc_nfsd.sh to start and stop datanodes, namenodes, and NFS bridges on the whole cluster. These scripts take the host names of the nodes to ssh to from the configured *.hosts files.

So the right order of startup is:

  1. datanodes, one after the other
  2. namenodes, collectively
  3. NFS bridges
For shutdown, the reverse order is best. There is rc_all.sh to start/stop all configured services in one go.

What is a "collective" startup? As the namenode servers elect the coordinator at startup, they need to be started within a short time period (60 seconds).

Starting and stopping individual nodes

On the cluster nodes, there are also scripts rc_dn.sh, rc_nn.sh, and rc_nfsd.sh in <prefix>/etc. Actually, the scripts in clusterconfig on the operator node call the scripts with the same name on the cluster nodes to perform their tasks. One can also call the scripts on the cluster nodes directly to start/stop selectively a single server, e.g.

ssh m3 <prefix>/etc/rc_dn.sh stop

would stop the datanode server on m3.

The distribution does not contain a script that would be directly suited for inclusion into the system boot. However, the existing rc scripts could be called by a system boot script. (Especially, one would need to switch to the user PlasmaFS is running as before one can call the rc scripts.)

Testing with the plasma client

One can use the plasma client for testing the system, e.g.

plasma -namenode m1:2730 -cluster inst1 list /

would list the contents of / of PlasmaFS.

Admin tasks not covered here

  • How to add another namenode
  • How to add another datanode
  • How to move a namenode to a different machine
  • How to move a datanode to a different machine
  • How to replace damaged namenodes

Admin tasks not yet supported at all

  • get usage statistics per datanode
  • rebalance a cluster
  • replace damaged datanodes (no way yet to create missing replicas)

This web site is published by Informatikbüro Gerd Stolpmann
Powered by Caml