Plasma Quickstart

This guide explains how to get a running system as quickly as possible. It also describes which parts of Plasma are needed for which types of applications:

running map/reduce jobs without PlasmaFS
running map/reduce jobs with PlasmaFS
using PlasmaFS as network filesystem

Operating System

So far, Plasma has only been tested on 64 bit Linux. It may also run in 32 bit mode, but there might be issues.

Whether it runs on other operating systems is totally unknown. There is a chance it could work on

FreeBSD 9
Open Solaris

but it is totally untested. Other systems will certainly not work.

If you are new to Plasma, trying one of these untested systems is not recommended. Stick to 64 bit Linux (all CPUs should work). If you are an experienced Plasma user, reports about which problems occur on which OS (and which do not) are welcome.

Build it

Before you can do anything, you need to build Plasma, and install the resulting libraries and binaries.

If you are using a cluster of machines, note that you need to install the libraries and binaries on a single machine only. We call this machine the operator node. The programs must eventually be copied to the other machines as well, but this step is called deployment and is supported by separate scripts.

Essentially, there are three options for the build:

Download plasma-<version>.tar.gz, configure the tree, build it, and install it. This is described (very briefly) in the file INSTALL included in the tarball. This method has the advantage that you see what is actually happening. The downside is that you need a lot of prerequisite software before you can even start. Currently (January 2012), the prerequisites are not all available as binary downloads (there are no debs or rpms that are recent enough). In short: this is the hard way. Don't do it unless you want to help develop Plasma.
If you are a happy user of GODI, the popular Ocaml distribution, you will find Plasma there as package godi-plasma. Just install it, and you are done. (How to get GODI? See get_godi.html.) Well, actually it is not quite that simple. You also need a few prerequisites, but they are normally available from your Linux distro: a working C compiler, the PostgreSQL database, and a few libraries (in particular the development packages for pcre and for PostgreSQL, the latter called libpq in Debian).
If you are lazy, just use this script: plasma_install.sh. It guides you through the build. This is the recommended way, especially if you have never built complex pieces of Ocaml software before. There is a transcript of a sample build here: Sess_plasma_install.

You may ask why there is no "normal" way of getting Plasma, like a deb or rpm package. Plasma simply needs very recent prerequisites, which are not yet available in Linux distros. (Hopefully, this will change.)

The result of the build is that the software is installed under a certain path prefix <prefix>, especially:

<prefix>/bin contains executables
<prefix>/lib/ocaml/pkg-lib contains libraries, especially plasmaclient and mr_framework
<prefix>/doc/godi-plasma contains documentation and examples
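To double-check an installation, the layout above can be verified with a few lines of Ocaml. This is only a minimal sketch using the standard library; the function names install_dirs and missing_dirs are made up for illustration and are not part of Plasma.

```ocaml
(* Sketch: derive the install locations listed above from a prefix.
   install_dirs and missing_dirs are hypothetical helper names. *)
let install_dirs prefix =
  [ "executables", Filename.concat prefix "bin";
    "libraries", Filename.concat prefix "lib/ocaml/pkg-lib";
    "documentation", Filename.concat prefix "doc/godi-plasma" ]

(* Return the labels of the locations that are not directories on disk. *)
let missing_dirs prefix =
  install_dirs prefix
  |> List.filter (fun (_, dir) ->
       not (Sys.file_exists dir && Sys.is_directory dir))
  |> List.map fst
```

For example, missing_dirs "/opt/godi" returns the empty list when all three directories exist. The prefix /opt/godi is just a placeholder here; use the prefix you chose for the build.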

When you use the GODI method for the build, there will also be unrelated software installed under <prefix> - this is just a side-effect of the build.

Things you should not do: Do not try to find shortcuts for the build. This creates more problems than it solves. For example, don't try to use the ocaml compiler that comes with your Linux distro. Ocaml libraries built with different versions of the compiler cannot be mixed, and attempts to do so lead to checksum mismatches.

What you need for which application

Trying out map/reduce without PlasmaFS

Since Plasma-0.6, it is possible to run map/reduce jobs without PlasmaFS. The data files are just stored in the local Unix filesystem. Of course, you are then restricted to a single computer. This mode exists mainly for trying out map/reduce for the first time.

So, if this applies to you, you can skip the PlasmaFS deployment.

Remember that the map/reduce configuration file must explicitly disable PlasmaFS. E.g. if your map/reduce program is called my_prog, there is a configuration file my_prog.conf, and it must conform to:

netplex {
  namenodes {
    disabled = true;                       (* required *)
  };
  mapred {
    node { addr = "localhost" };           (* only one node "localhost" *)
    ...                                    (* other settings *)
  };
  mapredjob {
    ...                                    (* other settings *)
  };
}

Caveat: There are many configuration files that look similar. We refer here to the file configuring the map/reduce job.

Read more about map/reduce in these two documents:

Plasmamr_howto: Explains how to run a job in classic mode.
Plasmamr_toolkit: The advanced functional toolkit. Recommended for real FP programmers and type enthusiasts.

Using map/reduce with PlasmaFS

In this case, you should read the instructions in Plasmafs_deployment. In short, you need

a few computers (not too small) connected in a LAN,
disk space on these computers,
a running PostgreSQL database on at least one node (with a few special configurations),
ssh access to all these computers from the operator node (preferably without password).

The deployment document explains this in detail. Note that you do not need to configure NFS support in PlasmaFS for just running map/reduce.

For running a map/reduce job, you need to know two PlasmaFS settings:

the name of the PlasmaFS cluster, and
the machine and port where the PlasmaFS namenode is running (the computer with the PostgreSQL database)

The map/reduce configuration file must then look like:

netplex {
  namenodes {
    clustername = "the name of the PlasmaFS cluster";
    node { addr = "namenode host:namenode port" };
  };
  mapred {
    ...                                    (* other settings *)
  };
  mapredjob {
    ...                                    (* other settings *)
  };
}
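The two configuration files shown in this chapter differ only in the namenodes section. As an illustration, the following sketch renders that section for both cases. It is not part of Plasma: the type storage and the function namenodes_section are hypothetical names, and the mapred and mapredjob sections are deliberately left out because their settings depend on your job.

```ocaml
(* Which storage backend the map/reduce job should use.
   This type is an illustration only, not a Plasma API. *)
type storage =
  | Local_files   (* no PlasmaFS, single machine *)
  | Plasmafs of { clustername : string; host : string; port : int }

(* Render only the namenodes section of the job configuration.
   The mapred and mapredjob sections are not covered by this sketch. *)
let namenodes_section = function
  | Local_files ->
      "  namenodes {\n    disabled = true;\n  };\n"
  | Plasmafs { clustername; host; port } ->
      Printf.sprintf
        "  namenodes {\n    clustername = %S;\n    node { addr = \"%s:%d\" };\n  };\n"
        clustername host port
```

Calling namenodes_section with Local_files yields the section with disabled = true, and calling it with a Plasmafs value yields the clustername and node entries. Any cluster name, host, and port you pass in must of course match your actual PlasmaFS deployment.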

It is not necessary to configure anything on the computers running map/reduce tasks. They will automatically get the required settings together with the other task parameters.

Using PlasmaFS as network filesystem

This application allows you to store large files in a replicated way. Also, PlasmaFS is, to some degree, fault-tolerant, and gets you close to high availability. Finally, PlasmaFS can be configured to be highly secure.

This case is very similar to the previous application: read the instructions in Plasmafs_deployment.

Remember that there are several ways of accessing PlasmaFS:

From the shell, use the plasma utility (see Cmd_plasma).
From Ocaml code, use the plasmaclient library, and especially the Plasma_client module in it
A slightly simpler (but also less complete) alternative is Plasma_netfs
From anywhere, access PlasmaFS via the NFS bridge

The first three options use the PlasmaFS protocol to talk to the server nodes. In order to get access to the cluster from a machine, you need to set up two things on this machine:

The authentication daemon must run there. In Plasmafs_deployment it is explained that there is a file authnode.hosts where you can add the host names of all machines running this daemon.
You need the clustername and at least the host name and port of one live namenode in order to connect. You can simply write these settings into the file ~/.plasmafs (explained in Plasma_client_config).
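Since the namenode is always given as a host name plus a port, it can help to validate such an address before writing it into a configuration file. The following is a minimal sketch using only the standard library; parse_addr is a hypothetical helper, not a function of the plasmaclient library, and it assumes the "host:port" notation used in the node { addr = ... } settings.

```ocaml
(* Split "host:port" at the last colon and check that the host is
   non-empty and the port is an integer between 1 and 65535.
   parse_addr is a hypothetical helper, not part of Plasma. *)
let parse_addr s =
  match String.rindex_opt s ':' with
  | None -> Error "missing ':' between host and port"
  | Some i ->
      let host = String.sub s 0 i in
      let port_s = String.sub s (i + 1) (String.length s - i - 1) in
      (match int_of_string_opt port_s with
       | Some port when host <> "" && port > 0 && port < 65536 ->
           Ok (host, port)
       | _ -> Error "empty host or invalid port")
```

The function returns Ok (host, port) for a well-formed address and an Error with a short message otherwise. It only checks the syntax; whether a namenode is actually reachable at that address is a different question.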

The NFS bridge makes it even simpler to access the PlasmaFS files: You can simply mount the filesystem and use normal file access functions. You can read how to do this here: Plasmafs_nfs.

This web site is published by Informatikbüro Gerd Stolpmann
