Hadoop is a popular map/reduce platform written in Java. It is
restricted to a fixed algorithmic scheme: after map, the data is
immediately fully partitioned, and each partition is then processed
independently. A similar scheme can be configured in Plasma by setting
enhanced_mapping to the number of partitions. This scheme is
problematic when partition sizes differ widely. In contrast, Plasma
provides more flexibility in choosing the layout of the scheme. In
particular, the division of the data into partitions can be delayed
until data files are merged. This is especially advantageous for
larger numbers of partitions.
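
The difference between the two layouts can be illustrated with a small
in-memory model. This is only a sketch under stated assumptions: the
names record, partition_of, split, eager, sorted_runs and delayed are
hypothetical and not part of the Plasma or Hadoop APIs, the hash-based
partitioning rule is an assumption, and real jobs work on files rather
than lists.

```ocaml
(* Toy in-memory model. None of these names belong to the Plasma API. *)

type record = string * int  (* key, value *)

(* Assumed partitioning rule: hash of the key modulo the partition count. *)
let partition_of ~partitions ((key, _) : record) =
  Hashtbl.hash key mod partitions

(* Distribute records over buckets, preserving their relative order. *)
let split ~partitions (records : record list) : record list array =
  let buckets = Array.make partitions [] in
  List.iter
    (fun r ->
       let p = partition_of ~partitions r in
       buckets.(p) <- r :: buckets.(p))
    (List.rev records);
  buckets

(* Eager scheme: partition fully right after map, then sort each partition. *)
let eager ~partitions mapped =
  Array.map (List.sort compare) (split ~partitions mapped)

(* Cut a list into sorted runs of at most [run_size] records. *)
let rec sorted_runs ~run_size = function
  | [] -> []
  | records ->
    let rec take n = function
      | x :: rest when n > 0 ->
        let (run, tail) = take (n - 1) rest in
        (x :: run, tail)
      | rest -> ([], rest)
    in
    let (run, tail) = take run_size records in
    List.sort compare run :: sorted_runs ~run_size tail

(* Delayed scheme: sort and merge runs first, partition only at the end. *)
let delayed ~partitions ~run_size mapped =
  if run_size <= 0 then invalid_arg "delayed: run_size must be positive";
  let merged =
    List.fold_left (List.merge compare) [] (sorted_runs ~run_size mapped) in
  split ~partitions merged
```

Both functions return the same sorted partitions for the same input.
They differ only in when the split happens: eager creates every
partition up front, while delayed handles runs of a fixed size and
touches the partitions only once the runs have been merged.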

Feature                           PlasmaMR                    Hadoop
------------------------------------------------------------------------------------
Support for practically           yes                         yes
  unlimited data volume
Tasks use node-local temp files   no (except log files)       yes (files are written
                                                              to local file system)
Tasks use node-local DFS blocks   yes (temp data are written  no
                                  to DFS so that blocks are
                                  stored on the same node)
Jobs can crash when partitions    no                          yes
  become too large
Jobs can crash when too many      no                          yes
  datanodes fill up to 100%
Support for repeating tasks on    not yet (but planned)       yes
  different nodes after a crash   (currently a crashing task
                                  crashes the job)
Support for small blocksizes      yes (tasks can process      no
                                  compressed block lists)
Tasks are run on the node with    yes                         yes
  highest data locality
Support for compound map tasks    yes                         no
  reading from unrelated file
  blocks, thus keeping the
  number of map tasks low
Speculative task execution        not yet (but planned)       yes
Support for "racks" (groups of    no                          yes
  machines at the same switch)
Fast job startup                  less than a second          no
Primary API                       OCaml                       Java
Streaming interface (usable       yes                         yes
  from any language)
Supported DFS                     PlasmaFS only               HDFS and a few more
Counters                          no                          yes
Web interface                     no                          yes
Queueing of jobs                  no                          yes
Log files are written to DFS      yes                         no