Plasmamr_and_hadoop

Feature comparison of PlasmaMR and Hadoop

Hadoop is a popular map/reduce platform written in Java.

Hadoop is restricted to one algorithmic scheme: after the map phase, the data is immediately and fully partitioned, and each partition is then processed independently. A similar scheme can be configured in Plasma by setting enhanced_mapping to the number of partitions. This scheme is problematic when the partitions differ greatly in size. Plasma, in contrast, is more flexible about how the scheme is laid out. In particular, the division of the data into partitions can be delayed until the data files are merged, which is especially advantageous when the number of partitions is high.
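The difference between the two schemes can be illustrated with a toy sketch. This is not code from Plasma or Hadoop: the function names, the in-memory lists standing in for data files, and the first-character partitioner are all assumptions made for illustration. The first function partitions the map output immediately; the second keeps sorted runs unpartitioned and only divides the records while the runs are being merged.

```python
import heapq
from collections import defaultdict


def partition_of(key, n_partitions):
    # Hypothetical partitioner: first character of the key modulo the partition count.
    return ord(key[0]) % n_partitions


def eager_partition(mapped, n_partitions):
    """Partition the map output right away, then sort each partition on its own."""
    parts = defaultdict(list)
    for key, value in mapped:
        parts[partition_of(key, n_partitions)].append((key, value))
    return {p: sorted(records) for p, records in parts.items()}


def delayed_partition(sorted_runs, n_partitions):
    """Keep sorted runs unpartitioned; divide the records only during the merge."""
    parts = defaultdict(list)
    # heapq.merge yields the records of all runs in globally sorted order.
    for key, value in heapq.merge(*sorted_runs):
        parts[partition_of(key, n_partitions)].append((key, value))
    return dict(parts)


mapped = [("b", 1), ("a", 1), ("c", 1), ("a", 2), ("d", 1), ("b", 2)]
# Two sorted runs stand in for the intermediate data files.
runs = [sorted(mapped[:3]), sorted(mapped[3:])]

# Both schemes end up with the same sorted partitions.
assert eager_partition(mapped, 2) == delayed_partition(runs, 2)
```

Both functions produce the same result; what differs is when the data is split. In the delayed variant the intermediate runs stay independent of the partition count until the final merge.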

Feature                           PlasmaMR                     Hadoop
-------------------------------------------------------------------------------------
Support for practically           yes                          yes
  unlimited data volume

Tasks use node-local temp files   no (except log files)        yes (files are written
                                                               to local file system)

Tasks use node-local DFS blocks   yes (temp data are written   no
                                  to DFS so that blocks are
                                  stored on the same node)

Jobs can crash when partitions    no                           yes
  become too large

Jobs can crash when too many      no                           yes
  datanodes fill up to 100%

Support for repeating tasks on    not yet (but planned;        yes
  different nodes after a crash   currently a crashing
                                  task crashes the job)

Support for small blocksizes      yes (tasks can process       no
                                  compressed block lists)

Tasks are run on the node with    yes                          yes
  highest data locality

Support for compound map tasks    yes                          no
  reading from unrelated file
  blocks, thus keeping the
  number of map tasks low

Speculative task execution        not yet (but planned)        yes

Support for "racks" (groups of    no                           yes
  machines at the same switch)

Fast job startup                  less than a second           no

Primary API                       OCaml                        Java

Streaming interface               yes                          yes
  (usable from any language)

Supported DFS                     PlasmaFS only                HDFS and a few more

Counters                          no                           yes

Web interface                     no                           yes

Queueing of jobs                  no                           yes

Log files are written to DFS      yes                          no

This web site is published by Informatikbüro Gerd Stolpmann
