
Module Mapred_streaming

module Mapred_streaming: sig .. end
Support for streaming

Streaming means that the task server does not execute the tasks internally, but starts subprocesses for this purpose. These processes can read their input data from stdin, and must write their output data to stdout.

The following additional job configs are interpreted:

  • map_exec: The command to execute for mapping. This command is run on the task node.
  • reduce_exec: The command to execute for reducing. This command is run on the task node.
  • extract_mode: How to split a line into keys and values. This mode is only applied to lines written by the map command. Possible values are:
    • key: The whole line is taken as key. This is the default.
    • key_tab_value: The field before the first TAB is taken as key, and the rest of the line as value.
    • key_tab_partition_tab_value: The field before the first TAB is taken as key. The field between the first and the second TAB is taken as partition number (decimal number). The rest of the line is taken as value.
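
The three modes can be illustrated with a small OCaml sketch. This is not the code used by Mapred_streaming; the type and function names (extracted, extract_mode, split_at_tab, extract) are hypothetical, and the handling of lines that contain fewer TABs than the mode expects is an assumption, since the description above does not cover that case.

```ocaml
(* Hypothetical illustration of the three extract_mode settings.
   Not taken from the Plasma sources. *)
type extracted = {
  key : string;
  partition : int option;  (* only set in key_tab_partition_tab_value mode *)
  value : string;
}

type extract_mode = Key | Key_tab_value | Key_tab_partition_tab_value

(* Split [s] at the first TAB. Assumption: without a TAB, the whole
   string is the first part and the second part is empty. *)
let split_at_tab s =
  match String.index_opt s '\t' with
  | None -> (s, "")
  | Some i ->
      (String.sub s 0 i, String.sub s (i + 1) (String.length s - i - 1))

let extract mode line =
  match mode with
  | Key ->
      (* The whole line is the key, TABs included. *)
      { key = line; partition = None; value = "" }
  | Key_tab_value ->
      let key, value = split_at_tab line in
      { key; partition = None; value }
  | Key_tab_partition_tab_value ->
      let key, rest = split_at_tab line in
      let part, value = split_at_tab rest in
      (* Assumption: an unparsable partition field yields None. *)
      { key; partition = int_of_string_opt part; value }
```

Note that in key_tab_value mode only the first TAB separates key and value; any further TABs remain part of the value.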

The job config task_files is useful for installing the executable for the map and reduce commands on the task nodes. For example:

       task_files = "my_command";
       map_exec = "./my_command -map arg1 arg2 ...";
       reduce_exec = "./my_command -reduce arg1 arg2 ...";

The working directory when starting the command is exactly the directory where the files are installed by the task_files directive.
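
As a rough idea of what such a command could look like, here is a minimal word-count mapper in OCaml, intended for extract_mode = key_tab_value. It is only a sketch: the names map_line and run are hypothetical, and it assumes that the input arrives on stdin as newline-terminated text lines, which the description above does not spell out.

```ocaml
(* Hypothetical word-count mapper for extract_mode = key_tab_value.
   Assumption: the input on stdin consists of text lines. *)

(* Turn one input line into "word<TAB>1" output lines. *)
let map_line line =
  String.split_on_char ' ' line
  |> List.filter (fun w -> w <> "")
  |> List.map (fun w -> w ^ "\t1")

(* Read lines from [ic] until end of file and write the mapped
   lines to [oc], one per line. *)
let run ic oc =
  (try
     while true do
       List.iter
         (fun l -> output_string oc l; output_char oc '\n')
         (map_line (input_line ic))
     done
   with End_of_file -> ());
  flush oc
```

A real command would end with `let () = run stdin stdout` and could be compiled to an executable that is listed in task_files.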

The following environment variables are also set:

  • PLASMAMR_LOCAL_DIR: The local directory
  • PLASMAMR_LOCAL_LOG_DIR: The log directory. Files whose names begin with PLASMAMR_TASK_PREFIX are immediately moved to the PlasmaFS log directory when the task is finished. Files with other names are also moved, but only when the job finishes, because it cannot be determined which task created them.
  • PLASMAMR_REQ_ID: The request ID of the task
  • PLASMAMR_PARTITION: The partition (only reduce)
  • PLASMAMR_NAME: The job name
  • PLASMAMR_INPUT_DIR: The input directory in PlasmaFS
  • PLASMAMR_OUTPUT_DIR: The output directory in PlasmaFS
  • PLASMAMR_WORK_DIR: The work directory in PlasmaFS
  • PLASMAMR_LOG_DIR: The log directory in PlasmaFS
  • PLASMAMR_BIGBLOCK_SIZE: The size of bigblocks
  • PLASMAMR_PARTITIONS: The number of partitions
  • PLASMAMR_CONF: The task server configuration file (use this file to restore Mapred_config.mapred_config fully if needed). The job-specific settings from Mapred_def.mapred_job_config cannot be retrieved from here, though.
  • PLASMAFS_CLUSTER: The name of the PlasmaFS cluster
  • PLASMAFS_NAMENODES: The list of namenodes

Stderr is redirected to a log file.

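
A command written in OCaml might read these variables as sketched below. The helper names partition_of_env and task_log_path are hypothetical. The sketch also assumes that PLASMAMR_TASK_PREFIX is available as an environment variable holding the file name prefix; the list above mentions the name but does not confirm this, so treat that part as a guess. The lookup function is passed as a parameter so that the helpers can be tried without a running task server.

```ocaml
(* Hypothetical helpers for reading the task environment.
   [getenv] is a lookup function such as Sys.getenv_opt. *)

(* Parse PLASMAMR_PARTITION; None if unset (e.g. in a map task)
   or not a number. *)
let partition_of_env getenv =
  match getenv "PLASMAMR_PARTITION" with
  | None -> None
  | Some s -> int_of_string_opt (String.trim s)

(* Build a path for a per-task log file in PLASMAMR_LOCAL_LOG_DIR.
   Assumption: PLASMAMR_TASK_PREFIX is an environment variable that
   contains the prefix; if it is unset, no prefix is added. *)
let task_log_path getenv basename =
  match getenv "PLASMAMR_LOCAL_LOG_DIR" with
  | None -> None
  | Some dir ->
      let prefix =
        Option.value (getenv "PLASMAMR_TASK_PREFIX") ~default:"" in
      Some (Filename.concat dir (prefix ^ basename))
```

Inside a streaming command, the helpers would be called as `partition_of_env Sys.getenv_opt` and `task_log_path Sys.getenv_opt "debug.log"`.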
val job : unit -> Mapred_def.mapred_job
The streaming job.

The Plasma distribution already includes a program that runs this job via Mapred_main: mr_streaming

This web site is published by Informatikbüro Gerd Stolpmann