module Mapred_streaming: Support for streaming
The following additional job configs are interpreted:
map_exec: The command to execute for mapping. This command is run on the task node.
reduce_exec: The command to execute for reducing. This command is run on the task node.
extract_mode: How to split a line into keys and values. This mode is only applied to lines written by the map command. Possible values are:
key: The whole line is taken as key. This is the default.
key_tab_value: The field before the first TAB is taken as key, and the rest of the line as value.
key_tab_partition_tab_value: The field before the first TAB is taken as key. The field between the first and the second TAB is taken as partition number (decimal number). The rest of the line is taken as value.
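The three modes can be illustrated with a small OCaml sketch. It is not the library's implementation: the type and function names are hypothetical, and the behaviour for lines with missing TABs or an unparsable partition number is an assumption made only for this example.

```ocaml
(* Hypothetical names; only the three mode names come from the text above. *)
type extract_mode = Key | Key_tab_value | Key_tab_partition_tab_value

type record = { key : string; partition : int option; value : string }

(* Split [s] at the first TAB, if there is one. *)
let split_tab s =
  match String.index_opt s '\t' with
  | None -> (s, None)
  | Some i ->
      (String.sub s 0 i, Some (String.sub s (i + 1) (String.length s - i - 1)))

(* Assumption: a missing value becomes "" and a missing or unparsable
   partition becomes None. The real module may treat these cases differently. *)
let extract mode line =
  match mode with
  | Key -> { key = line; partition = None; value = "" }
  | Key_tab_value ->
      let key, rest = split_tab line in
      { key; partition = None; value = Option.value rest ~default:"" }
  | Key_tab_partition_tab_value ->
      let key, rest = split_tab line in
      (match rest with
       | None -> { key; partition = None; value = "" }
       | Some r ->
           let p, v = split_tab r in
           { key;
             partition = int_of_string_opt p;
             value = Option.value v ~default:"" })
```

Note that in the last two modes only the first TAB (or the first two TABs) is significant; any further TABs stay part of the value.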
The job config task_files is very useful for installing the executable for the map and reduce commands on the task nodes. E.g.:
    task_files = "my_command";
    map_exec = "./my_command -map arg1 arg2 ...";
    reduce_exec = "./my_command -reduce arg1 arg2 ...";
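A command of this kind could be sketched in OCaml as a word counter. This is an illustration only and makes several assumptions that the text here does not confirm: that input lines arrive on stdin, that output lines are written to stdout, and that the reduce command receives "key TAB value" lines. The function names are hypothetical.

```ocaml
(* Map step: emit one "word TAB 1" line per word, which suits
   extract_mode = key_tab_value. *)
let map_line line =
  String.split_on_char ' ' line
  |> List.filter (fun w -> w <> "")
  |> List.map (fun w -> w ^ "\t1")

(* Reduce step: sum the counts per key. Lines without a TAB are skipped.
   The result is sorted by key and does not rely on the input order. *)
let reduce_lines lines =
  let counts = Hashtbl.create 16 in
  List.iter
    (fun line ->
      match String.index_opt line '\t' with
      | None -> ()
      | Some i ->
          let key = String.sub line 0 i in
          let v = String.sub line (i + 1) (String.length line - i - 1) in
          let n = Option.value (int_of_string_opt v) ~default:0 in
          let old = Option.value (Hashtbl.find_opt counts key) ~default:0 in
          Hashtbl.replace counts key (old + n))
    lines;
  Hashtbl.fold (fun k n acc -> (k ^ "\t" ^ string_of_int n) :: acc) counts []
  |> List.sort compare

(* Read all lines from a channel; kept simple, so everything is held in memory. *)
let read_lines ic =
  let rec loop acc =
    match input_line ic with
    | line -> loop (line :: acc)
    | exception End_of_file -> List.rev acc
  in
  loop []

(* Dispatch on the first argument, mirroring the -map / -reduce flags
   used in the config example. Without arguments nothing happens. *)
let () =
  if Array.length Sys.argv > 1 then begin
    match Sys.argv.(1) with
    | "-map" ->
        List.iter (fun l -> List.iter print_endline (map_line l))
          (read_lines stdin)
    | "-reduce" -> List.iter print_endline (reduce_lines (read_lines stdin))
    | other -> prerr_endline ("unknown mode: " ^ other); exit 2
  end
```

Keeping the map and reduce logic in pure functions makes it possible to check them locally before the executable is shipped to the task nodes.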
The working directory when starting the command is exactly the directory where the files are installed.
The following environment variables are also set:
PLASMAMR_LOCAL_DIR: The local directory
PLASMAMR_LOCAL_LOG_DIR: The log directory. Files whose names begin with PLASMAMR_TASK_PREFIX are immediately moved to the PlasmaFS log directory when the task is finished. Files with other names are also moved, but only when the job finishes, because it cannot be tracked which task created them.
PLASMAMR_REQ_ID: The request ID of the task
PLASMAMR_PARTITION: The partition (only reduce)
PLASMAMR_NAME: The job name
PLASMAMR_JOB_ID: The job ID
PLASMAMR_INPUT_DIR: The input directory in PlasmaFS
PLASMAMR_OUTPUT_DIR: The output directory in PlasmaFS
PLASMAMR_WORK_DIR: The work directory in PlasmaFS
PLASMAMR_LOG_DIR: The log directory in PlasmaFS
PLASMAMR_BIGBLOCK_SIZE: The size of bigblocks
PLASMAMR_PARTITIONS: The number of partitions
PLASMAMR_CONF: The task server configuration file (use this file to restore Mapred_config.mapred_config fully if needed). The job-specific settings from Mapred_def.mapred_job_config cannot be retrieved from here, though.
PLASMAFS_CLUSTER: The name of the PlasmaFS cluster
PLASMAFS_NAMENODES: The list of namenodes
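A streaming command written in OCaml might collect some of these variables as follows. This is a minimal sketch: the record type and function names are hypothetical, and it assumes that PLASMAMR_PARTITION and PLASMAMR_PARTITIONS hold decimal integers.

```ocaml
(* Hypothetical record holding a few of the variables listed above. *)
type task_env = {
  job_name : string;        (* PLASMAMR_NAME *)
  partition : int option;   (* PLASMAMR_PARTITION, only set for reduce *)
  partitions : int;         (* PLASMAMR_PARTITIONS *)
  local_log_dir : string;   (* PLASMAMR_LOCAL_LOG_DIR *)
}

(* [lookup] abstracts the environment so that the function can be tried
   outside of a running task; a real command would pass Sys.getenv_opt. *)
let read_task_env lookup =
  let required name =
    match lookup name with
    | Some v -> v
    | None -> failwith ("missing environment variable " ^ name)
  in
  { job_name = required "PLASMAMR_NAME";
    partition = Option.bind (lookup "PLASMAMR_PARTITION") int_of_string_opt;
    (* Assumption: the value is a decimal integer; int_of_string raises
       Failure otherwise. *)
    partitions = int_of_string (required "PLASMAMR_PARTITIONS");
    local_log_dir = required "PLASMAMR_LOCAL_LOG_DIR" }
```

Inside a task this would be called as read_task_env Sys.getenv_opt. Because the partition is only set for reduce tasks, it is modelled as an option rather than as a required value.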
val job : unit -> Mapred_def.mapred_job
The Plasma distribution comes already with a program that runs this