module Mapred_streaming: sig .. end
The following additional job configs are interpreted:
map_exec
: The command to execute for mapping. This command is run on the task node.
reduce_exec
: The command to execute for reducing. This command is run on the task node.
extract_mode
: How to split a line into keys and values. This mode is only applied to lines written by the map command. Possible values are:
  key
  : The whole line is taken as the key. This is the default.
  key_tab_value
  : The field before the first TAB is taken as the key, and the rest of the line as the value.
  key_tab_partition_tab_value
  : The field before the first TAB is taken as the key. The field between the first and the second TAB is taken as the partition number (a decimal number). The rest of the line is taken as the value.
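The three extract_mode values can be illustrated with a small sketch. This is not Plasma's own code. It only restates the splitting rules above in Python. The names Record and split_line are hypothetical. Two behaviours are assumptions, because the text does not specify them: the value is the empty string in key mode, and a line with fewer than two TABs is rejected in key_tab_partition_tab_value mode.

```python
from typing import NamedTuple, Optional


class Record(NamedTuple):
    """Hypothetical container for one parsed map output line."""
    key: str
    partition: Optional[int]  # only set in key_tab_partition_tab_value mode
    value: str


def split_line(line: str, mode: str = "key") -> Record:
    """Split one line written by the map command according to extract_mode."""
    line = line.rstrip("\n")
    if mode == "key":
        # The whole line is the key (assumption: the value is empty).
        return Record(line, None, "")
    if mode == "key_tab_value":
        # Key is the field before the first TAB, the rest is the value.
        key, _, value = line.partition("\t")
        return Record(key, None, value)
    if mode == "key_tab_partition_tab_value":
        # Split at the first two TABs only, so the value may contain TABs.
        parts = line.split("\t", 2)
        if len(parts) < 3:
            # Assumption: the text does not say how such lines are handled.
            raise ValueError("expected key TAB partition TAB value")
        return Record(parts[0], int(parts[1]), parts[2])
    raise ValueError(f"unknown extract_mode: {mode}")
```

For instance, split_line("k\t3\tv", "key_tab_partition_tab_value") yields the key "k", the partition 3, and the value "v".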
The job config task_files is useful for installing the executable for the map and reduce commands on the task nodes. For example:
task_files = "my_command";
map_exec = "./my_command -map arg1 arg2 ...";
reduce_exec = "./my_command -reduce arg1 arg2 ...";
The working directory when starting the command is exactly the directory where the files are installed by the task_files directive.
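As an illustration of what such a my_command executable might look like, here is a minimal word-count sketch in Python. It rests on several assumptions that are not stated in this text: that the commands read records from standard input and write lines to standard output, that the reduce command sees lines with equal keys next to each other, and that the job sets extract_mode to key_tab_value. The function names map_lines and reduce_lines are hypothetical.

```python
#!/usr/bin/env python3
import sys
from itertools import groupby
from typing import Iterable, Iterator, List


def map_lines(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' line per word (key_tab_value layout)."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines: Iterable[str]) -> Iterator[str]:
    """Sum the counts of consecutive lines that share the same key."""
    # Assumption: equal keys arrive adjacent to each other.
    pairs = (line.rstrip("\n").partition("\t") for line in lines)
    for key, group in groupby(pairs, key=lambda p: p[0]):
        total = sum(int(value or 0) for _, _, value in group)
        yield f"{key}\t{total}"


def main(argv: List[str]) -> int:
    """Dispatch on the -map / -reduce flag used in the job config above."""
    if len(argv) < 2 or argv[1] not in ("-map", "-reduce"):
        print("usage: my_command (-map|-reduce) [args ...]", file=sys.stderr)
        return 2
    step = map_lines if argv[1] == "-map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
    return 0


# Only act when a mode flag was given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

A single executable that dispatches on its first argument matches the task_files example, where one installed file serves both map_exec and reduce_exec.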
The following environment variables are also set:
PLASMAMR_LOCAL_DIR
: The local directory
PLASMAMR_LOCAL_LOG_DIR
: The log directory. Files whose names begin with PLASMAMR_TASK_PREFIX are moved to the PlasmaFS log directory immediately when the task finishes. Files with other names are also moved, but only when the job finishes, because it cannot be tracked which task created them.
PLASMAMR_REQ_ID
: The request ID of the task
PLASMAMR_PARTITION
: The partition (reduce tasks only)
PLASMAMR_NAME
: The job name
PLASMAMR_JOB_ID
: The job ID
PLASMAMR_INPUT_DIR
: The input directory in PlasmaFS
PLASMAMR_OUTPUT_DIR
: The output directory in PlasmaFS
PLASMAMR_WORK_DIR
: The work directory in PlasmaFS
PLASMAMR_LOG_DIR
: The log directory in PlasmaFS
PLASMAMR_BIGBLOCK_SIZE
: The size of bigblocks
PLASMAMR_PARTITIONS
: The number of partitions
PLASMAMR_CONF
: The task server configuration file (use this file to restore Mapred_config.mapred_config fully if needed). The job-specific settings from Mapred_def.mapred_job_config cannot be retrieved from here, though.
PLASMAFS_CLUSTER
: The name of the PlasmaFS cluster
PLASMAFS_NAMENODES
: The list of namenodes

val job : unit -> Mapred_def.mapred_job
The Plasma distribution already includes a program that runs this job via Mapred_main: mr_streaming
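The environment variables listed above can be read from within a map or reduce command. The following Python sketch shows one way to do so. The helper names are hypothetical. It assumes that PLASMAMR_TASK_PREFIX is itself exported as an environment variable (the text suggests this but does not list it separately) and that PLASMAMR_PARTITION holds a decimal integer.

```python
import os
from typing import Dict, Mapping, Optional


def task_context(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect the PLASMAMR_* and PLASMAFS_* variables that are set."""
    prefixes = ("PLASMAMR_", "PLASMAFS_")
    return {k: v for k, v in env.items() if k.startswith(prefixes)}


def partition_number(env: Mapping[str, str] = os.environ) -> Optional[int]:
    """Return the partition, or None when it is not set (e.g. in map tasks)."""
    raw = env.get("PLASMAMR_PARTITION")
    # Assumption: the value is a decimal integer.
    return int(raw) if raw else None


def task_log_path(suffix: str, env: Mapping[str, str] = os.environ) -> str:
    """Build a log file path whose name begins with the task prefix.

    Files named this way are the ones described above as being moved
    to the PlasmaFS log directory as soon as the task finishes.
    """
    # Assumption: PLASMAMR_TASK_PREFIX is available as an environment variable.
    name = env["PLASMAMR_TASK_PREFIX"] + suffix
    return os.path.join(env["PLASMAMR_LOCAL_LOG_DIR"], name)
```

Passing the environment as a parameter keeps the helpers testable outside a task node, where none of these variables exist.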