Class type Mapred_def.mapred_job


class type mapred_job = object .. end

method name : string
A name that identifies this job, e.g. in log files
method input_dir : string
The Plasma directory that contains the input files
method output_dir : string
The Plasma directory that receives the output files. It should already exist and be empty
method work_dir : string
The Plasma directory used for temporary files. It should already exist and be empty
method check_config : mapred_env -> unit
Checks the configuration. If it is not OK, this method can raise an exception to stop everything
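
A minimal sketch of the kind of validation check_config might perform. It is a stand-alone function, not the real method: the actual method receives a mapred_env, and the labelled arguments used here are hypothetical substitutes for values the job object would supply itself.

```ocaml
(* Hypothetical stand-alone check; the real method has type
   mapred_env -> unit and would read these values from the job. *)
let check_config ~(partitions : int) ~(map_tasks : int) : unit =
  (* Raising an exception signals that the job must not start. *)
  if partitions <= 0 then invalid_arg "partitions must be positive";
  if map_tasks <= 0 then invalid_arg "map_tasks must be positive"
```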
method map : mapred_env -> int -> Mapred_io.record_reader -> Mapred_io.record_writer -> unit
The mapper reads records from the reader, maps them, and writes the results to the writer, i.e. into a second file. The int argument is the map_id.
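
The read, map, write loop can be sketched as follows. The reader and writer class types below are hypothetical in-memory stand-ins, not Mapred_io.record_reader and Mapred_io.record_writer; the method names input_record and output_record and the use of End_of_file to signal the end of input are assumptions made for this illustration. The mapred_env argument is omitted.

```ocaml
(* Hypothetical stand-ins for the Mapred_io reader and writer types. *)
class type reader = object method input_record : unit -> string end
class type writer = object method output_record : string -> unit end

(* In-memory reader that raises End_of_file when exhausted (assumption). *)
let reader_of_list (records : string list) : reader =
  let q = Queue.create () in
  List.iter (fun r -> Queue.add r q) records;
  object
    method input_record () =
      if Queue.is_empty q then raise End_of_file else Queue.take q
  end

(* In-memory writer that collects records in reverse order. *)
let writer_to_ref (out : string list ref) : writer =
  object method output_record r = out := r :: !out end

(* Example mapper: upper-cases every record. The map_id is unused here. *)
let map (_map_id : int) (r : reader) (w : writer) : unit =
  try
    while true do
      w#output_record (String.uppercase_ascii (r#input_record ()))
    done
  with End_of_file -> ()
```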
method map_tasks : int
The maximum number of map tasks. The framework tries to reach this number, but it may not be possible to generate that many map tasks.

Right now this is also the number of sort tasks. It should be chosen so that every sort can be performed in RAM.
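
One way to reason about this constraint, as a rough sketch only: if the input is spread evenly over the sort tasks, each sort handles about total size divided by map_tasks, so map_tasks should be at least the total size divided by the RAM available per sort, rounded up. The function name and both parameters are hypothetical, and the even-split assumption is not stated in the documentation.

```ocaml
(* Hypothetical helper: smallest task count such that an evenly split
   input fits into ram_per_sort for each sort. Both sizes use the same unit. *)
let min_tasks_for_ram ~(total_size : int64) ~(ram_per_sort : int64) : int =
  (* Ceiling division: (a + b - 1) / b *)
  Int64.to_int
    (Int64.div (Int64.add total_size (Int64.pred ram_per_sort)) ram_per_sort)
```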

method sort_limit : int64
The maximum size of a data chunk that can be sorted in memory
method merge_limit : int
The maximum number of files that a shuffle task merges
method split_limit : int
The maximum number of files that a shuffle task creates
method extract_key : mapred_env -> string -> string
Extracts the key from a record
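
A minimal sketch of a key extractor, assuming (hypothetically) that each record is a line of the form key, tab character, value. The record layout is an assumption for illustration, and the mapred_env argument of the real method is omitted.

```ocaml
(* Returns the text before the first tab, or the whole record
   if it contains no tab. *)
let extract_key (record : string) : string =
  match String.index_opt record '\t' with
  | Some i -> String.sub record 0 i
  | None -> record
```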
method partitions : int
The number of partitions, which is also the number of reduce tasks
method partition_of_key : mapred_env -> string -> int
Determines the partition of a key. This can be something simple like fun k -> (Hashtbl.hash k) mod partitions, or something more elaborate.
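
The simple hash-based variant mentioned above, written out as a stand-alone sketch. The value of partitions is an arbitrary example, and the mapred_env argument of the real method is omitted.

```ocaml
(* Example value; in a real job this comes from the partitions method. *)
let partitions = 4

(* Hashtbl.hash returns a non-negative int, so the result always
   lies in the range 0 .. partitions - 1. *)
let partition_of_key (key : string) : int =
  (Hashtbl.hash key) mod partitions
```

Because the hash depends only on the key, all records with the same key land in the same partition.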
method reduce : mapred_env -> int -> Mapred_io.record_reader -> Mapred_io.record_writer -> unit
The reducer reads all records of one partition and writes them to an output file. The int argument is the partition.
This web site is published by Informatikbüro Gerd Stolpmann