Class type Mapred_def.mapred_job

class type mapred_job = object .. end

method name : string

A name for identifying this job, e.g. in log files

method input_dir : string

This plasma directory contains the input files

method output_dir : string

This plasma directory receives the output files. It should exist, and it should be empty.

method work_dir : string

This plasma directory is used for temporary files. It should exist, and it should be empty.

method check_config : mapred_env -> unit

Checks the configuration. If it is not ok, this method can raise an exception to stop the whole job.

method map : mapred_env ->
       int -> Mapred_io.record_reader -> Mapred_io.record_writer -> unit

The mapper reads records, maps them, and writes them into a second file. The int is the map_id.
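The read-map-write loop can be sketched as follows. This is only an illustration: the `reader` and `writer` class types are stand-ins defined for this example, with assumed method names and an assumed `End_of_file` convention; the real `Mapred_io.record_reader` and `Mapred_io.record_writer` may differ. The `mapred_env` argument is omitted, and `map_upper`, `reader_of_list`, and `collecting_writer` are hypothetical names.

```ocaml
(* Stand-in class types for this sketch only; the real Mapred_io
   reader and writer may expose different methods. *)
class type reader = object
  (* Assumed to raise End_of_file when no records are left. *)
  method input_record : unit -> string
end

class type writer = object
  method output_record : string -> unit
end

(* Hypothetical mapper: uppercases every record. The map_id is
   accepted but not used here. *)
let map_upper (_map_id : int) (r : reader) (w : writer) : unit =
  try
    while true do
      w#output_record (String.uppercase_ascii (r#input_record ()))
    done
  with End_of_file -> ()

(* In-memory reader, for trying the mapper without Plasma. *)
let reader_of_list (records : string list) : reader =
  let rest = ref records in
  object
    method input_record () =
      match !rest with
      | [] -> raise End_of_file
      | x :: tl -> rest := tl; x
  end

(* In-memory writer plus a function returning the records written so far. *)
let collecting_writer () : writer * (unit -> string list) =
  let acc = ref [] in
  let w = object
    method output_record s = acc := s :: !acc
  end in
  (w, fun () -> List.rev !acc)
```

Feeding `reader_of_list ["a"; "bc"]` through `map_upper` into a collecting writer yields the records `"A"` and `"BC"` in that order.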

method map_tasks : int

The maximum number of map tasks. The system tries to reach this number, but it may not be possible to generate that many map tasks.

Right now this is also the number of sort tasks. It should be chosen so that every sort can be performed in RAM.

method sort_limit : int64

The maximum size of a data chunk that can be sorted in memory

method merge_limit : int

The maximum number of files merged by one shuffle task

method split_limit : int

The maximum number of files created by one shuffle task

method extract_key : mapred_env -> string -> string

Extracts the key from a record
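As an illustration, a key extractor might look like the sketch below. It assumes a record layout of `key<TAB>value`, which is purely hypothetical; the actual layout is whatever the job defines. The `mapred_env` argument of the real method is left out to keep the example self-contained.

```ocaml
(* Hypothetical record layout: "key<TAB>value". A record without a
   tab is treated as consisting of the key only. *)
let extract_key (record : string) : string =
  match String.index_opt record '\t' with
  | Some i -> String.sub record 0 i
  | None -> record
```

With this definition, `extract_key "apple\t3"` evaluates to `"apple"`, and a record without a tab is returned unchanged.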

method partitions : int

The number of partitions, which is also the number of reduce tasks

method partition_of_key : mapred_env -> string -> int

Determines the partition of a key. This can be something simple like fun k -> (Hashtbl.hash k) mod partitions, or something more elaborate.
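The hash-based variant can be written out as a standalone function. This sketch passes `partitions` as a labeled argument instead of reading it from the job object, and omits the `mapred_env` argument; both choices are assumptions made to keep the example self-contained.

```ocaml
(* Maps a key to a partition number in the range 0 .. partitions-1.
   Hashtbl.hash returns a non-negative int, so the result of mod
   is never negative. *)
let partition_of_key ~(partitions : int) (key : string) : int =
  if partitions <= 0 then
    invalid_arg "partition_of_key: partitions must be positive";
  Hashtbl.hash key mod partitions
```

Equal keys always land in the same partition, which is what allows one reduce task to see all records for a given key.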

method reduce : mapred_env ->
       int -> Mapred_io.record_reader -> Mapred_io.record_writer -> unit

The reducer reads all the records of one partition, and puts them into an output file. The int is the partition.
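A reducer that counts records per key might be sketched as below. Several things here are assumptions: that records with the same key arrive adjacent to each other, that records use a hypothetical `key<TAB>value` layout, and that plain closures `read` and `write` stand in for the reader and writer objects. The `mapred_env` and partition arguments are omitted.

```ocaml
(* Key = text before the first tab (hypothetical record layout). *)
let key_of (record : string) : string =
  match String.index_opt record '\t' with
  | Some i -> String.sub record 0 i
  | None -> record

(* Hypothetical reducer: emits "key<TAB>count" for each run of
   adjacent records sharing a key. [read] is assumed to raise
   End_of_file once the partition is exhausted; [write] appends one
   record to the output. *)
let reduce_count (read : unit -> string) (write : string -> unit) : unit =
  let flush key n = write (Printf.sprintf "%s\t%d" key n) in
  let rec loop current n =
    match read () with
    | record ->
        let k = key_of record in
        (match current with
         | Some c when c = k -> loop current (n + 1)
         | Some c -> flush c n; loop (Some k) 1
         | None -> loop (Some k) 1)
    | exception End_of_file ->
        (* Emit the count for the last run, if there was any input. *)
        (match current with Some c -> flush c n | None -> ())
  in
  loop None 0
```

Only the current key and its running count are held in memory, so the sketch does not need to load the whole partition at once.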

This web site is published by Informatikbüro Gerd Stolpmann
