
Class type Mapred_def.mapred_job


class type mapred_job = object .. end

method custom_params : string list
The list of allowed custom parameters
method check_config : mapred_env -> mapred_job_config -> unit
Checks the configuration. If it is not acceptable, this method can raise an exception to stop everything.
method pre_job_start : mapred_env -> mapred_job_config -> unit
This is run by the job process before the first task is started
method post_job_finish : mapred_env -> mapred_job_config -> unit
This is run by the job process after the last task is finished
method input_record_io : mapred_env ->
mapred_job_config -> Mapred_io.record_rw_factory
How to split the input file into records
method output_record_io : mapred_env ->
mapred_job_config -> Mapred_io.record_rw_factory
How to write the output file from records. This includes all files in the output directory.
method internal_record_io : mapred_env ->
mapred_job_config -> Mapred_io.record_rw_factory
How to represent records for internal files
method map : mapred_env ->
mapred_job_config ->
task_info ->
Mapred_io.record_reader -> Mapred_io.record_writer -> unit
The mapper reads records, maps them, and writes them into a second file.
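The read-map-write loop of a mapper can be sketched without the Plasma libraries. In this minimal sketch, the labeled arguments read and write are hypothetical stand-ins for the Mapred_io.record_reader and Mapred_io.record_writer objects, and the "word<TAB>1" output format is an assumption chosen for illustration only:

```ocaml
(* Stand-ins (assumptions, not the Mapred_io API):
   [read ()] returns the next input record or raises End_of_file;
   [write r] appends one record to the output. *)
let map_words ~read ~write =
  try
    while true do
      let line = read () in
      (* Emit one "word<TAB>1" record per non-empty word of the line. *)
      String.split_on_char ' ' line
      |> List.filter (fun w -> w <> "")
      |> List.iter (fun w -> write (w ^ "\t1"))
    done
  with End_of_file -> ()
```

A real map method would take mapred_env, mapred_job_config and task_info first, and would perform the same loop on the reader and writer objects it is given.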
method sorter : mapred_env ->
mapred_job_config -> float -> sorter
This is normally set to one of the sorters defined in Mapred_sorters. The float is the factor for the sort buffer, and it should be between 0.0 and 1.0.
method extract_key : mapred_env -> mapred_job_config -> string -> int * int
Extracts the key from a record, and returns its position as a pair (index, len). Here, index is the byte offset in the record where the key starts, and len is the length of the key in bytes.

This method is always called by first evaluating let f = job#extract_key me jc, and then calling f line for each input line. Because of this, it is possible to factor initializations out as in

   method extract_key me jc =
     ...; (* init stuff *)
     (fun line -> ...  (* real extraction *) )

Before Plasma-0.6, extract_key returned the key directly as a string.
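As an illustration of the (index, len) convention and of the two-stage calling pattern, here is a minimal standalone sketch. It assumes a hypothetical record layout "key<TAB>value"; the me and jc arguments are ignored, so the function can be tried without the Plasma libraries:

```ocaml
(* Hypothetical record layout: "key<TAB>value". *)
let extract_key _me _jc =
  (* init stuff: evaluated once, when the first two arguments are applied *)
  let sep = '\t' in
  (* real extraction: evaluated once per record *)
  fun line ->
    match String.index_opt line sep with
    | Some i -> (0, i)                  (* key occupies bytes 0 .. i-1 *)
    | None -> (0, String.length line)   (* no separator: whole record is the key *)

let () =
  let record = "apple\t3" in
  let f = extract_key () () in
  let (index, len) = f record in
  print_endline (String.sub record index len)
```

Returning a position instead of a fresh string means the caller can decide whether the key ever needs to be copied out of the record.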

method partition_of_key : mapred_env ->
mapred_job_config -> string -> int -> int -> int
partition_of_key me jc s p l: Determines the partition of a key (which is supposed to occupy the range p to p+l-1 of s). Can be something simple like (Hashtbl.hash key) mod partitions, or something more elaborate.

This method is always called by first evaluating let f = job#partition_of_key me jc, and then calling f s p l for each input line. Because of this, it is possible to factor initializations out as in

   method partition_of_key me jc =
     ...; (* init stuff *)
     (fun s p l -> ...  (* real partitioning *) )
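The simple hash-based variant mentioned above could look like the following sketch. The labeled argument partitions is a hypothetical stand-in for the partition count, which a real job would take from its configuration; me and jc are ignored so that the code runs without the Plasma libraries:

```ocaml
(* [partitions] is an assumption for illustration, not part of the
   method signature documented here. *)
let partition_of_key ~partitions _me _jc =
  (* init stuff: validate the partition count once *)
  if partitions <= 0 then invalid_arg "partition_of_key: partitions";
  (* real partitioning: the key occupies bytes p .. p+l-1 of s *)
  fun s p l ->
    let key = String.sub s p l in
    (* Hashtbl.hash returns a non-negative int, so the result is
       always in the range 0 .. partitions-1 *)
    Hashtbl.hash key mod partitions
```

Because the partition depends only on the key bytes, records with equal keys land in the same partition regardless of what surrounds the key in the record.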

method reduce : mapred_env ->
mapred_job_config ->
task_info ->
Mapred_io.record_reader -> Mapred_io.record_writer -> unit
The reducer reads all the records of one partition, and puts them into an output file.
method combine : mapred_env ->
mapred_job_config ->
task_info ->
(Mapred_io.record_reader -> Mapred_io.record_writer -> unit) option
The optional combiner is called for the internal shuffle passes. The reader gets the already merged input records (i.e. it reads the records in sorted order). The combiner can now shrink the data if possible, and write the result to the writer.

Note that Plasma allows the combiner to receive data from several partitions!

If no combiner is needed, just define this method as

 method combine _ _ _ = None 

In this case, the internal shuffles just copy the input to the output.
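To illustrate what "shrinking" sorted data can mean, here is a minimal sketch that sums the counts of adjacent records with equal keys. It is not written against the Plasma API: a plain list of strings stands in for the reader and the returned list for the writer, and the "key<TAB>count" record format and the names split_record and combine_sorted are assumptions made for this example:

```ocaml
(* Split a hypothetical "key<TAB>count" record; a record without
   a tab is treated as a key with count 1. *)
let split_record r =
  match String.index_opt r '\t' with
  | Some i ->
      (String.sub r 0 i,
       int_of_string (String.sub r (i + 1) (String.length r - i - 1)))
  | None -> (r, 1)

(* Merge runs of adjacent records that share a key by summing their
   counts. The input list is expected to be sorted by key. *)
let combine_sorted records =
  let emit (k, n) acc = (k ^ "\t" ^ string_of_int n) :: acc in
  let rec loop current acc = function
    | [] ->
        List.rev (match current with None -> acc | Some c -> emit c acc)
    | r :: rest ->
        let (k, n) = split_record r in
        (match current with
         | Some (k', n') when k' = k -> loop (Some (k, n' + n)) acc rest
         | Some c -> loop (Some (k, n)) (emit c acc) rest
         | None -> loop (Some (k, n)) acc rest)
  in
  loop None [] records
```

The sketch only compares neighbouring keys and makes no assumption about which partition a record belongs to, so it stays correct when the input spans several partitions.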

This web site is published by Informatikbüro Gerd Stolpmann