class type mapred_job = object
..end
method custom_params : string list
method check_config : mapred_env -> mapred_job_config -> unit
method pre_job_start : mapred_env -> mapred_job_config -> unit
method post_job_finish : mapred_env -> mapred_job_config -> unit
method input_record_io : mapred_env ->
mapred_job_config -> Mapred_io.record_rw_factory
method output_record_io : mapred_env ->
mapred_job_config -> Mapred_io.record_rw_factory
method internal_record_io : mapred_env ->
mapred_job_config -> Mapred_io.record_rw_factory
method map : mapred_env ->
mapred_job_config ->
task_info ->
Mapred_io.record_reader -> Mapred_io.record_writer -> unit
method sorter : mapred_env ->
mapred_job_config -> float -> sorter
See Mapred_sorters. The float is the factor for the sort buffer,
and it should be between 0.0 and 1.0.
method extract_key : mapred_env -> mapred_job_config -> string -> int * int
Returns the position of the key within a record as a pair (index,len).
Here, index is the byte in the record where the key starts, and len
is the length of the key in bytes.
This method is always called by first evaluating
let f = job#extract_key me jc
and then calling f line for each input line. Because of this, it is
possible to factor initializations out, as in
method extract_key me jc =
  ...; (* init stuff *)
  (fun line -> ... (* real extraction *) )
Before Plasma-0.6, extract_key returned the key directly as a string.
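The calling convention described above can be sketched outside of Plasma as follows. This is only an illustration: env and config are stand-in types (the real mapred_env and mapred_job_config are not reproduced here), the key_sep field is hypothetical, and the assumption that the key is everything before the first separator is ours, not Plasma's.

```ocaml
(* Stand-ins for mapred_env and mapred_job_config (assumptions). *)
type env = unit
type config = { key_sep : char }  (* hypothetical field *)

(* Counts how often the initialization part runs. *)
let init_count = ref 0

let extract_key (_me : env) (jc : config) : string -> int * int =
  (* init stuff: evaluated once, when the first two arguments are applied *)
  incr init_count;
  let sep = jc.key_sep in
  (fun line ->
     (* real extraction: the key runs from byte 0 up to the first
        separator, or to the end of the line if there is none *)
     let len =
       match String.index_opt line sep with
       | Some i -> i
       | None -> String.length line in
     (0, len))

let () =
  (* Evaluate the partial application once, then call f per line. *)
  let f = extract_key () { key_sep = '\t' } in
  List.iter
    (fun line ->
       let (index, len) = f line in
       Printf.printf "%S -> key %S\n" line (String.sub line index len))
    [ "apple\t1"; "banana\t2"; "nokey" ]
```

Because the closure is built once per task, anything costly (compiling a pattern, reading a parameter) can live in the initialization part and is not repeated for each line.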
method partition_of_key : mapred_env ->
mapred_job_config -> string -> int -> int -> int
partition_of_key me jc s p l:
Determines the partition of a key (which is supposed to occupy the
range p to p+l-1 of s). Can be something simple like
(Hashtbl.hash key) mod partitions
or something more elaborate. This method is always called by first
evaluating
let f = job#partition_of_key me jc
and then calling f s p l for each input line. Because of this, it is
possible to factor initializations out, as in
method partition_of_key me jc =
  ...; (* init stuff *)
  (fun s p l -> ... (* real partitioning *) )
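A minimal sketch of the simple hash-based variant mentioned above, written against stand-in types rather than the real Plasma ones. The config record and its partitions field are assumptions made for the example, as is the choice to copy the key out of s before hashing.

```ocaml
(* Stand-ins for mapred_env and mapred_job_config (assumptions). *)
type env = unit
type config = { partitions : int }  (* hypothetical field *)

let partition_of_key (_me : env) (jc : config) : string -> int -> int -> int =
  (* init stuff: read and validate the partition count once *)
  let n = jc.partitions in
  if n <= 0 then invalid_arg "partition_of_key: partitions must be positive";
  (fun s p l ->
     (* real partitioning: hash the key occupying bytes p .. p+l-1 of s.
        Hashtbl.hash is non-negative, so the result is in 0 .. n-1. *)
     let key = String.sub s p l in
     (Hashtbl.hash key) mod n)

let () =
  let f = partition_of_key () { partitions = 4 } in
  List.iter
    (fun line ->
       (* For the demo, the key is everything before the first tab. *)
       let l =
         match String.index_opt line '\t' with
         | Some i -> i
         | None -> String.length line in
       Printf.printf "%S -> partition %d\n" (String.sub line 0 l) (f line 0 l))
    [ "apple\t1"; "banana\t2"; "apple\t3" ]
```

Equal keys always land in the same partition, regardless of where the key sits inside the record, because only the bytes in the range p to p+l-1 are hashed.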
method reduce : mapred_env ->
mapred_job_config ->
task_info ->
Mapred_io.record_reader -> Mapred_io.record_writer -> unit
method combine : mapred_env ->
mapred_job_config ->
task_info ->
(Mapred_io.record_reader -> Mapred_io.record_writer -> unit) option
Note that Plasma allows the combiner to receive data from several partitions!
If no combiner is needed, just define this method as
method combine _ _ _ = None
In this case, the internal shuffles just copy the input to the output.