Plasma GitLab Archive
Projects Blog Knowledge

Module Mapred_toolkit.DSeq


module DSeq: sig .. end


The following operations are distributed over the task nodes.

It is here required that the underlying stores are files! It is not possible to distribute operations on notebooks.

type config 
val create_config : ?name:string ->
?task_files:string list ->
?bigblock_size:int ->
?map_tasks:int ->
?merge_limit:int ->
?split_limit:int ->
?partitions:int ->
?enhanced_mapping:int ->
?phases:Mapred_def.phases ->
?report:bool ->
?report_to:Netchannels.out_obj_channel ->
?keep_temp_files:bool -> Mapred_def.mapred_env -> config
The default values for the configurations are taken from mapred_env, and are effectively taken from the .conf files. It is possible, though, to override the values.
val get_rc : config -> Mapred_io.record_config
Returns the record config.

Another way for getting the record config is Mapred_def.get_rc.

type 'a result 
The result of a distributed operation
val get_result : 'a result -> 'a
Get the computed value (or an exception)
val stats : 'a result -> Mapred_stats.stats
Get statistics
val job_id : config -> string
Return the job ID
class type mapred_info = object .. end
Access to the job environment and the job configuration
val mapl : (mapred_info -> 'a -> 'b list) Mapred_rfun.rfun ->
'a Mapred_toolkit.Place.t ->
'b Mapred_toolkit.Place.t ->
config ->
('b, [ `W ]) Mapred_toolkit.Seq.seq list result
mapl m pl_in pl_out conf: Runs a map-only job. This means that the records from pl_in are piped through the function m, and the result is written into new files in pl_out.

The created files are also returned in the output sequences.

val mapl_sort_fold : mapl:(mapred_info -> 'a -> 'b list) Mapred_rfun.rfun ->
hash:(mapred_info -> 'b -> int) Mapred_rfun.rfun ->
cmp:(mapred_info -> 'b -> 'b -> int) Mapred_rfun.rfun ->
initfold:(mapred_info -> int -> 'c) Mapred_rfun.rfun ->
fold:(mapred_info -> 'c -> 'b -> 'c * 'd list)
Mapred_rfun.rfun ->
?finfold:(mapred_info -> 'c -> 'd list) Mapred_rfun.rfun ->
partition_of:(mapred_info -> 'b -> int) Mapred_rfun.rfun ->
?initcombine:(mapred_info -> 'e) Mapred_rfun.rfun ->
?combine:(mapred_info -> 'e -> 'b -> 'e * 'b list)
Mapred_rfun.rfun ->
?fincombine:(mapred_info -> 'e -> 'b list)
Mapred_rfun.rfun ->
'a Mapred_toolkit.Place.t ->
'd Mapred_toolkit.Place.t ->
config ->
'b Mapred_toolkit.Place.codec ->
('d, [ `W ]) Mapred_toolkit.Seq.seq list result
mapl_sort_fold <args> pl_in pl_out conf int_codec: This is map/reduce. The records from pl_in are mapped/sorted/reduced and finally written into new files in pl_out. There are a number of named arguments defining the job:

  • mapl maps the elements of the inputs
  • hash returns the hash integer required for sorting (see below)
  • cmp compares two mapped elements
  • initfold initializes a reducer (the int argument is the partition number)
  • fold accu x processes the record x, and returns (accu',out) where out is a list of records to output
  • finfold is called at the end of a reducer
  • partition_of returns the partition number of a mapped record
  • initcombine initializes a combiner
  • combine accu x processes the record x in the combiner, and returns (accu',out) where out is a list of records to output. It is required that initcombine is also set if combine is used.
  • fincombine is called at the end of a combiner
Note that the mapped elements are sorted by first comparing the integers returned by hash, and only if such integers are equal, the two elements are compared in detail by calling cmp. See Mapred_sorters for useful definitions of hash and cmp.

The int_codec is used for representing intermediate files (output of the map phase, and input/output of the shuffle phases).

This web site is published by Informatikbüro Gerd Stolpmann
Powered by Caml