Module Mapred_toolkit.DSeq

module DSeq: sig .. end

The following operations are distributed over the task nodes.

It is here required that the underlying stores are files! It is not possible to distribute operations on notebooks.

type config

val create_config : ?name:string ->
       ?task_files:string list ->
       ?bigblock_size:int ->
       ?map_tasks:int ->
       ?merge_limit:int ->
       ?split_limit:int ->
       ?partitions:int ->
       ?enhanced_mapping:int ->
       ?phases:Mapred_def.phases ->
       ?report:bool ->
       ?report_to:Netchannels.out_obj_channel ->
       ?keep_temp_files:bool -> Mapred_def.mapred_env -> config

The default values for the configurations are taken from mapred_env, and are effectively taken from the .conf files. It is possible, though, to override the values.

val get_rc : config -> Mapred_io.record_config

Returns the record config.

Another way for getting the record config is Mapred_def.get_rc.

type 'a result

The result of a distributed operation

val get_result : 'a result -> 'a

Get the computed value (or an exception)

val stats : 'a result -> Mapred_stats.stats

Get statistics

val job_id : config -> string

Return the job ID

class type mapred_info = object .. end

Access to the job environment and the job configuration

val mapl : (mapred_info -> 'a -> 'b list) Mapred_rfun.rfun ->
       'a Mapred_toolkit.Place.t ->
       'b Mapred_toolkit.Place.t ->
       config ->
       ('b, [ `W ]) Mapred_toolkit.Seq.seq list result

mapl m pl_in pl_out conf: Runs a map-only job. This means that the records from pl_in are piped through the function m, and the result is written into new files in pl_out.

The created files are also returned in the output sequences.

val mapl_sort_fold : mapl:(mapred_info -> 'a -> 'b list) Mapred_rfun.rfun ->
       hash:(mapred_info -> 'b -> int) Mapred_rfun.rfun ->
       cmp:(mapred_info -> 'b -> 'b -> int) Mapred_rfun.rfun ->
       initfold:(mapred_info -> int -> 'c) Mapred_rfun.rfun ->
       fold:(mapred_info -> 'c -> 'b -> 'c * 'd list)
            Mapred_rfun.rfun ->
       ?finfold:(mapred_info -> 'c -> 'd list) Mapred_rfun.rfun ->
       partition_of:(mapred_info -> 'b -> int) Mapred_rfun.rfun ->
       ?initcombine:(mapred_info -> 'e) Mapred_rfun.rfun ->
       ?combine:(mapred_info -> 'e -> 'b -> 'e * 'b list)
                Mapred_rfun.rfun ->
       ?fincombine:(mapred_info -> 'e -> 'b list)
                   Mapred_rfun.rfun ->
       'a Mapred_toolkit.Place.t ->
       'd Mapred_toolkit.Place.t ->
       config ->
       'b Mapred_toolkit.Place.codec ->
       ('d, [ `W ]) Mapred_toolkit.Seq.seq list result

mapl_sort_fold <args> pl_in pl_out conf int_codec: This is map/reduce. The records from pl_in are mapped/sorted/reduced and finally written into new files in pl_out. There are a number of named arguments defining the job:

mapl maps the elements of the inputs
hash returns the hash integer required for sorting (see below)
cmp compares two mapped elements
initfold initializes a reducer (the int argument is the partition number)
fold accu x processes the record x, and returns (accu',out) where out is a list of records to output
finfold is called at the end of a reducer
partition_of returns the partition number of a mapped record
initcombine initializes a combiner
combine accu x processes the record x in the combiner, and returns (accu',out) where out is a list of records to output. It is required that initcombine is also set if combine is used.
fincombine is called at the end of a combiner

Note that the mapped elements are sorted by first comparing the integers returned by hash, and only if such integers are equal, the two elements are compared in detail by calling cmp. See Mapred_sorters for useful definitions of hash and cmp.

The int_codec is used for representing intermediate files (output of the map phase, and input/output of the shuffle phases).

This web site is published by Informatikbüro Gerd Stolpmann

Plasma	GitLab	Archive
Projects	Blog	Knowledge