module DSeq: sig .. end
The underlying stores are required to be files here.
It is not possible to distribute operations on notebooks.
type config
val create_config : ?name:string ->
?task_files:string list ->
?bigblock_size:int ->
?map_tasks:int ->
?merge_limit:int ->
?split_limit:int ->
?partitions:int ->
?enhanced_mapping:int ->
?phases:Mapred_def.phases ->
?report:bool ->
?report_to:Netchannels.out_obj_channel ->
?keep_temp_files:bool -> Mapred_def.mapred_env -> config
The optional arguments default to the values found in mapred_env, and are effectively taken from the .conf files. It is possible, though,
to override the values.
val get_rc : config -> Mapred_io.record_config
Another way of getting the record config is Mapred_def.get_rc.
type 'a result
val get_result : 'a result -> 'a
val stats : 'a result -> Mapred_stats.stats
val job_id : config -> string
class type mapred_info = object .. end
val mapl : (mapred_info -> 'a -> 'b list) Mapred_rfun.rfun ->
'a Mapred_toolkit.Place.t ->
'b Mapred_toolkit.Place.t ->
config ->
('b, [ `W ]) Mapred_toolkit.Seq.seq list result
mapl m pl_in pl_out conf: Runs a map-only job. This means that the records from pl_in are piped through the function m, and the result is written into new files in pl_out.
The created files are also returned in the output sequences.
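The record flow of a map-only job can be pictured with a small in-memory model. This is only a sketch, not the Plasma implementation: mapl_model and words are hypothetical names, plain lists stand in for the places pl_in and pl_out, and the mapred_info argument of the real map function is left out. It assumes OCaml 4.10 or later for List.concat_map.

```ocaml
(* Local, in-memory model of a map-only job (not the Plasma implementation).
   [m] plays the role of the map function: one input record yields a list
   of output records, and all lists are concatenated in input order. *)
let mapl_model (m : 'a -> 'b list) (input : 'a list) : 'b list =
  List.concat_map m input

(* Hypothetical map function: split a line into words, dropping empty ones. *)
let words line =
  List.filter (fun w -> w <> "") (String.split_on_char ' ' line)

let output = mapl_model words [ "a b"; ""; "c  d e" ]
(* output = ["a"; "b"; "c"; "d"; "e"] *)
```

Because the map function returns a list, a single input record may produce zero, one, or several output records.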
val mapl_sort_fold : mapl:(mapred_info -> 'a -> 'b list) Mapred_rfun.rfun ->
hash:(mapred_info -> 'b -> int) Mapred_rfun.rfun ->
cmp:(mapred_info -> 'b -> 'b -> int) Mapred_rfun.rfun ->
initfold:(mapred_info -> int -> 'c) Mapred_rfun.rfun ->
fold:(mapred_info -> 'c -> 'b -> 'c * 'd list)
Mapred_rfun.rfun ->
?finfold:(mapred_info -> 'c -> 'd list) Mapred_rfun.rfun ->
partition_of:(mapred_info -> 'b -> int) Mapred_rfun.rfun ->
?initcombine:(mapred_info -> 'e) Mapred_rfun.rfun ->
?combine:(mapred_info -> 'e -> 'b -> 'e * 'b list)
Mapred_rfun.rfun ->
?fincombine:(mapred_info -> 'e -> 'b list)
Mapred_rfun.rfun ->
'a Mapred_toolkit.Place.t ->
'd Mapred_toolkit.Place.t ->
config ->
'b Mapred_toolkit.Place.codec ->
('d, [ `W ]) Mapred_toolkit.Seq.seq list result
mapl_sort_fold <args> pl_in pl_out conf int_codec: This is map/reduce. The records from pl_in are mapped/sorted/reduced and finally written into new files in pl_out. There are a number of named arguments defining the job:
- mapl: maps the elements of the inputs
- hash: returns the hash integer required for sorting (see below)
- cmp: compares two mapped elements
- initfold: initializes a reducer (the int argument is the partition number)
- fold accu x: processes the record x, and returns (accu',out) where out is a list of records to output
- finfold: is called at the end of a reducer
- partition_of: returns the partition number of a mapped record
- initcombine: initializes a combiner
- combine accu x: processes the record x in the combiner, and returns (accu',out) where out is a list of records to output. It is required that initcombine is also set if combine is used.
- fincombine: is called at the end of a combiner
Sorting: the mapped elements are first compared by the integers returned by hash, and only if such integers are equal, the two elements are compared in detail by calling cmp.
See Mapred_sorters for useful definitions of hash and cmp.
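The two-level comparison performed with hash and cmp can be sketched in plain OCaml. This is a local illustration under stated assumptions, not code from Mapred_sorters: compare_records, and the hash and cmp functions defined below, are hypothetical, and the mapred_info argument is omitted.

```ocaml
(* Two-level comparison: the cheap integer hash decides first, and the
   detailed cmp is only called when both hashes are equal. *)
let compare_records ~hash ~cmp x y =
  let hx = hash x and hy = hash y in
  if hx <> hy then compare (hx : int) hy else cmp x y

(* Hypothetical hash and cmp for string records. *)
let hash (s : string) = Hashtbl.hash s
let cmp (a : string) (b : string) = String.compare a b

let sorted =
  List.sort (compare_records ~hash ~cmp) [ "pear"; "apple"; "pear"; "fig" ]
```

The resulting order follows the hash values first, so it is generally not alphabetical. Equal records still end up next to each other.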
The int_codec is used for representing intermediate files (output of the map phase, and input/output of the shuffle phases).
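The interplay of the reducer callbacks initfold, fold and finfold described above can be modelled locally as follows. This is a sketch under assumptions: run_reducer is a hypothetical helper, a plain list stands in for the sorted records of one partition, the mapred_info argument is omitted, and the word-counting callbacks are invented for illustration.

```ocaml
(* Local model of one reducer: initfold creates the accumulator for the
   partition, fold is called per record and may emit output records, and
   finfold (optional, emitting nothing by default) is called at the end. *)
let run_reducer ~initfold ~fold ?(finfold = fun _ -> []) partition records =
  let accu, out_rev =
    List.fold_left
      (fun (accu, out_rev) x ->
        let accu', out = fold accu x in
        (accu', List.rev_append out out_rev))
      (initfold partition, [])
      records
  in
  List.rev_append out_rev (finfold accu)

(* Hypothetical job: count runs of equal words in sorted input.
   The accumulator holds the current word and its count, if any. *)
let initfold _partition = None

let fold accu w =
  match accu with
  | Some (cur, n) when cur = w -> (Some (cur, n + 1), [])
  | Some (cur, n) -> (Some (w, 1), [ (cur, n) ])
  | None -> (Some (w, 1), [])

(* Flush the last pending word at the end of the reducer. *)
let finfold = function
  | Some (cur, n) -> [ (cur, n) ]
  | None -> []

let counts =
  run_reducer ~initfold ~fold ~finfold 0 [ "apple"; "fig"; "fig"; "pear" ]
(* counts = [("apple", 1); ("fig", 2); ("pear", 1)] *)
```

In this model the count of a word is emitted only when the next different word arrives, which is why finfold is needed to output the final word.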