Plasma GitLab Archive

Module Mapred_toolkit


module Mapred_toolkit: sig .. end
Map/reduce toolkit


There is a tutorial: Plasmamr_toolkit.

Registered functions



Functions can be registered at program initialization time, and get a unique ID. Later, it is possible to not only call such functions directly, but also remotely on task nodes. Remember that exactly the same executable must be running on the task nodes.

Registered functions are used to name functions that are filled into the placeholders of the map/reduce algorithm scheme (such as map and reduce, but also a few more).

Because the registration must happen at initialization time, it is effectively only possible to register globally defined functions, and not local functions defined inside other functions. (This limitation cannot currently be removed; a workaround is to pass all data via arguments.)
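
The following minimal sketch illustrates why registration is tied to initialization time. It is not the implementation used by Mapred_rfun; the names registry, register, invoke_by_id and upcase_id are hypothetical, and the table is restricted to string -> string functions to keep the example small. The point is that a top-level binding fills the table while the program starts, so every process running the same executable ends up with the same name-to-function mapping.

```ocaml
(* Hypothetical stand-in for a function registry (an assumption for
   illustration, not the Plasma implementation). *)
let registry : (string, string -> string) Hashtbl.t = Hashtbl.create 16

(* Store f under a unique name and return the name as its ID. *)
let register (name : string) (f : string -> string) : string =
  if Hashtbl.mem registry name then
    invalid_arg ("register: duplicate name " ^ name);
  Hashtbl.add registry name f;
  name

(* A top-level binding: evaluated during program initialization, so the
   entry exists in every process that runs this executable. *)
let upcase_id : string = register "upcase" String.uppercase_ascii

(* Look the function up by ID and call it; raises Not_found for an
   unknown ID. *)
let invoke_by_id (id : string) (arg : string) : string =
  (Hashtbl.find registry id) arg
```

A function defined inside another function would only be registered once the enclosing function runs, which may never happen in the remote process. This is why only globally defined functions can be registered in practice.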

A camlp4 preprocessor helps to define registered functions. Use it like this:

      let my_function =
        <:rfun< larg1 larg2 ... largM @ rarg1 rarg2 ... rargN -> body >>
    

The "@" and "->" characters need to occur literally here. The function arguments before "@" are local arguments and can be omitted. The arguments after "@" are remote arguments, and at least one of these is mandatory. Remember that there is a local caller, and a task server executing the function. A local argument comes from the caller, and is sent to the task server (using marshalling). The remote arguments are, in contrast, supplied with values from the task server (e.g. a value previously computed in the task server). The type of my_function is something like

      my_function : L1 -> ... -> LM -> (R1 -> ... -> RN -> T) Mapred_rfun.rfun
    

(when the local arguments have types Li and the remote arguments have types Ri).
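
The split between local and remote arguments can be pictured with plain OCaml and the standard Marshal module. This is only a sketch under assumptions: send_local, scale and run_on_server are hypothetical names, and the real transport between the caller and the task server is not shown. The local argument is serialized on the caller side, while the remote argument is a value that already exists on the task server.

```ocaml
(* Caller side: the local argument is marshalled into a byte string. *)
let send_local (factor : int) : string =
  Marshal.to_string factor []

(* A globally defined function: local argument first, remote argument
   second. *)
let scale (factor : int) (value : int) : int =
  factor * value

(* Task-server side: unmarshal the local argument and combine it with a
   value that is only available on the server. *)
let run_on_server (payload : string) (remote_value : int) : int =
  let factor : int = Marshal.from_string payload 0 in
  scale factor remote_value
```

For example, run_on_server (send_local 3) 14 evaluates to 42: the factor 3 travels from the caller, and 14 is supplied on the server.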

The camlp4 extension is activated if you compile with

      ocamlfind ocamlc -syntax camlp4o -package mr_framework.toolkit ...
    

(or use the preprocessor directly: camlp4 pa_toolkit.cma).

If there are no local arguments, you can also define the function without camlp4 as

      let my_function =
        Mapred_rfun.register name (fun rarg1 ... rargN -> body)
    

Here, name needs to be a unique identifier for the function. Use Mapred_rfun.apply_partially to get the effect of local arguments.

Registered functions can, as a consequence of the value restriction, only be monomorphic. (The usual workaround of eta-expanding the functions is not applicable here.)

val invoke : ('a -> 'b) Mapred_rfun.rfun -> 'a -> 'b

Formats


type format = [ `Auto_input | `Fixed_size of int | `Line_structured | `Var_size ] 
How a file is split into records. See Plasmamr_file_formats for detailed explanations:

  • `Line_structured: A record is a line terminated by an LF byte
  • `Fixed_size n: A record has exactly a size of n bytes
  • `Var_size: This is a binary format allowing records of variable size
  • `Auto_input: Recognize the format automatically from the file name. If you specify this format, only reading files is supported, and writing files will raise an exception.
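
As an illustration of the two simplest formats, the sketch below splits an in-memory string into records. It is not the toolkit's reader: record_format and split_records are hypothetical names, only `Line_structured and `Fixed_size are modelled, and the treatment of an unterminated final line and of a length that is not a multiple of n are assumptions made for this example.

```ocaml
(* Hypothetical subset of the format type, for illustration only. *)
type record_format = [ `Line_structured | `Fixed_size of int ]

let split_records (fmt : record_format) (data : string) : string list =
  match fmt with
  | `Line_structured ->
      (* Each record is a line terminated by LF. Splitting on LF leaves
         an empty piece after the final LF, which is dropped here.
         Assumption: an unterminated last line is kept as a record. *)
      let pieces = String.split_on_char '\n' data in
      (match List.rev pieces with
       | "" :: rest -> List.rev rest
       | _ -> pieces)
  | `Fixed_size n ->
      (* Each record is exactly n bytes long. Assumption: input whose
         length is not a multiple of n is rejected. *)
      if n <= 0 then invalid_arg "split_records: n must be positive";
      if String.length data mod n <> 0 then
        invalid_arg "split_records: length is not a multiple of n";
      List.init (String.length data / n)
        (fun i -> String.sub data (i * n) n)
```

With this sketch, split_records `Line_structured "a\nb\n" yields ["a"; "b"], and split_records (`Fixed_size 2) "aabbcc" yields ["aa"; "bb"; "cc"].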


Place


module Place: sig .. end

Store


module Store: sig .. end

Sequences


module Seq: sig .. end

Distributed operations on sequences


module DSeq: sig .. end

Job definition


val toolkit_job : Mapred_def.mapred_env -> Mapred_def.mapred_job
This is a generic job definition that must be used together with the distributed algorithms in Mapred_toolkit.DSeq.
class toolkit_job : Mapred_def.mapred_env -> Mapred_def.mapred_job
The same job definition as the value toolkit_job above, provided as a class.
This web site is published by Informatikbüro Gerd Stolpmann