Plasma GitLab Archive
Projects Blog Knowledge

Module Mapred_toolkit.Seq


module Seq: sig .. end


A sequence is a view on a store. It is possible to iterate over the elements of the sequence in normal order. One can also add new elements to the end of a sequence, and get a new sequence where the new elements are found at the end.

A sequence is always bound to a store. The store is the entity which is modified by sequence operations. The sequence as such behaves as if it were immutable (like a list).

All the following operations are only executed locally (no distribution).

type ('a, 'b) seq 
A sequence (el,mode) seq contains elements of type el, and can be accessed according to mode:
  • mode=[`R]: the sequence is read-only
  • mode=[`W]: the sequence is write-only
  • mode=[`R|`W]: the sequence is read-write

exception End_of_sequence
val read : 'a Mapred_toolkit.Store.store -> ('a, [ `R ]) seq
Returns the sequence view such that the first element of the sequence is the first element of the store, i.e. reads the store from the beginning.

This is not allowed for write-only stores!

val extend : 'a Mapred_toolkit.Store.store -> ('a, [ `W ]) seq
Returns the sequence view on the end of the store. The returned seq is always empty. By adding new elements to the sequence, new elements are added to the store, though.
val reread : ('a, 'b) seq -> ('a, [ `R ]) seq
reads the elements from the store obtained by calling Store.read
val store : ('a, 'b) seq -> 'a Mapred_toolkit.Store.store
The underlying store of the sequence
val notebook : unit -> ('a, [ `R | `W ]) seq
creates an in-memory sequence (not linked to a file)
val is_empty : ('a, [> `R ]) seq -> bool
tests whether the sequence is empty
val hd : ('a, [> `R ]) seq -> 'a
returns the head element of the sequence
val tl : ('a, [> `R ] as 'b) seq -> ('a, 'b) seq
returns the tail of the sequence

hd and tl raise End_of_sequence if applied to an empty sequence
val add : 'a ->
('a, [> `W ] as 'b) seq -> ('a, 'b) seq
let s2 = add x s1: Adds the element to the underlying store, and returns a new sequence where the new element is visible at the end. This means the last element of s2 is now x. The view provided by s1 does not change.

It is a requirement that the last element of s1 is the last element of the underlying store. Otherwise the add fails.

val addl : 'a list ->
('a, [> `W ] as 'b) seq -> ('a, 'b) seq
adds several elements
val append : ('a, [> `W ] as 'b) seq ->
('a, [> `R ]) seq -> ('a, 'b) seq
let s3 = append s1 s2: Adds all elements visible via s2 to the end of the store of s1, and returns a new seq s3 where the new elements are visible. The views provided by s1 and s2 do not change.
val flush : ('a, 'b) seq -> unit
Flushes the underlying store
val length : ('a, 'b) seq -> int
val length64 : ('a, 'b) seq -> int64
val hdl : int -> ('a, [> `R ]) seq -> 'a list
returns the first up to n elements as list
val iter : ('a -> unit) -> ('a, [> `R ]) seq -> unit
iter f s: iterates over the elements of s and calls f x for every element x
val fold : ('a -> 'b -> 'a) -> 'a -> ('b, [> `R ]) seq -> 'a
fold f a0 s: if the sequence contains the elements [x0; x1; ...; x(n-1)] this function computes
	  f (f (f a0 x0) x1 ... ) x(n-1)

val map : ('a -> 'b) ->
('a, [> `R ]) seq ->
('b, [> `W ] as 'c) seq -> ('b, 'c) seq
let s3 = map f s1 s2: Maps the elements of s1 and adds them to s2
val mapl : ('a -> 'b list) ->
('a, [> `R ]) seq ->
('b, [> `W ] as 'c) seq -> ('b, 'c) seq
let s3 = map f s1 s2: Maps the elements of s1 to lists and adds them to s2
val sort : hash:('a -> int) ->
cmp:('a -> 'a -> int) ->
('a, [> `R ]) seq ->
('a, [> `W ] as 'b) seq -> ('a, 'b) seq
sort ~hash ~cmp s1 s2: Sorts the elements of s1 and appends the sorted result to s2, and returns this sequence.

It is first sorted by the value hash x of each element of s1. If this results in the same value for two elements x and y, these elements are directly compared with cmp x y.

The function hash may return a non-negative 30 bit wide integer. cmp can return 0 meaning equality, a positive number meaning that that the first argument is bigger, or a negative number meaning that the second argument is bigger.

It is required that cmp x y = 0 implies hash x = hash y.

Currently, only in-memory sorting is implemented.

val mapl_sort_fold : mapl:('a -> 'b list) ->
hash:('b -> int) ->
cmp:('b -> 'b -> int) ->
initfold:(int -> 'c) ->
fold:('c -> 'b -> 'c * 'd list) ->
?finfold:('c -> 'd list) ->
partition_of:('b -> int) ->
partitions:int ->
'a Mapred_toolkit.Place.t ->
'd Mapred_toolkit.Place.t -> ('d, [ `W ]) seq list
mapl_sort_fold ... p_in p_out:

This is a locally running version of Mapred_toolkit.DSeq.mapl_sort_fold, useful for testing map/reduce programs on small amounts of data. The following description is formal; for an easier to understand version look at the tutorial Plasmamr_toolkit.

The function returns a list of sequences

 [s0; s1; ...; s(P-1)] 

where P=partitions. Each s(k) is the result of applying a fold pass to p(k) which is defined below. The list

 [p0; p1; ...; p(P-1)] 

are the sequences containing the sorted and mapped input, split into P partitions. In particular, the partition p(k) contains the sorted elements x where partition_of x = k. The elements x are taken from the set M = { mapl(y) | y in p_in} .

The sorting criterion is defined by hash and cmp. In order to decide the order of the elements x of p(k), the elements are first compared by hash x. If for two elements x1 and x2 the value hash x1 is smaller than hash x2, x1 is sorted before x2. If hash x1 is bigger than hash x2, x1 is sorted after x2. If the hash values are identical, the function cmp x1 x2 is called, and if a negative value is returned, x1 is sorted before x2, and if a positive value is returned, x1 is sorted after x2. The result 0 means that x1 and x2 are considered equal.

hash must return non-negative 30 bit integers. It is required that cmp x y = 0 implies hash x = hash y.

The fold pass is defined as follows: The sequence s(k) is computed from p(k) by the following scheme. Assumed that

 s(k) = [x0; x1; ...; x(n-1)] 

we define

 p(k) = out0 @ out1 @ ... @ out(n) 

where the out(i) part lists are defined by:

	  let a0 = initfold k
	  let (a1, out0) = fold a0 x0
	  let (a2, out1) = fold a1 x1
	  ...
	  let (a(n), out(n-1)) = fold a(n-1) x(n-1)
	  let out(n) = finfold a(n)

If finfold is not passed as argument, it is assumed to be finfold = fun _ -> [].

As a side effect, for each output sequence s(k) a new file is created in pl_out.

This web site is published by Informatikbüro Gerd Stolpmann
Powered by Caml