module Mapred_io: sig .. end
Utility library for record-based I/O
The record reader can be used to iterate over a whole file, or only a part. For the latter, it is assumed that the file is processed bigblock by bigblock. Of course, it is then possible that the lines do not end at bigblock boundaries. However, it must be determined whether the block reader for bigblock N or the block reader for bigblock N+1 processes such lines. The following rules do that:
class type record_config = object .. end
class type record_reader = object .. end
class type record_writer = object .. end
val bigblock_size : Plasma_client.plasma_cluster -> int -> int
bigblock_size c suggested: Returns the real bigblock size for the suggested value. The real size is rounded up to the next block multiple. The blocksize is retrieved from c (synchronously).
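The rounding described for bigblock_size can be sketched as plain integer arithmetic. This is only an illustration of "rounded up to the next block multiple"; round_up_to_block is a hypothetical helper, not part of Mapred_io, and the blocksize is passed in directly instead of being retrieved from a cluster.

```ocaml
(* Hypothetical helper, not part of Mapred_io: rounds [suggested] up to
   the next multiple of [blocksize]. *)
let round_up_to_block ~blocksize suggested =
  if blocksize <= 0 then invalid_arg "blocksize must be positive";
  (* Adding blocksize - 1 before the integer division rounds up. *)
  ((suggested + blocksize - 1) / blocksize) * blocksize
```

A value that is already a block multiple is returned unchanged; any other positive value moves up to the next multiple.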
val read_file : Plasma_client.plasma_cluster ->
record_config ->
string -> int64 -> int64 -> record_reader
read_file c rc name block len: Reads from name, starting at block and ending at block+len-1. Reading is done in a separate transaction. Note that len>=1 is a requirement here. Also, block must be a multiple of the size of bigblocks.
The cluster c is set to aborted state when not used. Note that this also affects all transactions unrelated to read_file, so it is best to create a separate plasma_cluster object for reading.
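The argument requirements of read_file can be checked up front. The following sketch is a hypothetical helper, not part of Mapred_io. It assumes that the bigblock size is expressed in blocks (the parameter bigblock_blocks is an invented name) and returns the last block of the range, block+len-1.

```ocaml
(* Hypothetical helper, not part of Mapred_io. Assumes the bigblock size
   is given as a number of blocks. *)
let last_block ~bigblock_blocks block len =
  if Int64.compare bigblock_blocks 1L < 0 then
    invalid_arg "bigblock_blocks must be >= 1";
  (* len >= 1 is required *)
  if Int64.compare len 1L < 0 then
    invalid_arg "len must be >= 1";
  (* block must be a multiple of the bigblock size *)
  if Int64.rem block bigblock_blocks <> 0L then
    invalid_arg "block must be a multiple of the bigblock size";
  (* The range ends at block + len - 1 *)
  Int64.pred (Int64.add block len)
```

With a bigblock of 16 blocks, a read starting at block 32 with len 16 covers blocks 32 to 47; a start block of 33 is rejected.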
type sync_readers = (unit -> record_reader) list
type async_readers = (unit -> record_reader Uq_engines.engine) list
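Both list types hold functions of type unit -> ..., so a reader only comes into existence when its function is called. A minimal, self-contained model of consuming such a list in order is sketched below. The type simple_reader and the functions reader_of_list, chain and drain are hypothetical stand-ins for illustration; they are not the record_reader class type and do not use Plasma or Uq_engines.

```ocaml
(* Hypothetical stand-in for a reader: returns the next record, or None
   at the end of the input. *)
type simple_reader = unit -> string option

(* Builds a reader over an in-memory list of records. *)
let reader_of_list (records : string list) : simple_reader =
  let rest = ref records in
  fun () ->
    match !rest with
    | [] -> None
    | r :: tl -> rest := tl; Some r

(* Chains a list of reader thunks. A thunk is only called once the
   previous reader has been exhausted. *)
let chain (thunks : (unit -> simple_reader) list) : simple_reader =
  let pending = ref thunks in
  let current = ref None in
  let rec next () =
    match !current with
    | Some rd ->
        (match rd () with
         | Some _ as r -> r
         | None -> current := None; next ())
    | None ->
        (match !pending with
         | [] -> None
         | t :: tl -> pending := tl; current := Some (t ()); next ())
  in
  next

(* Collects all remaining records of a reader into a list. *)
let drain (rd : simple_reader) : string list =
  let rec go acc =
    match rd () with
    | None -> List.rev acc
    | Some r -> go (r :: acc)
  in
  go []
```

Keeping the elements as functions rather than ready-made readers means that at most one underlying reader needs to be open at a time in this model.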
val read_multiple : Plasma_client.plasma_cluster ->
record_config ->
readers:[ `Async of async_readers | `Sync of sync_readers ] ->
unit -> record_reader
Returns a record reader that reads from the given readers one after the other. The readers can be given in a synchronous form, or in an asynchronous form. The latter is preferable when the reader is in asynchronous mode (i.e. when to_fd_e is running).
val write_file : Plasma_client.plasma_cluster ->
record_config -> string -> record_writer
write_file c rc name: Appends records to this file (which must already exist). Writing is done in separate transactions.
As with read_file, the cluster c is set to aborted state when not used. Note that this also affects all transactions unrelated to write_file, so it is best to create a separate plasma_cluster object for writing.
val write_multiple : Plasma_client.plasma_cluster ->
record_config ->
string ->
int64 ->
create_sync:(string -> int -> string) ->
?create_async:(string -> int -> string Uq_engines.engine) ->
unit -> record_writer
write_multiple c rc prefix limit create_sync create_async (): Writes into a sequence of files whose names are composed of prefix followed by an integer k. The files are created by calling create_sync prefix k. A new file is started when the current file reaches the size limit (in bytes).
create_async: If passed, this function is used instead of create_sync in asynchronous mode (i.e. when from_fd_e is running).
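The file rotation described for write_multiple can be modelled without any I/O. The sketch below assigns each record, given only by its size in bytes, to a file name. Several details are assumptions made for illustration: numbering starts at 0, the name is formed by plain concatenation of prefix and k, the limit is checked before each record is written, and sizes are plain int rather than int64. The functions file_name and assign_files are hypothetical and not part of Mapred_io.

```ocaml
(* Hypothetical naming scheme: prefix followed by the integer k. *)
let file_name prefix k = prefix ^ string_of_int k

(* Assigns each record size to a file. A new file is started as soon as
   the current file has reached [limit] bytes. Returns the
   (file name, record size) pairs in input order. *)
let assign_files ~limit ~prefix (sizes : int list) : (string * int) list =
  let step (k, used, acc) size =
    (* Rotate before writing if the current file has reached the limit. *)
    let k, used = if used >= limit then (k + 1, 0) else (k, used) in
    (k, used + size, (file_name prefix k, size) :: acc)
  in
  let (_, _, acc) = List.fold_left step (0, 0, []) sizes in
  List.rev acc
```

Under these assumptions a file may exceed the limit by up to one record, because the check happens only between records.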
val create_file_e : ?repl:int -> Plasma_client.plasma_cluster -> string -> unit Uq_engines.engine
val create_file : ?repl:int -> Plasma_client.plasma_cluster -> string -> unit
create_file c name: Creates this file exclusively. repl is the replication factor, 0 by default (i.e. use server default).
val delete_file : Plasma_client.plasma_cluster -> string -> unit
val file_blocks : Plasma_client.plasma_cluster -> string -> int64