Plasma GitLab Archive
Projects Blog Knowledge

Module Mapred_sched

module Mapred_sched: sig .. end
Scheduler

type plan_config 
val configure_plan : ?keep_temp_files:bool ->
planning_capacity:float ->
internal_suffix:string ->
output_suffix:string ->
Mapred_def.mapred_job_config ->
Mapred_config.mapred_config -> plan_config
configure_plan jc conf

Parameters:

  • keep_temp_files: if true, temporary files created during the map/reduce execution are not immediately deleted
  • planning_capacity: how many cores are totally available. This parameter can be retrieved at runtime via Mapred_job_exec.planning_capacity.
  • internal_suffix: This is the filename suffix added to names of intermediate files
  • output_suffix: This is the filename suffix added to names of final files

type plan 
A plan contains:
  • tasks
  • dependencies between tasks
  • whether a task is done or not done

val create_plan : ?dn_identities:string list ->
Mapred_fs.filesystem -> plan_config -> plan
Creates a plan. The plan is empty.

  • dn_identities: This is an optional list of datanode identities. These are used in some circumstances as preferences for data blocks (currently only for the output of emap jobs)

val bigblock_size : plan -> int
The effective size of bigblocks. This is the value passed to configure_plan (via jc) rounded up to the next multiple of blocks.
val add_inputs : plan -> unit
Add the input files, and create map tasks. This involves the analysis of the block layout of the input files (expensive operation).
val add_map_output : plan ->
int ->
(Mapred_tasks.file_tag * Mapred_tasks.file) list -> Unix.inet_addr -> unit
Add these files as output of this map/emap task identified by this ID

The IP addr points to the machine that executed the map or emap task (which is also the likely storage for the files)

val plan_complete : plan -> bool
Respond whether everything is planned
val complete_inputs : plan -> unit
Declare that no more inputs will be added. This triggers the rest of the graph construction.
val executable_tasks : plan -> Mapred_tasks.task list
Returns all tasks that are executable. This list can be quite long! The order of this list suggests an optimal order of execution (from a "static" point of view)
val hosts : plan -> (string * Unix.inet_addr) list
returns all configured task server hosts
val mark_as_finished : plan -> Mapred_tasks.task -> unit
Marks this task as finished
val mark_as_started : plan ->
Mapred_tasks.task -> Unix.inet_addr -> int -> bool -> unit
Marks this task as started, together with the node where it is running, and the runtime task ID and the misallocation flag
val remove_marks : plan -> Mapred_tasks.task -> unit
Revokes mark_as_started or mark_as_finished
val task_depends_on_list : plan -> Mapred_tasks.task -> Mapred_tasks.task list
Returns the list of tasks a given task depends on
val plan_finished : plan -> bool
Whether everything is done
val n_running : plan -> int
val n_finished : plan -> int
val n_total : plan -> int
Stats of tasks
val avg_running : plan -> float
Average number of running tasks
val avg_runnable : plan -> float
Average number of runnable tasks (excl. running tasks)
val avg_runqueue : plan -> float
Average number of running or runnable tasks
val round_points : plan -> Mapred_tasks.task -> float
Returns points for the task, such that tasks in later rounds get more points than tasks in earlier rounds.
val greediness_points : plan -> Mapred_tasks.task -> float
Returns points for the task, and the points indicate how valuable the task is for a "greedy" job execution strategy. The more points the more valuable.

If the (imagined) completion of the task t enabled that other tasks could be run, t gets as many points as tasks would be newly runnable. The idea here is to prefer tasks that make the most other tasks runnable ("greediness").

If t does not yet make a task u runnable, but just fulfills one more precondition among others, t does not get a whole point for u but just the fraction r/n where r are the fulfilled preconditions of u, and n are all preconditions of u.

val print_plan : Netchannels.out_obj_channel -> plan -> unit
Debug printing to stdout
val generate_svg : plan -> string
Return the graph as SVG (once plan complete)
val task_stats : plan -> Mapred_tasks.task -> int * int
Returns:
  • number of input files
  • number of non-local input files
Raise Not_found if the task has never been started
This web site is published by Informatikbüro Gerd Stolpmann
Powered by Caml