======================================================================
Examples for Netcamlbox
======================================================================

Compile with "make all". Only native-code executables are created
(the idea of camlboxes is to maximize performance).


speed:

This example creates two processes, and one of them sends N short
messages to the other, using camlboxes as the transport medium.

This is also a good introductory example. Many explanatory comments
have been added. (A minimal code sketch of the underlying
send/receive pattern is appended at the end of this file.)

Performance results (quad-core Opteron 1354, 64 bit mode):

$ /usr/bin/time ./speed 1000_000
Sum: 6000000
4.33user 1.94system 0:04.08elapsed 153%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+731minor)pagefaults 0swaps

This means that transferring one message takes around 4 microseconds
of wall-clock time.

Note that it is no surprise if this example runs even faster on a
single core: the cost of synchronizing the two processes is lower
there, and the example is too simple to benefit from a second core.


manymult:

This is about speeding up matrix multiplication on multicores.
n_workers worker processes are created, and each worker process has a
camlbox where it waits for new matrices to multiply. The master
process also has a camlbox where it receives the computed result
matrices. (A sketch of this master/worker layout is also appended at
the end of this file.)


unimult:

Sequential matrix multiplication, for comparison with manymult.

Performance (quad-core Opteron 1354, 64 bit mode):

The following table assumes the fixed parameters:

  n=100_000 samples
  n_workers=4

The reported times cover only the part of the program that actually
performs the computation, i.e. the setup time is left out.

size of matrix | wall-clock time for | wall-clock time for
               | manymult (seconds)  | unimult (seconds)
----------------------------------------------------------------------
   5 x 5       |        1.00         |        0.35
  10 x 10      |        2.34         |        1.31
  15 x 15      |        3.48         |        3.55
  20 x 20      |        5.59         |        7.72
  25 x 25      |        6.29         |       14.46
  30 x 30      |        9.10         |       59.62

For larger matrices the computation time per matrix grows, and the
overhead of communicating with the worker processes shrinks relative
to the total amount of work. This explains why the manymult program
beats unimult only for larger matrices. At roughly 15x15 matrices
manymult becomes faster than unimult.

If we take the unimult runtime at this matrix size to estimate the
time needed for an individual operation, we get t=35.5us
(3.55s / 100_000) as the time an operation must take before the
multi-core enabled program becomes faster. So, as a rule of thumb, an
elementary operation should take at least 35us for there to be any
prospect of a speed-up on multi-cores.

I have no convincing explanation why manymult is more than 4 times
faster than unimult for 30x30 matrices (with n_workers=4 one would
expect at most a 4-fold speed-up). One theory is that the OCaml heap
is smaller in each worker process than in the single unimult process,
which speeds up garbage collection.

By the way, the manymult numbers for n_workers=3 are sometimes
slightly better.
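
Sketch of the speed pattern:

The following is a minimal, untested sketch of the two-process
send/receive pattern that the speed example is built on. It relies on
the Netcamlbox API as documented in Ocamlnet (create_camlbox,
camlbox_sender, camlbox_send, camlbox_wait, camlbox_get_copy,
camlbox_delete, unlink_camlbox); the message type, the box name
"speed_demo", the message count and the slot sizes are made up for
illustration and are not taken from speed.ml. Build it against the
netcamlbox and unix findlib packages.

  (* Untested sketch: one receiving process, one sending process. *)

  let n_msgs = 10_000            (* hypothetical message count *)
  let box_name = "speed_demo"    (* hypothetical camlbox name; see the
                                    Netcamlbox docs for naming rules *)

  type msg = { seq : int; payload : int }

  let () =
    (* The receiving process creates the box: 16 slots of 512 bytes. *)
    let box =
      (Netcamlbox.create_camlbox box_name 16 512 : msg Netcamlbox.camlbox) in
    match Unix.fork () with
    | 0 ->
        (* Child: attach to the box as a sender and push messages.
           camlbox_send blocks while all slots are full. *)
        let sender = Netcamlbox.camlbox_sender box_name in
        for i = 1 to n_msgs do
          Netcamlbox.camlbox_send sender { seq = i; payload = i land 7 }
        done;
        exit 0
    | _child_pid ->
        (* Parent: wait for filled slots, copy the messages out of
           shared memory, and free the slots again. *)
        let received = ref 0 and sum = ref 0 in
        while !received < n_msgs do
          List.iter
            (fun slot ->
               let m = Netcamlbox.camlbox_get_copy box slot in
               sum := !sum + m.payload;
               Netcamlbox.camlbox_delete box slot;
               incr received)
            (Netcamlbox.camlbox_wait box)
        done;
        Printf.printf "received %d messages, sum = %d\n" !received !sum;
        Netcamlbox.unlink_camlbox box_name;
        ignore (Unix.wait ())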
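
Sketch of the manymult layout:

The following, equally untested, sketch shows the master/worker
arrangement described above: one camlbox per worker for incoming
tasks, plus one camlbox in the master for the results. The box names,
message types, slot sizes and the kill-based shutdown are invented
for this sketch and are not taken from manymult.ml. The result box is
deliberately given n_tasks slots so that workers never block while
the master is still distributing tasks.

  let n_workers = 4
  let n_tasks   = 1_000
  let dim       = 15                (* e.g. 15x15 matrices *)

  type task   = { id : int; a : float array array; b : float array array }
  type result = { task_id : int; c : float array array }

  let mat_mul a b =
    let n = Array.length a in
    Array.init n (fun i ->
      Array.init n (fun j ->
        let s = ref 0.0 in
        for k = 0 to n - 1 do s := !s +. a.(i).(k) *. b.(k).(j) done;
        !s))

  let () =
    (* One result box, large enough that workers never block on it. *)
    let res_box =
      (Netcamlbox.create_camlbox "mm_results" n_tasks 8192
         : result Netcamlbox.camlbox) in
    (* One small task box per worker. *)
    let task_boxes =
      Array.init n_workers
        (fun k ->
           (Netcamlbox.create_camlbox (Printf.sprintf "mm_task_%d" k) 4 16384
              : task Netcamlbox.camlbox)) in
    (* Fork the workers: each waits on its own box, multiplies, and
       sends the product to the master's result box. *)
    let pids =
      Array.map
        (fun tbox ->
           match Unix.fork () with
           | 0 ->
               let to_master = Netcamlbox.camlbox_sender "mm_results" in
               let rec loop () =
                 List.iter
                   (fun slot ->
                      let t = Netcamlbox.camlbox_get_copy tbox slot in
                      Netcamlbox.camlbox_delete tbox slot;
                      Netcamlbox.camlbox_send to_master
                        { task_id = t.id; c = mat_mul t.a t.b })
                   (Netcamlbox.camlbox_wait tbox);
                 loop ()
               in
               loop ()
           | pid -> pid)
        task_boxes in
    (* Master: distribute tasks round-robin, then collect all results. *)
    let senders =
      Array.init n_workers
        (fun k -> Netcamlbox.camlbox_sender (Printf.sprintf "mm_task_%d" k)) in
    let random_matrix () =
      Array.init dim (fun _ -> Array.init dim (fun _ -> Random.float 1.0)) in
    for i = 0 to n_tasks - 1 do
      Netcamlbox.camlbox_send senders.(i mod n_workers)
        { id = i; a = random_matrix (); b = random_matrix () }
    done;
    let got = ref 0 in
    while !got < n_tasks do
      List.iter
        (fun slot ->
           ignore (Netcamlbox.camlbox_get_copy res_box slot);
           Netcamlbox.camlbox_delete res_box slot;
           incr got)
        (Netcamlbox.camlbox_wait res_box)
    done;
    Printf.printf "collected %d results\n" !got;
    (* Shut the workers down and remove the boxes again. *)
    Array.iter (fun pid -> Unix.kill pid Sys.sigterm) pids;
    Array.iter (fun _ -> ignore (Unix.wait ())) pids;
    Netcamlbox.unlink_camlbox "mm_results";
    Array.iteri
      (fun k _ -> Netcamlbox.unlink_camlbox (Printf.sprintf "mm_task_%d" k))
      task_boxes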