======================================================================
Examples for Netcamlbox
======================================================================

Compile with "make all". Only native-code executables are created
(the idea of camlboxes is to maximize performance).


speed:

This example creates two processes, and one of them sends N short
messages to the other, using camlboxes as the transport medium.

This is also a good introductory example. Many explanatory comments
have been added. (A minimal code sketch of the underlying
send/receive pattern is appended at the end of this file.)

Performance results (quad-core Opteron 1354, 64 bit mode):

$ /usr/bin/time ./speed 1000_000
Sum: 6000000
4.33user 1.94system 0:04.08elapsed 153%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+731minor)pagefaults 0swaps

This means that transferring one message takes around 4 microseconds
of wall-clock time.

Note that it is no surprise if this example runs even faster on a
single core: the cost of synchronizing the two processes is lower
there, and the example is too simple to benefit from a second core.


manymult:

This is about speeding up matrix multiplication on multicores.
n_workers worker processes are created, and each worker process has a
camlbox where it waits for new matrices to multiply. The master
process also has a camlbox where it receives the computed result
matrices. (A sketch of this master/worker layout is also appended at
the end of this file.)


unimult:

Sequential matrix multiplication, for comparison with manymult.

Performance (quad-core Opteron 1354, 64 bit mode):

The following table assumes the fixed parameters:

  n=100_000 samples
  n_workers=4

The reported times cover only the part of the program that actually
performs the computation, i.e. the setup time is left out.

size of matrix | wall-clock time for | wall-clock time for
               | manymult (seconds)  | unimult (seconds)
----------------------------------------------------------------------
   5 x 5       |        1.00         |        0.35
  10 x 10      |        2.34         |        1.31
  15 x 15      |        3.48         |        3.55
  20 x 20      |        5.59         |        7.72
  25 x 25      |        6.29         |       14.46
  30 x 30      |        9.10         |       59.62

For larger matrices the computation time per matrix grows, and the
overhead of communicating with the worker processes shrinks relative
to the total amount of work. This explains why the manymult program
beats unimult only for larger matrices. At roughly 15x15 matrices
manymult becomes faster than unimult.

If we take the unimult runtime at this matrix size to estimate the
time needed for an individual operation, we get t=35.5us
(3.55s / 100_000) as the time an operation must take before the
multi-core enabled program becomes faster. So, as a rule of thumb, an
elementary operation should take at least 35us for there to be any
prospect of a speed-up on multi-cores.

I have no convincing explanation why manymult is more than 4 times
faster than unimult for 30x30 matrices (with n_workers=4 one would
expect at most a 4-fold speed-up). One theory is that the OCaml heap
is smaller in each worker process than in the single unimult process,
which speeds up garbage collection.

By the way, the manymult numbers for n_workers=3 are sometimes
slightly better.
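
Sketch of the speed pattern:

The following is a minimal, untested sketch of the two-process
send/receive pattern that the speed example is built on. It relies on
the Netcamlbox API as documented in Ocamlnet (create_camlbox,
camlbox_sender, camlbox_send, camlbox_wait, camlbox_get_copy,
camlbox_delete, unlink_camlbox); the message type, the box name
"speed_demo", the message count and the slot sizes are made up for
illustration and are not taken from speed.ml. Build it against the
netcamlbox and unix findlib packages.

  (* Untested sketch: one receiving process, one sending process. *)

  let n_msgs = 10_000            (* hypothetical message count *)
  let box_name = "speed_demo"    (* hypothetical camlbox name; see the
                                    Netcamlbox docs for naming rules *)

  type msg = { seq : int; payload : int }

  let () =
    (* The receiving process creates the box: 16 slots of 512 bytes. *)
    let box =
      (Netcamlbox.create_camlbox box_name 16 512 : msg Netcamlbox.camlbox) in
    match Unix.fork () with
    | 0 ->
        (* Child: attach to the box as a sender and push messages.
           camlbox_send blocks while all slots are full. *)
        let sender = Netcamlbox.camlbox_sender box_name in
        for i = 1 to n_msgs do
          Netcamlbox.camlbox_send sender { seq = i; payload = i land 7 }
        done;
        exit 0
    | _child_pid ->
        (* Parent: wait for filled slots, copy the messages out of
           shared memory, and free the slots again. *)
        let received = ref 0 and sum = ref 0 in
        while !received < n_msgs do
          List.iter
            (fun slot ->
               let m = Netcamlbox.camlbox_get_copy box slot in
               sum := !sum + m.payload;
               Netcamlbox.camlbox_delete box slot;
               incr received)
            (Netcamlbox.camlbox_wait box)
        done;
        Printf.printf "received %d messages, sum = %d\n" !received !sum;
        Netcamlbox.unlink_camlbox box_name;
        ignore (Unix.wait ())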
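
Sketch of the manymult layout:

The following, equally untested, sketch shows the master/worker
arrangement described above: one camlbox per worker for incoming
tasks, plus one camlbox in the master for the results. The box names,
message types, slot sizes and the kill-based shutdown are invented
for this sketch and are not taken from manymult.ml. The result box is
deliberately given n_tasks slots so that workers never block while
the master is still distributing tasks.

  let n_workers = 4
  let n_tasks   = 1_000
  let dim       = 15                (* e.g. 15x15 matrices *)

  type task   = { id : int; a : float array array; b : float array array }
  type result = { task_id : int; c : float array array }

  let mat_mul a b =
    let n = Array.length a in
    Array.init n (fun i ->
      Array.init n (fun j ->
        let s = ref 0.0 in
        for k = 0 to n - 1 do s := !s +. a.(i).(k) *. b.(k).(j) done;
        !s))

  let () =
    (* One result box, large enough that workers never block on it. *)
    let res_box =
      (Netcamlbox.create_camlbox "mm_results" n_tasks 8192
         : result Netcamlbox.camlbox) in
    (* One small task box per worker. *)
    let task_boxes =
      Array.init n_workers
        (fun k ->
           (Netcamlbox.create_camlbox (Printf.sprintf "mm_task_%d" k) 4 16384
              : task Netcamlbox.camlbox)) in
    (* Fork the workers: each waits on its own box, multiplies, and
       sends the product to the master's result box. *)
    let pids =
      Array.map
        (fun tbox ->
           match Unix.fork () with
           | 0 ->
               let to_master = Netcamlbox.camlbox_sender "mm_results" in
               let rec loop () =
                 List.iter
                   (fun slot ->
                      let t = Netcamlbox.camlbox_get_copy tbox slot in
                      Netcamlbox.camlbox_delete tbox slot;
                      Netcamlbox.camlbox_send to_master
                        { task_id = t.id; c = mat_mul t.a t.b })
                   (Netcamlbox.camlbox_wait tbox);
                 loop ()
               in
               loop ()
           | pid -> pid)
        task_boxes in
    (* Master: distribute tasks round-robin, then collect all results. *)
    let senders =
      Array.init n_workers
        (fun k -> Netcamlbox.camlbox_sender (Printf.sprintf "mm_task_%d" k)) in
    let random_matrix () =
      Array.init dim (fun _ -> Array.init dim (fun _ -> Random.float 1.0)) in
    for i = 0 to n_tasks - 1 do
      Netcamlbox.camlbox_send senders.(i mod n_workers)
        { id = i; a = random_matrix (); b = random_matrix () }
    done;
    let got = ref 0 in
    while !got < n_tasks do
      List.iter
        (fun slot ->
           ignore (Netcamlbox.camlbox_get_copy res_box slot);
           Netcamlbox.camlbox_delete res_box slot;
           incr got)
        (Netcamlbox.camlbox_wait res_box)
    done;
    Printf.printf "collected %d results\n" !got;
    (* Shut the workers down and remove the boxes again. *)
    Array.iter (fun pid -> Unix.kill pid Sys.sigterm) pids;
    Array.iter (fun _ -> ignore (Unix.wait ())) pids;
    Netcamlbox.unlink_camlbox "mm_results";
    Array.iteri
      (fun k _ -> Netcamlbox.unlink_camlbox (Printf.sprintf "mm_task_%d" k))
      task_boxes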