How to speed up simulations by parallel execution
-------------------------------------------------

A way to speed up your simulations is to run them in parallel, taking advantage of all the
processors and memory available on your machine. This can be done by using the Message Passing
Interface (MPI) along with the distributed simulator class `provided by NS-3
<http://www.nsnam.org/docs/models/html/distributed.html#mpi-for-distributed-simulation>`_.

To make use of MPI, the network topology needs to be partitioned in a proper way, as the
potential speedup cannot exceed the number of topology partitions. However, it should be noted
that in NS-3 the simulation can be divided among logical processors only across point-to-point
links. Currently, only the applications running on a node can be executed on a separate logical
processor, while the whole network topology will be created in each parallel execution. Lastly,
MPI requires the exchange of messages among the logical processors, thus imposing a
communication overhead during execution.
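
For reference, the sketch below shows the MPI-related boilerplate of a minimal distributed
scenario (the simulation time, the printed message, and the overall structure are illustrative,
not taken verbatim from any example); it assumes NS-3 has been configured with ``--enable-mpi``
so that the ``mpi`` module and the ``MpiInterface`` class are available:

.. code-block:: c++

    #include "ns3/core-module.h"
    #include "ns3/mpi-interface.h"

    #include <iostream>

    using namespace ns3;

    int
    main(int argc, char* argv[])
    {
      // Select the distributed simulator implementation before enabling MPI
      GlobalValue::Bind("SimulatorImplementationType",
                        StringValue("ns3::DistributedSimulatorImpl"));

      // Enable the parallel simulator with the command-line arguments
      MpiInterface::Enable(&argc, &argv);

      std::cout << "This is logical processor " << MpiInterface::GetSystemId() << " of "
                << MpiInterface::GetSize() << std::endl;

      // ... create the (full) topology and install rank-specific applications here ...

      Simulator::Stop(Seconds(20.0));
      Simulator::Run();
      Simulator::Destroy();

      // Shut down MPI cleanly at the end of the simulation
      MpiInterface::Disable();
      return 0;
    }

Everything else in a distributed scenario (topology creation, NDN stack installation,
application setup) looks the same as in a sequential scenario, subject to the partitioning
considerations described below.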

Designing a parallel simulation scenario
----------------------------------------

In order to run simulation scenarios using MPI, all you need to do is partition your network
topology in a proper way. That is to say, to maximize the benefits of parallelization, you need
to distribute the workload equally among the logical processors.

The full topology will always be created in each parallel execution (on each "rank" in MPI
terms), regardless of the individual node system IDs. Only the applications are specific to a
rank. For example, consider node 1 on logical processor (LP) 1 and node 2 on LP 2, with a
traffic generator on node 1. Both node 1 and node 2 will be created on both LP 1 and LP 2;
however, the traffic generator will only be installed on LP 1. While this is not optimal for
memory efficiency, it does simplify routing, since all current routing implementations in ns-3
will work with distributed simulation.
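
To make this concrete, the following fragment is a hedged sketch of how nodes can be assigned
to ranks and how an application can be restricted to the rank that owns its node. It is assumed
to run inside ``main()`` after ``MpiInterface::Enable``; the helper names, prefix, and frequency
value are illustrative and not a verbatim excerpt from the example:

.. code-block:: c++

    // Assumed headers: ns3/core-module.h, ns3/point-to-point-module.h,
    // ns3/mpi-interface.h, ns3/ndnSIM-module.h (everything is in namespace ns3)

    // Both ranks execute this code and therefore create both nodes;
    // the constructor argument is the system id (MPI rank) that owns the node
    Ptr<Node> node1 = CreateObject<Node>(0); // assigned to rank 0
    Ptr<Node> node2 = CreateObject<Node>(1); // assigned to rank 1

    // Point-to-point link between the two nodes; such links are the only
    // places where the simulation may be divided between ranks
    PointToPointHelper p2p;
    p2p.Install(node1, node2);

    // The NDN stack is installed on every node on every rank
    ndn::StackHelper ndnHelper;
    ndnHelper.InstallAll();

    // The traffic generator is installed only on the rank that owns node1
    if (MpiInterface::GetSystemId() == node1->GetSystemId()) {
      ndn::AppHelper consumerHelper("ns3::ndn::ConsumerCbr");
      consumerHelper.SetPrefix("/prefix");
      consumerHelper.SetAttribute("Frequency", StringValue("10"));
      consumerHelper.Install(node1);
    }

Installing the application inside the ``if`` is what keeps the workload per-rank, while the
nodes, links, and NDN stack are deliberately duplicated on every rank, as described above.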

For more information, you can take a look at the `NS-3 MPI documentation
<http://www.nsnam.org/docs/models/html/distributed.html#mpi-for-distributed-simulation>`_.

Compiling and running ndnSIM with MPI support
---------------------------------------------

- Install MPI

  On Ubuntu:

  .. code-block:: bash

      sudo apt-get install openmpi-bin openmpi-common openmpi-doc libopenmpi-dev

  On Fedora:

  .. code-block:: bash

      sudo yum install openmpi openmpi-devel

  On OS X with Homebrew:

  .. code-block:: bash

      brew install open-mpi

- Compile ndnSIM with MPI support

  You can compile ndnSIM with MPI support by running ``./waf configure`` with the
  ``--enable-mpi`` parameter, along with any other parameters of your preference. For example,
  to configure with examples and MPI support in optimized mode:

  .. code-block:: bash

      cd <ns-3-folder>
      ./waf configure -d optimized --enable-examples --enable-mpi

- Run ndnSIM with MPI support

  To run a simulation scenario using MPI, you need to type:

  .. code-block:: bash

      mpirun -np <number_of_processors> ./waf --run=<scenario_name>

.. _simple scenario with MPI support:

Simple parallel scenario using MPI
----------------------------------

This scenario simulates a network topology consisting of two nodes in parallel. Each node
is assigned to a dedicated logical processor.

The default parallel synchronization strategy implemented in the DistributedSimulatorImpl
class is based on a globally synchronized algorithm using an MPI collective operation to
synchronize simulation time across all LPs. A second synchronization strategy, based on local
communication and null messages, is implemented in the NullMessageSimulatorImpl class. For
the null message strategy the global all-to-all gather is not required; LPs only need to
communicate with LPs that share point-to-point links. The algorithm to use is controlled by
the ns-3 global value ``SimulatorImplementationType``.

The strategy can be selected according to the value of ``nullmsg``. If ``nullmsg`` is true,
then the local communication strategy is selected. If ``nullmsg`` is false, then the globally
synchronized strategy is selected. This parameter can be passed either as a command-line
argument or by directly modifying the simulation scenario.
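
As a sketch of how such a ``nullmsg`` flag can drive the choice (the variable and flag names
are illustrative; the two implementation type strings are the ones documented by NS-3):

.. code-block:: c++

    bool nullmsg = false; // default: globally synchronized strategy

    CommandLine cmd;
    cmd.AddValue("nullmsg", "Enable the use of null-message synchronization", nullmsg);
    cmd.Parse(argc, argv);

    // The simulator implementation is selected through an ns-3 global value
    if (nullmsg) {
      GlobalValue::Bind("SimulatorImplementationType",
                        StringValue("ns3::NullMessageSimulatorImpl"));
    }
    else {
      GlobalValue::Bind("SimulatorImplementationType",
                        StringValue("ns3::DistributedSimulatorImpl"));
    }

    // Enable MPI after the implementation type has been chosen
    MpiInterface::Enable(&argc, &argv);

With this in place, the same compiled scenario can be switched between the two strategies at
run time, without recompiling.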

The best algorithm to use depends on the communication and event scheduling pattern of the
application. In general, null message synchronization algorithms will scale better, since local
communication scales better than the global all-to-all gather required by
DistributedSimulatorImpl. There are two known cases where the global synchronization performs
better. The first is when most LPs have point-to-point links with most other LPs, in other
words the LPs are nearly fully connected. In this case the null message algorithm will
generate more message passing traffic than the all-to-all gather. A second case where the
global all-to-all gather is more efficient is when there are long periods of simulation time
when no events are occurring. The all-to-all gather algorithm is able to quickly determine
the next event time globally, whereas the nearest-neighbor behavior of the null message
algorithm requires more communication to propagate that knowledge, since each LP is only aware
of its neighbors' next event times.

The following code represents all that is necessary to run this simple parallel scenario:

.. literalinclude:: ../../examples/ndn-simple-mpi.cpp
    :language: c++
    :linenos:
    :lines: 22-35,71-
    :emphasize-lines: 41-44, 54-58, 78-79, 89-90

If this code is placed into ``scratch/ndn-simple-mpi.cpp`` or NS-3 is compiled with examples
enabled, you can compare runtime on one and two CPUs using the following commands::

    # 1 CPU
    mpirun -np 1 ./waf --run=ndn-simple-mpi

    # 2 CPUs
    mpirun -np 2 ./waf --run=ndn-simple-mpi

The following table summarizes 9 executions on OS X 10.10 with a 2.3 GHz Intel Core i7: on a
single CPU, on two CPUs with global synchronization, and on two CPUs with null message
synchronization:

+-------------+-----------------+------------------+----------------+
| # of CPUs   | Real time, s    | User time, s     | System time, s |
+=============+=================+==================+================+
| 1           | 20.9 +- 0.14    | 20.6 +- 0.13     | 0.2 +- 0.01    |
+-------------+-----------------+------------------+----------------+
| 2 (global)  | 11.1 +- 0.13    | 21.9 +- 0.24     | 0.2 +- 0.02    |
+-------------+-----------------+------------------+----------------+
| 2 (nullmsg) | 11.4 +- 0.12    | 22.4 +- 0.21     | 0.2 +- 0.02    |
+-------------+-----------------+------------------+----------------+

Note that MPI will not always result in simulation speedup and can actually result in
performance degradation. This means that either the network is not properly partitioned or the
simulation cannot take advantage of the partitioning (e.g., the simulation time is dominated by
the application on one node).