How to speed up simulations by parallel execution
-------------------------------------------------

A way to speed up your simulations is to run them in parallel, taking advantage of the power
of all the processors and the memory available on your machine. This can be done by using the
Message Passing Interface (MPI) along with the distributed simulator class `provided by NS-3
<http://www.nsnam.org/docs/models/html/distributed.html#mpi-for-distributed-simulation>`_.

To make use of MPI, the network topology needs to be partitioned in a proper way, as the
potential speedup cannot exceed the number of topology partitions. However, it should be
noted that dividing a simulation for distributed execution in NS-3 can only occur across
point-to-point links. Currently, only the applications running on a node can be executed in
a separate logical processor, while the whole network topology will be created in each
parallel execution. Lastly, MPI requires the exchange of messages among the logical
processors, thus imposing a communication overhead during the execution time.

Designing a parallel simulation scenario
----------------------------------------

In order to run simulation scenarios using MPI, all you need is to partition your network
topology in a proper way. That is to say, to maximize the benefits of parallelization, you
need to distribute the workload equally across the logical processors.

The full topology will always be created in each parallel execution (on each "rank" in MPI
terms), regardless of the individual node system IDs. Only the applications are specific to a
rank. For example, consider node 1 on logical processor (LP) 1 and node 2 on LP 2, with a
traffic generator on node 1. Both node 1 and node 2 will be created on both LP 1 and LP 2;
however, the traffic generator will only be installed on LP 1. While this is not optimal for
memory efficiency, it does simplify routing, since all current routing implementations in ns-3
will work with distributed simulation.
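
As an illustration, a minimal sketch of this partitioning pattern could look like the
following (the node names, the link delay value, and the overall structure are illustrative
assumptions, not code taken from the bundled example; the pattern of assigning a system ID at
node creation time is the standard ns-3 distributed API):

.. code-block:: c++

    #include "ns3/core-module.h"
    #include "ns3/network-module.h"
    #include "ns3/point-to-point-module.h"
    #include "ns3/mpi-interface.h"

    using namespace ns3;

    int
    main(int argc, char* argv[])
    {
      // Select a distributed simulator implementation, then enable MPI.
      // This must happen before any nodes are created.
      GlobalValue::Bind("SimulatorImplementationType",
                        StringValue("ns3::DistributedSimulatorImpl"));
      MpiInterface::Enable(&argc, &argv);
      uint32_t systemId = MpiInterface::GetSystemId();

      // Both nodes are created on every rank, but each one is owned by one LP
      Ptr<Node> node1 = CreateObject<Node>(0); // owned by rank 0
      Ptr<Node> node2 = CreateObject<Node>(1); // owned by rank 1

      // Partitioning can only occur across point-to-point links; the link
      // delay defines the lookahead between the two LPs
      PointToPointHelper p2p;
      p2p.SetChannelAttribute("Delay", StringValue("10ms"));
      p2p.Install(node1, node2);

      // Install the traffic generator only on the rank that owns node1
      if (systemId == 0) {
        // ... application installation for node1 goes here ...
      }

      Simulator::Stop(Seconds(10.0));
      Simulator::Run();
      Simulator::Destroy();
      MpiInterface::Disable();
      return 0;
    }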

For more information, you can take a look at the `NS-3 MPI documentation
<http://www.nsnam.org/docs/models/html/distributed.html#mpi-for-distributed-simulation>`_.

Compiling and running ndnSIM with MPI support
---------------------------------------------

- Install MPI

  On Ubuntu:

  .. code-block:: bash

      sudo apt-get install openmpi-bin openmpi-common openmpi-doc libopenmpi-dev

  On Fedora:

  .. code-block:: bash

      sudo yum install openmpi openmpi-devel

  On OS X with Homebrew:

  .. code-block:: bash

      brew install open-mpi

- Compile ndnSIM with MPI support

  You can compile ndnSIM with MPI support using ./waf configure by adding the parameter
  ``--enable-mpi`` along with the parameters of your preference. For example, to configure
  with examples and MPI support in optimized mode:

  .. code-block:: bash

      cd <ns-3-folder>
      ./waf configure -d optimized --enable-examples --enable-mpi

- Run ndnSIM with MPI support

  To run a simulation scenario using MPI, you need to type:

  .. code-block:: bash

      mpirun -np <number_of_processors> ./waf --run=<scenario_name>

.. _simple scenario with MPI support:

Simple parallel scenario using MPI
----------------------------------

This scenario simulates a network topology consisting of two nodes in parallel, with each
node assigned to a dedicated logical processor.

The default parallel synchronization strategy implemented in the DistributedSimulatorImpl
class is based on a globally synchronized algorithm that uses an MPI collective operation to
synchronize simulation time across all LPs. A second synchronization strategy, based on local
communication and null messages, is implemented in the NullMessageSimulatorImpl class. For
the null message strategy the global all-to-all gather is not required; LPs only need to
communicate with LPs that share point-to-point links. The algorithm to use is controlled by
the ns-3 global value ``SimulatorImplementationType``.

The strategy can be selected according to the value of ``nullmsg``. If ``nullmsg`` is true,
the local communication strategy is selected; if ``nullmsg`` is false, the globally
synchronized strategy is selected. This parameter can be passed either as a command line
argument or by directly modifying the simulation scenario.
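
For instance, the selection logic can be written as in the following sketch, which mirrors
the pattern used in ns-3's distributed examples (it is assumed to live in ``main()`` before
any topology is created; the rest of the scenario is omitted):

.. code-block:: c++

    bool nullmsg = false;

    CommandLine cmd;
    cmd.AddValue("nullmsg", "Enable the use of null-message synchronization", nullmsg);
    cmd.Parse(argc, argv);

    // Bind the simulator implementation before enabling MPI
    if (nullmsg) {
      GlobalValue::Bind("SimulatorImplementationType",
                        StringValue("ns3::NullMessageSimulatorImpl"));
    }
    else {
      GlobalValue::Bind("SimulatorImplementationType",
                        StringValue("ns3::DistributedSimulatorImpl"));
    }

    // Enable the distributed simulator with the command line arguments
    MpiInterface::Enable(&argc, &argv);

With such a flag in place, the strategy can be switched at run time, for example with
``mpirun -np 2 ./waf --run="<scenario_name> --nullmsg"``.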

The best algorithm to use depends on the communication and event scheduling pattern of the
application. In general, null message synchronization algorithms will scale better, since
local communication scales better than the global all-to-all gather required by
DistributedSimulatorImpl. There are two known cases where the global synchronization performs
better. The first is when most LPs have point-to-point links with most other LPs, in other
words when the LPs are nearly fully connected; in this case the null message algorithm will
generate more message passing traffic than the all-to-all gather. The second case where the
global all-to-all gather is more efficient is when there are long periods of simulation time
in which no events are occurring. The all-to-all gather algorithm is able to quickly
determine the next event time globally, while the nearest-neighbor behavior of the null
message algorithm requires more communications to propagate that knowledge, as each LP is
only aware of the next event times of its neighbors.

The following code represents all that is necessary to run this simple parallel scenario:

.. literalinclude:: ../../examples/ndn-simple-mpi.cpp
    :language: c++
    :linenos:
    :lines: 22-35,71-
    :emphasize-lines: 41-44, 54-58, 78-79, 89-90

If this code is placed into ``scratch/ndn-simple-mpi.cpp`` or NS-3 is compiled with examples
enabled, you can compare the runtime on one and two CPUs using the following commands::

    # 1 CPU
    mpirun -np 1 ./waf --run=ndn-simple-mpi

    # 2 CPUs
    mpirun -np 2 ./waf --run=ndn-simple-mpi


The following table summarizes 9 executions on OS X 10.10 with a 2.3 GHz Intel Core i7: on a
single CPU, on two CPUs with global synchronization, and on two CPUs with null message
synchronization:

+-------------+-----------------+------------------+----------------+
| # of CPUs   | Real time, s    | User time, s     | System time, s |
+=============+=================+==================+================+
| 1           | 20.9 ± 0.14     | 20.6 ± 0.13      | 0.2 ± 0.01     |
+-------------+-----------------+------------------+----------------+
| 2 (global)  | 11.1 ± 0.13     | 21.9 ± 0.24      | 0.2 ± 0.02     |
+-------------+-----------------+------------------+----------------+
| 2 (nullmsg) | 11.4 ± 0.12     | 22.4 ± 0.21      | 0.2 ± 0.02     |
+-------------+-----------------+------------------+----------------+

Note that MPI will not always result in simulation speedup; it can actually result in
performance degradation. This happens when either the network is not properly partitioned or
the simulation cannot take advantage of the partitioning (e.g., the simulation time is
dominated by the application on one node).