How to speed up simulations by parallel execution
-------------------------------------------------

A way to speed up your simulations is to run them in parallel, taking advantage of all the
processors and memory available on your machine. This can be done by using the Message Passing
Interface (MPI) along with the distributed simulator class `provided by NS-3
<http://www.nsnam.org/docs/models/html/distributed.html#mpi-for-distributed-simulation>`_.

To make use of MPI, the network topology needs to be partitioned in a proper way, as the
potential speedup cannot exceed the number of topology partitions. However, it should be noted
that in NS-3 the simulation can be divided among logical processors only across point-to-point
links. Currently, only the applications running on a node can be executed on a separate logical
processor, while the whole network topology will be created in each parallel execution. Lastly,
MPI requires the exchange of messages among the logical processors, thus imposing a
communication overhead during execution.
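
For reference, the sketch below shows the MPI-related boilerplate of a minimal distributed
scenario (the simulation time, the printed message, and the overall structure are illustrative,
not taken verbatim from any example); it assumes NS-3 has been configured with ``--enable-mpi``
so that the ``mpi`` module and the ``MpiInterface`` class are available:

.. code-block:: c++

    #include "ns3/core-module.h"
    #include "ns3/mpi-interface.h"

    #include <iostream>

    using namespace ns3;

    int
    main(int argc, char* argv[])
    {
      // Select the distributed simulator implementation before enabling MPI
      GlobalValue::Bind("SimulatorImplementationType",
                        StringValue("ns3::DistributedSimulatorImpl"));

      // Enable the parallel simulator with the command-line arguments
      MpiInterface::Enable(&argc, &argv);

      std::cout << "This is logical processor " << MpiInterface::GetSystemId() << " of "
                << MpiInterface::GetSize() << std::endl;

      // ... create the (full) topology and install rank-specific applications here ...

      Simulator::Stop(Seconds(20.0));
      Simulator::Run();
      Simulator::Destroy();

      // Shut down MPI cleanly at the end of the simulation
      MpiInterface::Disable();
      return 0;
    }

Everything else in a distributed scenario (topology creation, NDN stack installation,
application setup) looks the same as in a sequential scenario, subject to the partitioning
considerations described below.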

Designing a parallel simulation scenario
----------------------------------------

In order to run simulation scenarios using MPI, all you need to do is partition your network
topology in a proper way. That is to say, to maximize the benefits of parallelization, you need
to distribute the workload equally among the logical processors.

The full topology will always be created in each parallel execution (on each "rank" in MPI
terms), regardless of the individual node system IDs. Only the applications are specific to a
rank. For example, consider node 1 on logical processor (LP) 1 and node 2 on LP 2, with a
traffic generator on node 1. Both node 1 and node 2 will be created on both LP 1 and LP 2;
however, the traffic generator will only be installed on LP 1. While this is not optimal for
memory efficiency, it does simplify routing, since all current routing implementations in ns-3
will work with distributed simulation.
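
To make this concrete, the following fragment is a hedged sketch of how nodes can be assigned
to ranks and how an application can be restricted to the rank that owns its node. It is assumed
to run inside ``main()`` after ``MpiInterface::Enable``; the helper names, prefix, and frequency
value are illustrative and not a verbatim excerpt from the example:

.. code-block:: c++

    // Assumed headers: ns3/core-module.h, ns3/point-to-point-module.h,
    // ns3/mpi-interface.h, ns3/ndnSIM-module.h (everything is in namespace ns3)

    // Both ranks execute this code and therefore create both nodes;
    // the constructor argument is the system id (MPI rank) that owns the node
    Ptr<Node> node1 = CreateObject<Node>(0); // assigned to rank 0
    Ptr<Node> node2 = CreateObject<Node>(1); // assigned to rank 1

    // Point-to-point link between the two nodes; such links are the only
    // places where the simulation may be divided between ranks
    PointToPointHelper p2p;
    p2p.Install(node1, node2);

    // The NDN stack is installed on every node on every rank
    ndn::StackHelper ndnHelper;
    ndnHelper.InstallAll();

    // The traffic generator is installed only on the rank that owns node1
    if (MpiInterface::GetSystemId() == node1->GetSystemId()) {
      ndn::AppHelper consumerHelper("ns3::ndn::ConsumerCbr");
      consumerHelper.SetPrefix("/prefix");
      consumerHelper.SetAttribute("Frequency", StringValue("10"));
      consumerHelper.Install(node1);
    }

Installing the application inside the ``if`` is what keeps the workload per-rank, while the
nodes, links, and NDN stack are deliberately duplicated on every rank, as described above.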

For more information, you can take a look at the `NS-3 MPI documentation
<http://www.nsnam.org/docs/models/html/distributed.html#mpi-for-distributed-simulation>`_.

Compiling and running ndnSIM with MPI support
---------------------------------------------

- Install MPI

  On Ubuntu:

  .. code-block:: bash

      sudo apt-get install openmpi-bin openmpi-common openmpi-doc libopenmpi-dev

  On Fedora:

  .. code-block:: bash

      sudo yum install openmpi openmpi-devel

  On OS X with Homebrew:

  .. code-block:: bash

      brew install open-mpi

- Compile ndnSIM with MPI support

  You can compile ndnSIM with MPI support by running ``./waf configure`` with the
  ``--enable-mpi`` parameter, along with any other parameters of your preference. For example,
  to configure with examples and MPI support in optimized mode:

  .. code-block:: bash

      cd <ns-3-folder>
      ./waf configure -d optimized --enable-examples --enable-mpi

- Run ndnSIM with MPI support

  To run a simulation scenario using MPI, you need to type:

  .. code-block:: bash

      mpirun -np <number_of_processors> ./waf --run=<scenario_name>

.. _simple scenario with MPI support:

Simple parallel scenario using MPI
----------------------------------

This scenario simulates a network topology consisting of two nodes in parallel. Each node
is assigned to a dedicated logical processor.

The default parallel synchronization strategy implemented in the DistributedSimulatorImpl
class is based on a globally synchronized algorithm using an MPI collective operation to
synchronize simulation time across all LPs. A second synchronization strategy, based on local
communication and null messages, is implemented in the NullMessageSimulatorImpl class. For
the null message strategy the global all-to-all gather is not required; LPs only need to
communicate with LPs that share point-to-point links. The algorithm to use is controlled by
the ns-3 global value ``SimulatorImplementationType``.

The strategy can be selected according to the value of ``nullmsg``. If ``nullmsg`` is true,
then the local communication strategy is selected. If ``nullmsg`` is false, then the globally
synchronized strategy is selected. This parameter can be passed either as a command-line
argument or by directly modifying the simulation scenario.
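
As a sketch of how such a ``nullmsg`` flag can drive the choice (the variable and flag names
are illustrative; the two implementation type strings are the ones documented by NS-3):

.. code-block:: c++

    bool nullmsg = false; // default: globally synchronized strategy

    CommandLine cmd;
    cmd.AddValue("nullmsg", "Enable the use of null-message synchronization", nullmsg);
    cmd.Parse(argc, argv);

    // The simulator implementation is selected through an ns-3 global value
    if (nullmsg) {
      GlobalValue::Bind("SimulatorImplementationType",
                        StringValue("ns3::NullMessageSimulatorImpl"));
    }
    else {
      GlobalValue::Bind("SimulatorImplementationType",
                        StringValue("ns3::DistributedSimulatorImpl"));
    }

    // Enable MPI after the implementation type has been chosen
    MpiInterface::Enable(&argc, &argv);

With this in place, the same compiled scenario can be switched between the two strategies at
run time, without recompiling.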

The best algorithm to use depends on the communication and event scheduling pattern of the
application. In general, null message synchronization algorithms will scale better, since local
communication scales better than the global all-to-all gather required by
DistributedSimulatorImpl. There are two known cases where the global synchronization performs
better. The first is when most LPs have point-to-point links with most other LPs, in other
words the LPs are nearly fully connected. In this case the null message algorithm will
generate more message passing traffic than the all-to-all gather. A second case where the
global all-to-all gather is more efficient is when there are long periods of simulation time
when no events are occurring. The all-to-all gather algorithm is able to quickly determine
the next event time globally, whereas the nearest-neighbor behavior of the null message
algorithm requires more communication to propagate that knowledge, since each LP is only aware
of its neighbors' next event times.

The following code represents all that is necessary to run this simple parallel scenario:

.. literalinclude:: ../../examples/ndn-simple-mpi.cpp
    :language: c++
    :linenos:
    :lines: 22-35,71-
    :emphasize-lines: 41-44, 54-58, 78-79, 89-90

If this code is placed into ``scratch/ndn-simple-mpi.cpp`` or NS-3 is compiled with examples
enabled, you can compare runtime on one and two CPUs using the following commands::

    # 1 CPU
    mpirun -np 1 ./waf --run=ndn-simple-mpi

    # 2 CPUs
    mpirun -np 2 ./waf --run=ndn-simple-mpi

The following table summarizes 9 executions on OS X 10.10 with a 2.3 GHz Intel Core i7: on a
single CPU, on two CPUs with global synchronization, and on two CPUs with null message
synchronization:

+-------------+-----------------+------------------+----------------+
| # of CPUs   | Real time, s    | User time, s     | System time, s |
+=============+=================+==================+================+
| 1           | 20.9 +- 0.14    | 20.6 +- 0.13     | 0.2 +- 0.01    |
+-------------+-----------------+------------------+----------------+
| 2 (global)  | 11.1 +- 0.13    | 21.9 +- 0.24     | 0.2 +- 0.02    |
+-------------+-----------------+------------------+----------------+
| 2 (nullmsg) | 11.4 +- 0.12    | 22.4 +- 0.21     | 0.2 +- 0.02    |
+-------------+-----------------+------------------+----------------+

Note that MPI will not always result in simulation speedup and can actually result in
performance degradation. This means that either the network is not properly partitioned or the
simulation cannot take advantage of the partitioning (e.g., the simulation time is dominated by
the application on one node).