.. _example parallel execution (MPI):

Example: Parallel execution (MPI)
---------------------------------

|Nmag|'s numerical core (which is part of the |nsim| multi-physics
library) has been designed to carry out numerical computation on
several CPUs simultaneously. The protocol used for this is the
widespread Message Passing Interface (MPI). There are a number of MPI
implementations; the best known ones are probably MPICH1, MPICH2 and
LAM-MPI. Currently, we support `MPICH1 `_ and `MPICH2 `_.

**Which MPI version to use?**

Whether you want to use mpich1 or mpich2 will depend on your
installation: currently, the installation from source provides mpich2
(which is also used in the virtual machines), whereas the Debian
package relies on mpich1 (no Debian package is provided after release
0.1-6163).

.. _Using mpich2:

Using mpich2
~~~~~~~~~~~~

When using MPICH2, a `multi-purpose daemon` (``mpd``) must be started
before the actual simulation.

-----------------------------------

**The ``.mpd.conf`` file**

MPICH2 will look for a configuration file with the name ``.mpd.conf``
in the user's home directory. If this is missing, an attempt to start
the multi-purpose daemon will result in an error message like this::

  $> mpd
  configuration file /Users/fangohr/.mpd.conf not found
  A file named .mpd.conf file must be present in the user's home
  directory (/etc/mpd.conf if root) with read and write access only
  for the user, and must contain at least a line with:
  MPD_SECRETWORD=
  One way to safely create this file is to do the following:
    cd $HOME
    touch .mpd.conf
    chmod 600 .mpd.conf
  and then use an editor to insert a line like
    MPD_SECRETWORD=mr45-j9z
  into the file. (Of course use some other secret word than mr45-j9z.)

If you don't have this file in your home directory, just follow the
instructions above to create it with some secret word of your choice.
(Note that the above example is from a Mac OS X system: on Linux the
home directory is usually under ``/home/USERNAME`` rather than
``/Users/USERNAME`` as shown here.)

-------------------------------------

Let's assume we have a multi-core machine with more than one CPU.
This makes the MPI setup slightly easier, and is also likely to be
more efficient than running a job across the network between
different machines.

First, we need to start the multi-purpose daemon::

  $> mpd &

It will look for the file ``~/.mpd.conf`` as described above. If
found, it will start silently. Otherwise it will complain.

.. _Testing that nsim executes in parallel:

Testing that nsim executes in parallel
++++++++++++++++++++++++++++++++++++++

First, let's make sure that ``nsim`` is in the search path. The
command ``which nsim`` will return the location of the executable if
it can be found in the search path. For example::

  $> which nsim
  /home/fangohr/new/nmag-0.1/bin/nsim

To execute nsim using two processes, we can use the command::

  $> mpiexec -n 2 nsim
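To summarise, a condensed sketch of the whole MPICH2 setup on a
single multi-core machine might look like this (assuming a bash
shell; replace ``mr45-j9z`` with a secret word of your own, and
``bar30_30_100.py`` with the simulation script you want to run)::

  $> cd $HOME
  $> touch .mpd.conf
  $> chmod 600 .mpd.conf
  $> echo "MPD_SECRETWORD=mr45-j9z" >> .mpd.conf  # or add this line with an editor
  $> mpd &                                        # start the multi-purpose daemon
  $> which nsim                                   # check that nsim is in the path
  $> mpiexec -n 2 nsim bar30_30_100.py            # run the simulation on 2 CPUs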
There are two useful commands to check whether ``nsim`` is aware of
the intended MPI setup. The first one is ``ocaml.mpi_status()``,
which provides the total number of processes in the MPI set-up::

  $> mpiexec -n 2 nsim
  >>> ocaml.mpi_status()
  MPI-status: There are 2 nodes (this is the master, rank=0)
  >>>

The other command is ``ocaml.mpi_hello()``, which prints a short
'hello' from all processes::

  >>> ocaml.mpi_hello()
  >>> [Node 0/2] Hello from beta.kk.soton.ac.uk
  [Node 1/2] Hello from beta.kk.soton.ac.uk

For comparison, let's look at the output of these commands if we
start ``nsim`` *without* MPI, in which case only one MPI node is
reported::

  $> nsim
  >>> ocaml.mpi_status()
  MPI-status: There are 1 nodes (this is the master, rank=0)
  >>> ocaml.mpi_hello()
  [Node 0/1] Hello from beta.kk.soton.ac.uk

Assuming this all works, we can now start the actual simulation. To
use two CPUs on the local machine to run the ``bar30_30_100.py``
program, we can use::

  $> mpiexec -n 2 nsim bar30_30_100.py

To run the program again, using 4 CPUs on the local machine::

  $> mpiexec -n 4 nsim bar30_30_100.py

Note that mpich2 (and mpich1) will spawn more processes than there
are CPUs if necessary. For example, if you are working on a dual-core
machine (2 CPU cores in total) but request to run your program with 4
processes (via the ``-n 4`` switch given to ``mpiexec``), then you
will have 4 processes running on the 2 cores.

If you want to stop the ``mpd`` daemon, you can use::

  $> mpdallexit

For diagnostic purposes, the ``mpdtrace`` command can be used to
check whether a multi-purpose daemon is running (and which machines
are part of the `mpi-ring`).

**Advanced usage of mpich2**

To run a job across different machines, one needs to start the
multi-purpose daemons on the other machines with the ``mpdboot``
command. This will search for a file (in the current directory) with
the name ``mpd.hosts``, which should contain a list of the hosts that
should participate (very similar to the ``machinefile`` in MPICH1).

To trace which process is sending what messages to standard output,
one can add the ``-l`` switch to the ``mpiexec`` command: each line
of standard output will then be preceded by the rank of the process
that issued the message.

Please refer to the official `MPICH2 `__ documentation for further
details.

.. _Using mpich1:

Using mpich1
~~~~~~~~~~~~

Note: most users will use MPICH2 (if they have compiled Nmag from the
tar-ball); see :ref:`Using mpich2`.

Suppose we would like to run :ref:`example 2` of the manual with 2
processors using MPICH1. We need to know the full path to the
``nsim`` executable. In a ``bash`` environment (this is pretty much
the standard on Linux and Mac OS X nowadays), you can find the path
using the ``which`` command. On a system where nsim was installed
from the Debian package::

  $> which nsim
  /usr/bin/nsim

Let's assume we have a multi-core machine with more than one CPU.
This makes the MPI setup slightly easier, and is also likely to be
more efficient than running a job across the network between
different machines. In that case, we can run the example on 2 CPUs
using::

  $> mpirun -np 2 /usr/bin/nsim bar30_30_100.py

where ``-np`` is the command line argument for the Number of
Processors.
To check that the code is running on more than one CPU, look at one
of the first few log messages, which displays (in addition to the
runid of the simulation) the number of CPUs used::

  $> mpirun -np 2 `which nsim` bar30_30_100.py
  nmag:2008-05-20 12:50:01,177 setup.py 269 INFO Runid (=name simulation) is 'bar30_30_100', using 2 CPUs

To use 4 processors (if we have a quad-core machine available), we
would use::

  $> mpirun -np 4 /usr/bin/nsim bar30_30_100.py

Assuming that the ``nsim`` executable is in the path, and that we are
using a bash shell, we could shortcut the step of finding the
``nsim`` executable by writing::

  $> mpirun -np 4 `which nsim` bar30_30_100.py

To run the job across the network on different machines
simultaneously, we need to create a file with the names of the hosts
that should be used for the parallel execution of the program. If you
intend to use nmag on a cluster, your cluster administrator should
explain where to find this machine file.

.. comment:

  If you intend to run the program just on one machine but want to
  exploit the fact that this machine has more than one CPU (as most
  newer computers have typically one (or more) multi-core
  processors), then it is sufficient to write ``localhost`` into a
  file which we will call ``machines.txt``::

    $> echo localhost > machines.txt

To distribute a job across ``machine1.mydomain``,
``machine2.mydomain``, and ``machine3.mydomain``, we need to create
the file ``machines.txt`` with the content::

  machine1.mydomain
  machine2.mydomain
  machine3.mydomain

We then need to pass the name of this file to the ``mpirun`` command
to run an (MPI-enabled) executable with mpich::

  mpirun -machinefile machines.txt -np 3 /usr/bin/nsim bar30_30_100.py

For further details, please refer to the `MPICH1 `__ documentation.

.. _Visualising the partition of the mesh:

Visualising the partition of the mesh
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We use Metis to partition the mesh. Partitioning means allocating
certain mesh nodes to certain CPUs. Generally, it is good if nodes
that are spatially close to each other are assigned to the same CPU.

Here we demonstrate how the chosen partition can be visualised. As an
example, we use the :ref:`Example: demag field in uniformly
magnetised sphere `. We are :ref:`using mpich2`::

  $> mpd &
  $> mpiexec -l -n 3 nsim sphere1.py

The program starts, and prints the chosen partition to stdout::

  nfem.ocaml:2008-05-28 15:11:07,757 INFO Calling ParMETIS to partition the mesh among 3 processors
  nfem.ocaml:2008-05-28 15:11:07,765 INFO Processor 0: 177 nodes
  nfem.ocaml:2008-05-28 15:11:07,765 INFO Processor 1: 185 nodes
  nfem.ocaml:2008-05-28 15:11:07,766 INFO Processor 2: 178 nodes

If you can't find the information on the screen (=stdout), then have
a look in ``sphere1_log.log``, which contains a copy of the log
messages that have been printed to stdout.

If we save any fields spatially resolved (as with the
``sim.save_data(fields='all')`` command), then nmag will create a
file with the name (in this case) ``sphere1_dat.h5``. In addition to
the field data that is saved, it also stores the finite element mesh
*in the order that was used when the file was created*. In this
example, this is the mesh ordered according to the output from the
ParMETIS package. The first 177 nodes of the mesh in this order are
assigned to CPU 0, the next 185 are assigned to CPU 1, and the next
178 are assigned to CPU 2.
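To make this ordering explicit, here is a small Python sketch (not
part of nmag; the node counts are simply the ones reported by
ParMETIS above) that assigns a CPU rank to every node index. This is
essentially the information that the ``nmeshpp --partitioning``
command described below writes into a VTK file::

  # Illustrative sketch: map node indices (in the order stored in
  # sphere1_dat.h5) to the rank of the CPU they were assigned to.
  partition_sizes = [177, 185, 178]      # nodes on CPU 0, 1 and 2

  node_to_cpu = []
  for cpu, size in enumerate(partition_sizes):
      node_to_cpu.extend([cpu] * size)   # first 177 nodes -> 0, next 185 -> 1, ...

  print("%d nodes in total" % len(node_to_cpu))    # 540 nodes
  print("%d %d %d" % (node_to_cpu[0], node_to_cpu[177], node_to_cpu[362]))  # 0 1 2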
We can visualise this partition using the :ref:`nmeshpp` command
(which we apply here to the mesh that is saved in the
``sphere1_dat.h5`` file)::

  $> nmeshpp --partitioning=[177,185,178] sphere1_dat.h5 partitioning.vtk

The new file ``partitioning.vtk`` contains only one field on the
mesh, which assigns to each mesh node the id of the associated CPU.
We can visualise this, for example, using::

  $> mayavi -d partitioning.vtk -m SurfaceMap

.. image:: /example_mpi1/sphere3partitions.png
   :align: center
   :width: 800
   :height: 664

The figure shows that the sphere has been divided into three areas
which carry the values 0, 1 and 2 (corresponding to the MPI CPU rank,
which goes from 0 to 2 for 3 CPUs). Note that in this plot we can
only see the surface nodes (but the volume nodes have been
partitioned accordingly).

The process described here for visualising the partition is a bit
cumbersome. It could in principle be streamlined (so that we save the
partition data into the ``_dat.h5`` data file and can generate the
visualisation without further manual intervention). However, we
expect that this is not a showstopper and will dedicate our time to
more pressing issues. (User feedback and suggestions for improvements
are of course always welcome.)

.. _Performance:

Performance
~~~~~~~~~~~

Here is some data we have obtained on an IBM x440 system (with eight
1.9 GHz Intel Xeon processors). We use a test simulation (located in
``tests/devtests/nmag/hyst/hyst.par``) which computes a hysteresis
loop for a fairly small system (4114 mesh nodes, 1522 surface nodes,
BEM size 18 MB). We use overdamped time integration to determine the
meta-stable states. Neither the setup time nor the time required to
write data will become significantly faster when run on more than one
CPU.

We provide:

**total time**: this includes setup time, time for the main
simulation loop and time for writing data (measured in seconds)

**total speedup**: the speedup of the total execution time (i.e. the
ratio of the execution time on one CPU to the execution time on *n*
CPUs)

**sim time**: the time spent in the main simulation loop (this is
where we expect a speedup)

**sim speedup**: the speedup of the main simulation loop

=====  ==========  =============  ========  ===========
CPUs   total time  total speedup  sim time  sim speedup
=====  ==========  =============  ========  ===========
1      4165        1.00           3939      1.00
2      2249        1.85           2042      1.93
3      1867        2.23           1659      2.37
4      1605        2.60           1393      2.83
=====  ==========  =============  ========  ===========

The numbers shown here have been obtained using mpich2 (and using the
``ssm`` device instead of the default ``sock`` device: this is
available on Linux and resulted in a 5% reduction of execution time).

Generally, the (network) communication that is required between the
nodes slows down the computation. The smaller the system, the more
communication has to happen between the nodes (relative to the amount
of time spent on actual calculation). Thus, one expects a better
speedup for larger systems. The performance of the network is also
crucial: generally, we expect the best speedup on very fast networks
and on shared-memory systems (i.e. multi-CPU / multi-core
architectures). We further expect the speedup to become worse (in
comparison to the ideal linear speedup) with an increasing number of
processes.
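As a small worked example of how these figures relate to each other,
the following Python sketch (illustrative only; the timings are taken
from the "total time" column of the table above) recomputes the total
speedup and also the parallel efficiency, i.e. the speedup divided by
the number of CPUs::

  # Illustrative sketch: recompute speedup and parallel efficiency
  # from the "total time" column of the table above.
  cpus       = [1, 2, 3, 4]
  total_time = [4165.0, 2249.0, 1867.0, 1605.0]   # seconds

  for n, t in zip(cpus, total_time):
      speedup    = total_time[0] / t    # e.g. 4165/2249 = 1.85 for 2 CPUs
      efficiency = speedup / n          # fraction of the ideal linear speedup
      print("%d CPUs: speedup %.2f, efficiency %.2f" % (n, speedup, efficiency))

The decreasing efficiency (1.00, 0.93, 0.74, 0.65) illustrates the
point made above: the more processes we use, the further the measured
speedup falls behind the ideal linear speedup.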
.. _Restarting MPI runs:

Restarting MPI runs
-------------------

There is one situation that should be avoided when exploiting
parallel computation. Usually, a simulation (involving, for example,
a hysteresis loop) can be continued using the ``--restart`` switch.
This is also true for MPI runs. However, the number of CPUs used
*must not change* between the initial and any subsequent runs. (The
reason for this is that the ``_dat.h5`` file needs to store the mesh
as it has been reordered for *n* CPUs. If we continue the run with
another number of CPUs, the mesh data will not be correct anymore,
which will lead to errors when extracting the data from the
``_dat.h5`` file.)

Note also that there is currently no warning issued (in Nmag 0.1) if
a user ventures into such a simulation.

.. comment:

  See here for mpich2 intro:
  http://magnet.atp.tuwien.ac.at/scholz/magpar/doc/html/examples.html#parallel
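As a sketch (the script name ``my_hysteresis.py`` is a hypothetical
example), continuing a hysteresis computation that was started on 4
processes could look like this, with the same ``-n`` value in both
runs::

  $> mpiexec -n 4 nsim my_hysteresis.py            # initial run (interrupted)
  $> mpiexec -n 4 nsim my_hysteresis.py --restart  # continue with the same number of CPUs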