By J. David Lee on 2014-10-24.

This post is the third in a series that looks at writing a high performance C extension module for Python using the NumPy API. In this post we'll use OpenMP to parallelize the implementations developed in part 2.

`World` is a class that stores the state of N bodies. Our simulation will evolve that state over a series of time-steps.

```
import numpy as np


class World(object):
    """World is a structure that holds the state of N bodies and
    additional variables.

    threads : (int) The number of threads to use for multithreaded
              implementations.
    dt      : (float) The time-step.

    STATE OF THE WORLD:
    N : (int) The number of bodies in the simulation.
    m : (1D ndarray) The mass of each body.
    r : (2D ndarray) The position of each body.
    v : (2D ndarray) The velocity of each body.
    F : (2D ndarray) The force on each body.

    TEMPORARY VARIABLES:
    Ft : (3D ndarray) A 2D force array for each thread's local storage.
    s  : (2D ndarray) The vectors from one body to all others.
    s3 : (1D ndarray) The norm of each s vector.

    NOTE: Ft is used by parallel algorithms for thread-local
          storage. s and s3 are only used by the Python
          implementation.
    """
    def __init__(self, N, threads=1,
                 m_min=1, m_max=30.0, r_max=50.0, v_max=4.0, dt=1e-3):
        self.threads = threads
        self.N = N
        self.m = np.random.uniform(m_min, m_max, N)
        self.r = np.random.uniform(-r_max, r_max, (N, 2))
        self.v = np.random.uniform(-v_max, v_max, (N, 2))
        self.F = np.zeros_like(self.r)
        self.Ft = np.zeros((threads, N, 2))
        self.s = np.zeros_like(self.r)
        self.s3 = np.zeros_like(self.m)
        self.dt = dt
```

When the simulation begins, `N` bodies are randomly assigned a mass, `m`, position, `r`, and velocity, `v`. For each time-step, the following computations are made:

- The net force, `F`, on each body due to all other bodies is computed.
- The velocity, `v`, of each body is modified due to the force.
- The position, `r`, of each body is modified due to its velocity.
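In plain C with scalar doubles, the update portion of one time-step can be sketched as follows. The `step_world` name and the explicit loops are illustrative only; the real implementations operate on the `World` arrays described above, with the gravitational constant taken as 1.

```c
#include <stddef.h>

/* A minimal sketch of one time-step, assuming the forces F have
   already been computed for this step. step_world is a hypothetical
   helper; the real code updates the World's NumPy arrays in place. */
static void step_world(size_t N, double dt, const double *m,
                       double r[][2], double v[][2], double F[][2]) {
    for (size_t i = 0; i < N; ++i) {
        /* v += (F / m) * dt : acceleration times the time-step. */
        v[i][0] += F[i][0] / m[i] * dt;
        v[i][1] += F[i][1] / m[i] * dt;
        /* r += v * dt : move each body along its updated velocity. */
        r[i][0] += v[i][0] * dt;
        r[i][1] += v[i][1] * dt;
    }
}
```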

Below is the function `compute_F` from the implementation developed in the previous post (full source here). This function computes the mutual forces between each pair of bodies in the simulation, and has O(N^2) complexity.

```
static inline void compute_F(npy_int64 N,
                             npy_float64 *m,
                             __m128d *r,
                             __m128d *F) {
    npy_int64 i, j;
    __m128d s, s2, tmp;
    npy_float64 s3;

    // Set all forces to zero.
    for(i = 0; i < N; ++i) {
        F[i] = _mm_set1_pd(0);
    }

    // Compute forces between pairs of bodies.
    for(i = 0; i < N; ++i) {
        for(j = i + 1; j < N; ++j) {
            s = r[j] - r[i];
            s2 = s * s;
            s3 = sqrt(s2[0] + s2[1]);
            s3 *= s3 * s3;
            tmp = s * m[i] * m[j] / s3;
            F[i] += tmp;
            F[j] -= tmp;
        }
    }
}
```

Using this serial implementation, our code is able to step through 26427 time-steps per second on an Intel i5 desktop.

Below is a re-implementation of `compute_F` using OpenMP to parallelize the loops. The full source file, `src/simdomp1.c`, is available on github.

```
static inline void compute_F(npy_int64 threads,
                             npy_int64 N,
                             npy_float64 *m,
                             __m128d *r,
                             __m128d *F,
                             __m128d *Ft) {
    npy_int64 id, i, j, Nid;
    __m128d s, s2, tmp;
    npy_float64 s3;

    #pragma omp parallel private(id, i, j, s, s2, s3, tmp, Nid)
    {
        id = omp_get_thread_num();
        Nid = N * id; // Zero-index in thread-local array Ft.

        // Zero out the thread-local force arrays.
        for(i = 0; i < N; i++) {
            Ft[Nid + i] = _mm_set1_pd(0);
        }

        // Compute forces between pairs of bodies.
        #pragma omp for schedule(dynamic, 8)
        for(i = 0; i < N; ++i) {
            F[i] = _mm_set1_pd(0);
            for(j = i + 1; j < N; ++j) {
                s = r[j] - r[i];
                s2 = s * s;
                s3 = sqrt(s2[0] + s2[1]);
                s3 *= s3 * s3;
                tmp = s * m[i] * m[j] / s3;
                Ft[Nid + i] += tmp;
                Ft[Nid + j] -= tmp;
            }
        }

        // Sum the thread-local forces computed above.
        #pragma omp for
        for(i = 0; i < N; ++i) {
            for(id = 0; id < threads; ++id) {
                F[i] += Ft[N*id + i];
            }
        }
    }
}
```

The most noticeable difference between the parallel and serial versions is the appearance of the `Ft` array. The `Ft` array provides thread-local storage for the forces computed by each concurrently running thread. These local arrays are then summed in a separate loop to produce the array of forces, `F`.

Local storage is necessary to avoid a race condition. Imagine, for example, that one thread is executing with `i=0` and `j=1`, while another is using `i=1` and `j=2`. In our algorithm, they may both attempt to modify `F[1]`, causing a race.

You can read about race conditions on Wikipedia.

The OpenMP API consists of functions, such as `omp_get_thread_num`, and compiler directives, such as `#pragma omp parallel`.

In the code above we create a team of threads with the `#pragma omp parallel` directive. The `private` clause lists variables that are private to each thread. All other variables are shared.

The `#pragma omp for` directive splits the for-loop across the threads, with each thread processing different indices. We use the `schedule(dynamic, 8)` clause to tell the runtime to use the dynamic scheduler with a chunk size of 8. The dynamic scheduler is useful when the time spent in each loop iteration can vary, as it does in this case.
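The variation here comes from the pair loop's triangular shape: iteration `i` runs the inner loop over `N - i - 1` values of `j`, so early iterations do far more work than late ones. The illustrative `triangular_work` function below (not from the post; compile with `-fopenmp`) shows the same shape: with `schedule(dynamic, 8)`, threads grab chunks of 8 iterations as they finish, while a default static split would leave the thread holding the low-`i` chunk doing most of the work.

```c
/* Triangular workload like the pair loop above: iteration i performs
   N - i - 1 units of work. The reduction clause gives each thread a
   private copy of total and sums them at the end of the loop. */
static long triangular_work(long N) {
    long total = 0;
    #pragma omp parallel for schedule(dynamic, 8) reduction(+:total)
    for (long i = 0; i < N; ++i) {
        for (long j = i + 1; j < N; ++j) {
            total += 1; /* stand-in for one pairwise force computation */
        }
    }
    return total; /* N * (N - 1) / 2 pair interactions in all */
}
```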

Note that there is an implicit barrier at the end of each parallel for-loop. All threads must complete the loop before any can advance.

See the OpenMP website for more information on the API.

Because we've added an extra for-loop to sum the thread-local force arrays, we can expect the single-core performance to suffer slightly. In my tests the OpenMP version runs about 3% slower than the version without OpenMP when using a single thread.

With four cores and four threads on an Intel i5 desktop, the OpenMP implementation using SIMD instructions steps through 80342 time-steps per second. That is roughly 313 times faster than our original implementation in Python using NumPy, and about 3 times faster than our fastest serial implementation.

The plot above shows how the OpenMP code scales with the number of threads on a quad-core i5 desktop. The points labeled "OMP+SIMD" correspond to the implementation described in this post. A similar version, available here, without explicit vector instructions is labeled "OMP". The original version using Python and NumPy is shown for comparison (lower left).

The performance scales well up to the number of physical cores, and drops off sharply as additional threads are added. Note that this isn't true in general, so it's always a good idea to benchmark your code with different numbers of threads.

There are a variety of environment variables that can change the behavior of programs using OpenMP. These can have a substantial effect on performance, particularly on machines with multiple physical processors.

See the GNU libgomp documentation for details.

Here are the benchmarks for all of the implementations we created in this series of posts. The benchmarks were run on a quad-core Intel i5 desktop under Debian Wheezy using gcc 4.7.2.

| Implementation   | Steps / second | Speed-up |
| ---------------- | -------------- | -------- |
| Python+NumPy     | 257            | 1        |
| Simple C 1       | 17974          | 70       |
| Simple C 2       | 26136          | 102      |
| SIMD             | 26423          | 103      |
| OMP 4 cores      | 76657          | 298      |
| OMP+SIMD 4 cores | 80342          | 313      |

I ran the same benchmarks on a DigitalOcean 20-core instance, their largest. The best performance was about a factor of 3 worse than when running with four threads on the quad-core i5 desktop above. Whether the difference is due to virtualization or shared cores, I don't know.

RunAbove had a promotion for a free month of time on their IBM Power 8 architecture. Just for fun I ran the same benchmarks on their 176-thread instance. I'd never used the PowerPC architecture before, and was happy that the code ran without modification.

Matthew Honnibal submitted an implementation in Cython, available here. This implementation appears to have performance similar to our initial C implementation from part 1.

When performance is critical, a C extension can deliver a drastic speed-up. Using SIMD instructions can bring further gains and simplify coding, but at the cost of portability. OpenMP support can be added to existing code without much work, and can improve performance on multi-core systems.

If you have any questions, comments, suggestions, or corrections, please let me know via the contact link.