By J. David Lee on 2014-10-21.

This post is the second in a series looking at the process of writing
C extension modules for Python using the NumPy API. In part 1 we built
a simple N-body simulation and found that the bottleneck was computing
the mutual forces between bodies, an O(N²) operation. By
implementing the time-evolution function in C, we were able to speed
up the computation by a factor of about 70.

If you haven't read the first post, you should probably skim it before moving on.

In this post we'll trade some generality in our code for improved performance.

`World` is a class that stores the state of N bodies. Our simulation
will evolve that state over a series of time-steps.

```
class World(object):
    """World is a structure that holds the state of N bodies and
    additional variables.

    threads : (int) The number of threads to use for multithreaded
              implementations.
    dt      : (float) The time-step.

    STATE OF THE WORLD:
    N  : (int) The number of bodies in the simulation.
    m  : (1D ndarray) The mass of each body.
    r  : (2D ndarray) The position of each body.
    v  : (2D ndarray) The velocity of each body.
    F  : (2D ndarray) The force on each body.

    TEMPORARY VARIABLES:
    Ft : (3D ndarray) A 2D force array for each thread's local storage.
    s  : (2D ndarray) The vectors from one body to all others.
    s3 : (1D ndarray) The norm of each s vector.

    NOTE: Ft is used by parallel algorithms for thread-local
          storage. s and s3 are only used by the Python
          implementation.
    """
    def __init__(self, N, threads=1,
                 m_min=1, m_max=30.0, r_max=50.0, v_max=4.0, dt=1e-3):
        self.threads = threads
        self.N = N
        self.m = np.random.uniform(m_min, m_max, N)
        self.r = np.random.uniform(-r_max, r_max, (N, 2))
        self.v = np.random.uniform(-v_max, v_max, (N, 2))
        self.F = np.zeros_like(self.r)
        self.Ft = np.zeros((threads, N, 2))
        self.s = np.zeros_like(self.r)
        self.s3 = np.zeros_like(self.m)
        self.dt = dt
```

When the simulation begins, `N` bodies are randomly assigned a mass,
`m`, position, `r`, and velocity, `v`. For each time-step, the
following computations are made:

- The net force, `F`, on each body due to all other bodies is computed.
- The velocity, `v`, of each body is modified due to the force.
- The position, `r`, of each body is modified due to its velocity.

The extension module that we created in part 1 used macros to access the elements of NumPy arrays in C. Here's what one of those macros looks like:

```
#define r(x0, x1) (*(npy_float64*)((PyArray_DATA(py_r) +              \
                                    (x0) * PyArray_STRIDES(py_r)[0] + \
                                    (x1) * PyArray_STRIDES(py_r)[1])))
```

Using a macro like this, we can access the elements of the `py_r`
array using a simple notation like `r(i, j)`. The value indexed will
match what you'd see in Python, even if the array has been reshaped or
sliced in some way.

For general purpose code, that's what we'd want. In the case of our simulation, we know the pedigree of our arrays: they're contiguous and never sliced or reshaped. We can use this fact to both simplify and improve the performance of our code.

In this section we'll look at a modified version of our C extension
that does away with the accessor macros and works directly with the
NumPy array's underlying data. The code for this section,
`src/simple2.c`, is available on github.

For comparison, the implementation from the previous post is available here.

Starting at the bottom of the file, we can see that the `evolve`
function parses the Python arguments in the same manner as the
previous version, but now we're utilizing the `PyArray_DATA` macro to
obtain a pointer to the underlying memory. We cast this pointer to an
`npy_float64` pointer; `npy_float64` is an alias for `double`.

```
static PyObject *evolve(PyObject *self, PyObject *args) {
  // Declare variables.
  npy_int64 N, threads, steps, step, i, xi, yi;
  npy_float64 dt;
  PyArrayObject *py_m, *py_r, *py_v, *py_F;
  npy_float64 *m, *r, *v, *F;

  // Parse arguments.
  if (!PyArg_ParseTuple(args, "ldllO!O!O!O!",
                        &threads,
                        &dt,
                        &steps,
                        &N,
                        &PyArray_Type, &py_m,
                        &PyArray_Type, &py_r,
                        &PyArray_Type, &py_v,
                        &PyArray_Type, &py_F)) {
    return NULL;
  }

  // Get underlying arrays from numpy arrays.
  m = (npy_float64*)PyArray_DATA(py_m);
  r = (npy_float64*)PyArray_DATA(py_r);
  v = (npy_float64*)PyArray_DATA(py_v);
  F = (npy_float64*)PyArray_DATA(py_F);

  // Evolve the world.
  for(step = 0; step < steps; ++step) {
    compute_F(N, m, r, F);

    for(i = 0; i < N; ++i) {
      // Compute offsets into arrays.
      xi = 2 * i;
      yi = xi + 1;

      v[xi] += F[xi] * dt / m[i];
      v[yi] += F[yi] * dt / m[i];

      r[xi] += v[xi] * dt;
      r[yi] += v[yi] * dt;
    }
  }

  Py_RETURN_NONE;
}
```

The only change from our previous implementation using macros is that
we have to compute our array indices explicitly. We could have cast the
underlying arrays as two-dimensional `npy_float64` arrays, but that
would require some upfront cost and memory management.

The forces are computed in the same manner as in the previous
implementation, the only differences being the notation and the
explicit index calculations in the nested `for`-loops.

```
static inline void compute_F(npy_int64 N,
npy_float64 *m,
npy_float64 *r,
npy_float64 *F) {
npy_int64 i, j, xi, yi, xj, yj;
npy_float64 sx, sy, Fx, Fy, s3, tmp;
// Set all forces to zero.
for(i = 0; i < N; ++i) {
F[2*i] = F[2*i + 1] = 0;
}
// Compute forces between pairs of bodies.
for(i = 0; i < N; ++i) {
xi = 2*i;
yi = xi + 1;
for(j = i + 1; j < N; ++j) {
xj = 2*j;
yj = xj + 1;
sx = r[xj] - r[xi];
sy = r[yj] - r[yi];
s3 = sqrt(sx*sx + sy*sy);
s3 *= s3*s3;
tmp = m[i] * m[j] / s3;
Fx = tmp * sx;
Fy = tmp * sy;
F[xi] += Fx;
F[yi] += Fy;
F[xj] -= Fx;
F[yj] -= Fy;
}
}
}
```

I was surprised to find that, compared to our original C extension, this implementation is 45% faster, performing 26108 time-steps per second versus the previous version's 17972 on the same Intel i5 desktop. This represents a factor of 101 improvement over the original Python and NumPy implementation.

In the implementation above, we calculated vector components in the
`x` and `y` directions explicitly. If we're willing to give up some
portability, we can use x86 SIMD (Single Instruction, Multiple
Data) instructions to simplify our code, and hopefully speed it up as
well.

The code for this section, `src/simd1.c`, is available on github.

In the code below, we'll be utilizing the vector data type `__m128d`.
The number 128 refers to the number of bits in the vector, while the
letter "d" indicates the data type of the vector's components. In this
case the components are of type `double` (64-bit floating point). Other
vector widths and types are available depending on the architecture.

A variety of intrinsic functions are available for working with vector data types. Intel has a handy reference available on their site.

I'm working on Debian Wheezy using the GNU gcc compiler. In this
environment the header file defining the available intrinsic data
types and functions is `x86intrin.h`. This may not be portable across
platforms.

The `evolve` function shown here is more compact than the version
above. The two-dimensional arrays are cast to `__m128d` pointers, and
the use of vectors obviates the need for computing vector components
explicitly.

```
static PyObject *evolve(PyObject *self, PyObject *args) {
  // Variable declarations.
  npy_int64 N, threads, steps, step, i;
  npy_float64 dt;
  PyArrayObject *py_m, *py_r, *py_v, *py_F;
  npy_float64 *m;
  __m128d *r, *v, *F;

  // Parse arguments.
  if (!PyArg_ParseTuple(args, "ldllO!O!O!O!",
                        &threads,
                        &dt,
                        &steps,
                        &N,
                        &PyArray_Type, &py_m,
                        &PyArray_Type, &py_r,
                        &PyArray_Type, &py_v,
                        &PyArray_Type, &py_F)) {
    return NULL;
  }

  // Get underlying arrays from numpy arrays.
  m = (npy_float64*)PyArray_DATA(py_m);
  r = (__m128d*)PyArray_DATA(py_r);
  v = (__m128d*)PyArray_DATA(py_v);
  F = (__m128d*)PyArray_DATA(py_F);

  // Evolve the world.
  for(step = 0; step < steps; ++step) {
    compute_F(N, m, r, F);

    for(i = 0; i < N; ++i) {
      v[i] += F[i] * dt / m[i];
      r[i] += v[i] * dt;
    }
  }

  Py_RETURN_NONE;
}
```

The force computation is also more compact than in the previous implementation.

```
static inline void compute_F(npy_int64 N,
                             npy_float64 *m,
                             __m128d *r,
                             __m128d *F) {
  npy_int64 i, j;
  __m128d s, s2, tmp;
  npy_float64 s3;

  // Set all forces to zero.
  for(i = 0; i < N; ++i) {
    F[i] = _mm_set1_pd(0);
  }

  // Compute forces between pairs of bodies.
  for(i = 0; i < N; ++i) {
    for(j = i + 1; j < N; ++j) {
      s = r[j] - r[i];

      s2 = s * s;
      s3 = sqrt(s2[0] + s2[1]);
      s3 *= s3 * s3;

      tmp = s * m[i] * m[j] / s3;
      F[i] += tmp;
      F[j] -= tmp;
    }
  }
}
```

While this explicitly vectorized code is clearer and more compact than the previous version, it only runs about 1% faster, performing 26427 time-steps per second. It's likely that the compiler was able to optimize the previous implementation using the same vector instructions.

If full generality isn't necessary, performance gains can often be achieved by accessing NumPy arrays' underlying memory directly in C.

While the explicit use of vector intrinsics didn't provide much benefit in this example, the relative performance difference will increase in a multi-core setting using OpenMP.

In the next part of this series, we'll parallelize both of the implementations shown here using OpenMP, and see how performance scales using multiple cores.

If you have any questions, comments, suggestions, or corrections, please let me know via the contact link.