By J. David Lee on 2014-10-19.

It's commonly said that Python isn't a high performance language, and in some sense that's true. However, due to the ecosystem of numeric and scientific packages that has grown around NumPy, reasonable performance can generally be achieved without too much difficulty.

When performance is an issue, run-time is generally dominated by a small number of functions. By rewriting these functions in C, performance can often be drastically improved.

In the first part of this series we'll look at how to write a Python extension in C using NumPy's C API in order to improve the performance of a toy simulation. In future posts we'll build on our solution here to further improve its performance.

The files referenced in this post are available on github.

As a starting point for this exercise, we'll consider a two-dimensional N-body simulation for N bodies under the influence of a gravity-like force.

What follows is a class that will be used to store the state of our world, as well as some temporary variables.

```
# lib/sim.py
import numpy as np


class World(object):
    """World is a structure that holds the state of N bodies and
    additional variables.

    threads : (int) The number of threads to use for multithreaded
              implementations.
    dt      : (float) The time-step.

    STATE OF THE WORLD:
    N : (int) The number of bodies in the simulation.
    m : (1D ndarray) The mass of each body.
    r : (2D ndarray) The position of each body.
    v : (2D ndarray) The velocity of each body.
    F : (2D ndarray) The force on each body.

    TEMPORARY VARIABLES:
    Ft : (3D ndarray) A 2D force array for each thread's local storage.
    s  : (2D ndarray) The vectors from one body to all others.
    s3 : (1D ndarray) The cubed norm of each s vector.

    NOTE: Ft is used by parallel algorithms for thread-local
          storage. s and s3 are only used by the Python
          implementation.
    """
    def __init__(self, N, threads=1,
                 m_min=1, m_max=30.0, r_max=50.0, v_max=4.0, dt=1e-3):
        self.threads = threads
        self.N = N
        self.m = np.random.uniform(m_min, m_max, N)
        self.r = np.random.uniform(-r_max, r_max, (N, 2))
        self.v = np.random.uniform(-v_max, v_max, (N, 2))
        self.F = np.zeros_like(self.r)
        self.Ft = np.zeros((threads, N, 2))
        self.s = np.zeros_like(self.r)
        self.s3 = np.zeros_like(self.m)
        self.dt = dt
```

When the simulation begins, `N` bodies are randomly assigned a mass, `m`, position, `r`, and velocity, `v`. For each time-step, the following computations are made:

- The net force, `F`, on each body due to all other bodies is computed.
- The velocity, `v`, of each body is modified due to the force.
- The position, `r`, of each body is modified due to its velocity.

The first step, computing `F`, will be our bottleneck. The force on a single body is the sum of the forces due to every other body in the world, which leads to O(N^{2}) complexity. The `v` and `r` updates are both O(N).

If you're interested, this Wikipedia article introduces some approximations that can speed up the computation of forces.

An implementation of the time evolution function in pure Python, using NumPy arrays, provides a starting point for optimization, and a reference against which to test other implementations.

```
# lib/sim.py
def compute_F(w):
    """Compute the force on each body in the world, w."""
    for i in xrange(w.N):
        w.s[:] = w.r - w.r[i]
        w.s3[:] = (w.s[:,0]**2 + w.s[:,1]**2)**1.5
        w.s3[i] = 1.0 # This makes the self-force zero.
        w.F[i] = (w.m[i] * w.m[:,None] * w.s / w.s3[:,None]).sum(0)


def evolve(w, steps):
    """Evolve the world, w, through the given number of steps."""
    for _ in xrange(steps):
        compute_F(w)
        w.v += w.F * w.dt / w.m[:,None]
        w.r += w.v * w.dt
```

The O(N^{2}) nature of the net force calculation is masked by
NumPy's array notation. Each array operation performs a loop over the
array elements.

Here's a plot of the paths of seven bodies evolving from a random initial state:

In order to benchmark this implementation, we create a script, `bench.py`, in the project directory containing the following:

```
import lib
w = lib.World(101)
lib.evolve(w, 4096)
```

We benchmark the script using the `cProfile` module.

```
python -m cProfile -scum bench.py
```

The first few lines show us that `compute_F` is indeed our bottleneck, accounting for over 99% of the run-time.

```
         428710 function calls (428521 primitive calls) in 16.836 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   16.837   16.837 bench.py:2(<module>)
        1    0.062    0.062   16.756   16.756 sim.py:60(evolve)
     4096   15.551    0.004   16.693    0.004 sim.py:51(compute_F)
   413696    1.142    0.000    1.142    0.000 {method 'sum' ...
        3    0.002    0.001    0.115    0.038 __init__.py:1(<module>)
   ...
```
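Alongside `cProfile`, a small timing helper can report the steps-per-second figure quoted below directly. This is a sketch; `steps_per_second` and its arguments are illustrative names, not part of the library:

```
import time

def steps_per_second(evolve, world, steps=512):
    """Time evolve(world, steps) once; return time-steps per wall-clock second."""
    t0 = time.time()
    evolve(world, steps)
    return steps / (time.time() - t0)
```

Called as `steps_per_second(lib.evolve, lib.World(101))`, this produces the throughput numbers used for comparison in this post.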

With 101 bodies on an Intel i5 desktop, this implementation is able to evolve the world through 257 time-steps per second.

In this section we'll look at a C extension module implementing the `evolve` function. It may be helpful to have a copy of the C file while going through this section. The file, `src/simple1.c`, is available on github.

For additional documentation on NumPy's C API, see the NumPy Reference. Python's C API is documented in detail here.

The first thing in the file is a forward declaration of our evolve function. This will be used directly below in the method list.

```
static PyObject *evolve(PyObject *self, PyObject *args);
```

The next thing is the method list.

```
static PyMethodDef methods[] = {
  { "evolve", evolve, METH_VARARGS, "Doc string."},
  { NULL, NULL, 0, NULL } /* Sentinel */
};
```

This is a list of exported methods for the extension module. Here we only have one method, named `evolve`.

The last bit of boilerplate is the module initialization.

```
PyMODINIT_FUNC initsimple1(void) {
  (void) Py_InitModule("simple1", methods);
  import_array();
}
```

Note that the name in `initsimple1` has to match the first argument to `Py_InitModule` as shown here. The `import_array` call is necessary for any extension using the NumPy API.
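The post doesn't show the build step. A minimal distutils setup script for this extension might look like the following; the module and source names match `src/simple1.c` above, but the script itself is a sketch, not code from the repository:

```
# setup.py -- build with: python setup.py build_ext --inplace
from distutils.core import setup, Extension
import numpy

setup(name="simple1",
      ext_modules=[Extension("simple1", ["src/simple1.c"],
                             # NumPy's headers are needed for import_array()
                             # and the PyArray_* macros.
                             include_dirs=[numpy.get_include()])])
```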

The array access macros can be used to correctly index into an array regardless of how it's been reshaped or sliced. These macros also make the code following them much more readable.

```
#define m(x0) (*(npy_float64*)((PyArray_DATA(py_m) +                  \
                                (x0) * PyArray_STRIDES(py_m)[0])))
#define m_shape(i) (py_m->dimensions[(i)])

#define r(x0, x1) (*(npy_float64*)((PyArray_DATA(py_r) +              \
                                    (x0) * PyArray_STRIDES(py_r)[0] + \
                                    (x1) * PyArray_STRIDES(py_r)[1])))
#define r_shape(i) (py_r->dimensions[(i)])
```

Here we see accessor macros for one- and two-dimensional arrays. Arrays with higher dimensionality can be accessed in a similar fashion.

With the help of these macros, we can loop through `r` using the following code:

```
for(i = 0; i < r_shape(0); ++i) {
  for(j = 0; j < r_shape(1); ++j) {
    r(i, j) = 0; // Zero all elements.
  }
}
```

The macros defined above will only work if the matching NumPy array objects are defined with the correct names. In the code above, the arrays are named `py_m` and `py_r`. In order to use the same macros across different functions, the names of the NumPy arrays need to be kept consistent.
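The stride arithmetic in these macros can be mimicked from Python, which also shows why they work on sliced or reshaped arrays. This sketch uses `ctypes` to read an element through the raw data pointer, exactly as the `r(x0, x1)` macro does:

```
import ctypes
import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)
v = a[:, ::2]  # a non-contiguous view; its strides differ from a's

i, j = 2, 1
# Byte offset from the view's data pointer, as in the r() macro.
addr = v.ctypes.data + i * v.strides[0] + j * v.strides[1]
value = ctypes.c_double.from_address(addr).value

assert value == v[i, j]  # strided access matches normal indexing
```

Because the macros index through `PyArray_STRIDES` rather than assuming a contiguous layout, the C code sees exactly the elements Python does.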

The function that computes the force array looks fairly verbose, especially when compared to the five lines of Python above.

```
static inline void compute_F(npy_int64 N,
                             PyArrayObject *py_m,
                             PyArrayObject *py_r,
                             PyArrayObject *py_F) {
  npy_int64 i, j;
  npy_float64 sx, sy, Fx, Fy, s3, tmp;

  // Set all forces to zero.
  for(i = 0; i < N; ++i) {
    F(i, 0) = F(i, 1) = 0;
  }

  // Compute forces between pairs of bodies.
  for(i = 0; i < N; ++i) {
    for(j = i + 1; j < N; ++j) {
      sx = r(j, 0) - r(i, 0);
      sy = r(j, 1) - r(i, 1);

      s3 = sqrt(sx*sx + sy*sy);
      s3 *= s3 * s3;

      tmp = m(i) * m(j) / s3;
      Fx = tmp * sx;
      Fy = tmp * sy;

      F(i, 0) += Fx;
      F(i, 1) += Fy;

      F(j, 0) -= Fx;
      F(j, 1) -= Fy;
    }
  }
}
```

Notice that we've reduced the range of the inner loop by using Newton's third law: forces occur in pairs of equal magnitude and opposing direction. Unfortunately the computation is still O(N^{2}).
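The pairwise trick can be checked against the straightforward all-pairs sum used in the Python implementation. This is a verification sketch; the function names are mine, not from the library:

```
import numpy as np

def forces_all_pairs(m, r):
    """Sum the force on each body over all other bodies, as compute_F does."""
    N = len(m)
    F = np.zeros_like(r)
    for i in range(N):
        for j in range(N):
            if i != j:
                s = r[j] - r[i]
                F[i] += m[i] * m[j] * s / np.dot(s, s) ** 1.5
    return F

def forces_newton(m, r):
    """Visit each pair once, applying equal and opposite forces."""
    N = len(m)
    F = np.zeros_like(r)
    for i in range(N):
        for j in range(i + 1, N):
            s = r[j] - r[i]
            f = m[i] * m[j] * s / np.dot(s, s) ** 1.5
            F[i] += f
            F[j] -= f
    return F
```

With random masses and positions the two agree to floating-point accuracy, while `forces_newton` evaluates only half as many pair interactions.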

The final function in the file is the exported `evolve` function.

```
static PyObject *evolve(PyObject *self, PyObject *args) {
  // Declare variables.
  npy_int64 N, threads, steps, step, i;
  npy_float64 dt;
  PyArrayObject *py_m, *py_r, *py_v, *py_F;

  // Parse arguments.
  if (!PyArg_ParseTuple(args, "ldllO!O!O!O!",
                        &threads,
                        &dt,
                        &steps,
                        &N,
                        &PyArray_Type, &py_m,
                        &PyArray_Type, &py_r,
                        &PyArray_Type, &py_v,
                        &PyArray_Type, &py_F)) {
    return NULL;
  }

  // Evolve the world.
  for(step = 0; step < steps; ++step) {
    compute_F(N, py_m, py_r, py_F);

    for(i = 0; i < N; ++i) {
      v(i, 0) += F(i, 0) * dt / m(i);
      v(i, 1) += F(i, 1) * dt / m(i);

      r(i, 0) += v(i, 0) * dt;
      r(i, 1) += v(i, 1) * dt;
    }
  }

  Py_RETURN_NONE;
}
```

Here we see how the Python arguments are parsed. In the time-step loop at the bottom of the function, we see the explicit computation of the x and y components of the velocity and position vectors.

It shouldn't be surprising that the C version of the `evolve` function is faster than the Python version. On the same i5 desktop mentioned above, the C implementation of the `evolve` function is able to step through 17972 time-steps per second: a factor of 70 improvement over the Python implementation.

Notice that the C code has been kept as simple as possible. Input arguments and output arrays can be type checked and allocated in a Python wrapper function. Keeping allocation out of the C code not only speeds up processing, but also removes the potential for memory leaks (or worse) resulting from incorrect reference counts on Python objects.
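A wrapper along those lines might look like the following. The argument order matches the `"ldllO!O!O!O!"` format string above, but the wrapper itself (and the `as_f64` helper) is a sketch, not code from the repository:

```
import numpy as np

def as_f64(a, name):
    """Fail loudly if an array isn't float64: the C macros assume it is."""
    if not isinstance(a, np.ndarray) or a.dtype != np.float64:
        raise TypeError("%s must be a float64 ndarray" % name)
    return a

def evolve(w, steps):
    """Type-check the World's arrays, then hand everything to the C module."""
    import simple1  # the compiled extension
    simple1.evolve(w.threads, w.dt, steps, w.N,
                   as_f64(w.m, "m"), as_f64(w.r, "r"),
                   as_f64(w.v, "v"), as_f64(w.F, "F"))
```

Checking dtypes up front turns a silent memory-corruption bug into a clear Python exception, since the accessor macros blindly interpret the array's buffer as `npy_float64`.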

In the next part of this series, we'll improve on the performance of this implementation by taking advantage of C-contiguous NumPy arrays. After that we'll look at using Intel's SIMD instructions and OpenMP to push the envelope further.

If you have any questions, comments, suggestions, or corrections, please let me know via the contact link.