
High Performance Python Extensions: Part 1

By J. David Lee on 2014-10-19.


It's commonly said that Python isn't a high performance language, and in some sense that's true. However, due to the ecosystem of numeric and scientific packages that has grown around NumPy, reasonable performance can generally be achieved without too much difficulty.

When performance is an issue, run-time is generally dominated by a small number of functions. By rewriting these functions in C, performance can often be drastically improved.

In the first part of this series we'll look at how to write a Python extension in C using NumPy's C API in order to improve the performance of a toy simulation. In future posts we'll build on our solution here to further improve its performance.


The files referenced in this post are available on github.

The Simulation

As a starting point for this exercise, we'll consider a two-dimensional N-body simulation for N bodies under the influence of a gravity-like force.

What follows is a class that will be used to store the state of our world, as well as some temporary variables.

# lib/sim.py

import numpy as np

class World(object):
    """World is a structure that holds the state of N bodies and
    additional variables.

    threads : (int) The number of threads to use for multithreaded
              implementations.
    dt      : (float) The time-step. 


    N : (int) The number of bodies in the simulation.
    m : (1D ndarray) The mass of each body.
    r : (2D ndarray) The position of each body.
    v : (2D ndarray) The velocity of each body.
    F : (2D ndarray) The force on each body.


    Ft : (3D ndarray) A 2D force array for each thread's local storage.
    s  : (2D ndarray) The vectors from one body to all others. 
    s3 : (1D ndarray) The norm of each s vector. 

    NOTE: Ft is used by parallel algorithms for thread-local
          storage. s and s3 are only used by the Python
          implementation.
    """
    def __init__(self, N, threads=1,
                 m_min=1.0, m_max=30.0, r_max=50.0, v_max=4.0, dt=1e-3):
        self.threads = threads
        self.N  = N
        self.m  = np.random.uniform(m_min, m_max, N)
        self.r  = np.random.uniform(-r_max, r_max, (N, 2))
        self.v  = np.random.uniform(-v_max, v_max, (N, 2))
        self.F  = np.zeros_like(self.r)
        self.Ft = np.zeros((threads, N, 2))
        self.s  = np.zeros_like(self.r)
        self.s3 = np.zeros_like(self.m)
        self.dt = dt

When the simulation begins, N bodies are randomly assigned a mass, m, position, r, and velocity, v. For each time-step, the following computations are made:

  1. The net force, F, on each body due to all other bodies is computed.
  2. The velocity, v, of each body is modified due to the force.
  3. The position, r, of each body is modified due to its velocity.

The first step, computing F, will be our bottleneck. The force on a single body is the sum of forces due to every other body in the world. This leads to O(N²) complexity. The v and r updates are both O(N).

If you're interested, this Wikipedia article introduces some approximations that can speed up the computation of forces.

Pure Python

An implementation of the time evolution function in pure Python, using NumPy arrays, provides a starting point for optimization, and a reference against which to test other implementations.

# lib/sim.py

def compute_F(w):
    """Compute the force on each body in the world, w."""
    for i in xrange(w.N):
        w.s[:] = w.r - w.r[i]
        w.s3[:] = (w.s[:,0]**2 + w.s[:,1]**2)**1.5
        w.s3[i] = 1.0 # This makes the self-force zero.
        w.F[i] = (w.m[i] * w.m[:,None] * w.s / w.s3[:,None]).sum(0)

def evolve(w, steps):
    """Evolve the world, w, through the given number of steps."""
    for _ in xrange(steps):
        compute_F(w)
        w.v += w.F * w.dt / w.m[:,None]
        w.r += w.v * w.dt

The O(N²) nature of the net force calculation is masked by NumPy's array notation. Each array operation performs a loop over the array elements.
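To make those hidden loops visible, the first statement in compute_F can be unrolled by hand. This is a small sketch with made-up values; the index k is purely illustrative:

```python
import numpy as np

N = 5
i = 0
r = np.random.uniform(-50.0, 50.0, (N, 2))

# The vectorized statement from compute_F above...
s = r - r[i]

# ...is shorthand for an explicit O(N) loop over every body:
s_loop = np.zeros_like(r)
for k in range(N):
    s_loop[k, 0] = r[k, 0] - r[i, 0]
    s_loop[k, 1] = r[k, 1] - r[i, 1]

assert np.allclose(s, s_loop)
```

Since compute_F runs statements like this once per body, the total work is N such O(N) passes.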


Here's a plot of the paths of seven bodies evolving from a random initial state:

[Figure: trajectories of seven bodies evolving from a random initial state]


In order to benchmark this implementation, we create a script, bench.py, in the project directory containing the following:

import lib
w = lib.World(101)
lib.evolve(w, 4096)

We benchmark the script using the cProfile module.

python -m cProfile -scum bench.py

The first few lines show us that compute_F is indeed our bottleneck, accounting for over 99% of the run-time.

428710 function calls (428521 primitive calls) in 16.836 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   16.837   16.837 bench.py:2(<module>)
        1    0.062    0.062   16.756   16.756 sim.py:60(evolve)
     4096   15.551    0.004   16.693    0.004 sim.py:51(compute_F)
   413696    1.142    0.000    1.142    0.000 {method 'sum' ...
        3    0.002    0.001    0.115    0.038 __init__.py:1(<module>)

With 101 bodies on an Intel i5 desktop, this implementation is able to evolve the world through 257 time-steps per second.
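Profiling adds its own overhead, so a plain wall-clock measurement is useful for quoting steps-per-second figures like the one above. Here is a small helper for that; the function name and default step count are our own, not from the post:

```python
import time

def steps_per_second(evolve, world, steps=1024):
    """Wall-clock rate, in time-steps per second, of any
    evolve(world, steps) callable."""
    t0 = time.time()
    evolve(world, steps)
    return steps / (time.time() - t0)
```

With the post's package layout this would be called as, e.g., `steps_per_second(lib.evolve, lib.World(101))`, and later as `steps_per_second(evolve_c, ...)` for a C implementation, so the same measurement applies to every version.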

Simple C Extension 1

In this section we'll look at a C extension module implementing the evolve function. It may be helpful to have a copy of the C file while going through this section. The file, src/simple1.c, is available on github.

For additional documentation on NumPy's C API, see the NumPy Reference. Python's C API is documented in detail in the Python/C API Reference Manual.
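The post doesn't show a build script, but a minimal distutils setup.py along the following lines should compile the extension (the module and file names match the post; everything else here is an assumption):

```python
# setup.py -- assumed build script, not shown in the original post.
from distutils.core import setup, Extension
import numpy

setup(
    name="simple1",
    ext_modules=[
        Extension(
            "simple1",
            sources=["src/simple1.c"],
            # NumPy's headers are needed for the C API used below.
            include_dirs=[numpy.get_include()],
        )
    ],
)
```

Running `python setup.py build_ext --inplace` would then produce an importable simple1 module.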


The first thing in the file is a forward declaration of our evolve function. This will be used directly below in the method list.

static PyObject *evolve(PyObject *self, PyObject *args); 

The next thing is the method list.

static PyMethodDef methods[] = {
  { "evolve", evolve, METH_VARARGS, "Doc string."},
  { NULL, NULL, 0, NULL } /* Sentinel */
};

This is a list of exported methods for the extension module. Here we only have one method named evolve.

The last bit of boilerplate is the module initialization.

PyMODINIT_FUNC initsimple1(void) {
  (void) Py_InitModule("simple1", methods);
  import_array();
}

Note that the name in initsimple1 has to match the first argument to Py_InitModule as shown here. The import_array call is necessary for any extension using the NumPy API.

Array Access Macros

The array access macros can be used to correctly index into an array regardless of how it's been reshaped or sliced. These macros also make the code following them much more readable.

#define m(x0) (*(npy_float64*)((PyArray_DATA(py_m) +                \
                                (x0) * PyArray_STRIDES(py_m)[0])))
#define m_shape(i) (py_m->dimensions[(i)])

#define r(x0, x1) (*(npy_float64*)((PyArray_DATA(py_r) +                \
                                    (x0) * PyArray_STRIDES(py_r)[0] +   \
                                    (x1) * PyArray_STRIDES(py_r)[1])))
#define r_shape(i) (py_r->dimensions[(i)])

Here we see accessor macros for a one and two dimensional array. Arrays with higher dimensionality can be accessed in a similar fashion.
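The same base-pointer-plus-strides arithmetic that the macros perform can be checked from Python. This is a small sketch with a made-up array r; NumPy exposes the byte strides directly:

```python
import struct
import numpy as np

r = np.arange(12, dtype=np.float64).reshape(4, 3)
s0, s1 = r.strides              # byte steps between rows and between columns

# Recover r[2, 1] from the raw buffer exactly as the r(x0, x1) macro does:
# base pointer + x0 * strides[0] + x1 * strides[1].
offset = 2 * s0 + 1 * s1
val = struct.unpack_from('d', r.tobytes(), offset)[0]

assert val == r[2, 1]
```

Because the macros always multiply by the stride, they keep working even if the array is later transposed or sliced, which changes the strides but not the formula.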

With the help of these macros, we can loop through r using the following code:

for(i = 0; i < r_shape(0); ++i) {
  for(j = 0; j < r_shape(1); ++j) {
    r(i, j) = 0; // Zero all elements. 

A Note on Naming

The macros defined above will only work if the matching NumPy array objects are defined with the correct names. In the code above, the arrays are named py_m and py_r. In order to use the same macros across different functions, the names of the NumPy arrays need to be kept consistent.

Computing the Forces

The function that computes the force array looks fairly verbose, especially when compared to the five lines of Python above.

static inline void compute_F(npy_int64 N,
                             PyArrayObject *py_m,
                             PyArrayObject *py_r,
                             PyArrayObject *py_F) {
  npy_int64 i, j;
  npy_float64 sx, sy, Fx, Fy, s3, tmp;

  // Set all forces to zero.
  for(i = 0; i < N; ++i) {
    F(i, 0) = F(i, 1) = 0;
  }

  // Compute forces between pairs of bodies.
  for(i = 0; i < N; ++i) {
    for(j = i + 1; j < N; ++j) {
      sx = r(j, 0) - r(i, 0);
      sy = r(j, 1) - r(i, 1);

      s3 = sqrt(sx*sx + sy*sy);
      s3 *= s3 * s3;

      tmp = m(i) * m(j) / s3;
      Fx = tmp * sx;
      Fy = tmp * sy;

      F(i, 0) += Fx;
      F(i, 1) += Fy;

      F(j, 0) -= Fx;
      F(j, 1) -= Fy;
    }
  }
}
Notice that we've reduced the range of the inner loop by using Newton's third law: forces occur in pairs of equal magnitude and opposing direction. Unfortunately it's still O(N²).
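For comparison with the earlier pure-Python version, the same pair-halving trick can be written in Python. This helper is ours, not part of the post's lib, but it computes the same forces as the vectorized compute_F above:

```python
import numpy as np

def compute_F_pairs(m, r, F):
    """Accumulate gravity-like forces, visiting each (i, j) pair once."""
    N = len(m)
    F[:] = 0.0
    for i in range(N):
        for j in range(i + 1, N):
            s = r[j] - r[i]                   # vector from body i to body j
            s3 = (s[0]**2 + s[1]**2) ** 1.5   # |s|**3
            Fij = m[i] * m[j] * s / s3
            F[i] += Fij                       # Newton's third law:
            F[j] -= Fij                       # equal and opposite.
```

In Python this form is actually slower than the vectorized version, because it trades N array operations for N²/2 interpreted loop iterations; the halving only pays off once the loop body is compiled C.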

The Evolution Function

The final function in the file is the exported evolve function.

static PyObject *evolve(PyObject *self, PyObject *args) {

  // Declare variables. 
  npy_int64 N, threads, steps, step, i;
  npy_float64 dt;
  PyArrayObject *py_m, *py_r, *py_v, *py_F;

  // Parse arguments. 
  if (!PyArg_ParseTuple(args, "ldllO!O!O!O!",
                        &N, &dt, &threads, &steps,
                        &PyArray_Type, &py_m,
                        &PyArray_Type, &py_r,
                        &PyArray_Type, &py_v,
                        &PyArray_Type, &py_F)) {
    return NULL;
  }

  // Evolve the world. 
  for(step = 0; step < steps; ++step) {
    compute_F(N, py_m, py_r, py_F);

    for(i = 0; i < N; ++i) {
      v(i, 0) += F(i, 0) * dt / m(i);
      v(i, 1) += F(i, 1) * dt / m(i);

      r(i, 0) += v(i, 0) * dt;
      r(i, 1) += v(i, 1) * dt;
    }
  }

  Py_RETURN_NONE;
}

Here we see how the Python arguments are parsed. In the time-step loop at the bottom of the function, we see the explicit computation of the x and y components of the velocity and position vectors.


It shouldn't be surprising that the C version of the evolve function is faster than the Python version. On the same i5 desktop mentioned above, the C implementation of the evolve function is able to step through 17972 time-steps per second. That's a factor of 70 improvement over the Python implementation.


Notice that the C code has been kept as simple as possible. Input arguments and output arrays can be type checked and allocated in a Python wrapper function. Removing allocations not only speeds up processing, but removes the potential for memory leaks (or worse) resulting from incorrect reference counts on Python objects.
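Such a wrapper might look like the following sketch. The function evolve_c and its checks are ours, not the post's; the argument order matches the "ldllO!O!O!O!" format string (N, dt, threads, steps, then the four arrays):

```python
import numpy as np

def evolve_c(w, steps):
    """Hypothetical wrapper: validate the World's arrays, then hand
    everything to the C extension's evolve."""
    import simple1
    assert w.m.dtype == np.float64 and w.m.shape == (w.N,)
    assert w.r.shape == w.v.shape == w.F.shape == (w.N, 2)
    simple1.evolve(w.N, w.dt, w.threads, steps, w.m, w.r, w.v, w.F)
```

Keeping the checks in Python lets the C function assume well-formed float64 arrays, which is exactly what makes its simplicity safe.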

Next Time

In the next part of this series, we'll improve on the performance of this implementation by taking advantage of C-contiguous NumPy arrays. After that we'll look at using Intel's SIMD instructions and OpenMP to push the envelope further.

If you have any questions, comments, suggestions, or corrections, please let me know via the contact link.

The Author

J. David Lee is a programmer turned physicist turned programmer and the proprietor of Crumpington Consulting. If you feel that his expertise could benefit you or your organization, don't hesitate to get in touch.