
High Performance Python Extensions: Part 2

By J. David Lee on 2014-10-21.



Introduction

This post is the second in a series looking at the process of writing C extension modules for Python using the NumPy API. In part 1 we built a simple N-body simulation and found that the bottleneck was computing the mutual forces between bodies, an O(N²) operation. By implementing the time-evolution function in C, we were able to speed up the computation by a factor of about 70.

If you haven't read the first post, you should probably skim it before moving on.

In this post we'll trade some generality in our code for improved performance.

Review

World is a class that stores the state of N bodies. Our simulation will evolve that state over a series of time-steps.

class World(object):
    """World is a structure that holds the state of N bodies and
    additional variables.

    threads : (int) The number of threads to use for multithreaded
              implementations.
    dt      : (float) The time-step.

    STATE OF THE WORLD: 

    N : (int) The number of bodies in the simulation.
    m : (1D ndarray) The mass of each body.
    r : (2D ndarray) The position of each body.
    v : (2D ndarray) The velocity of each body.
    F : (2D ndarray) The force on each body.

    TEMPORARY VARIABLES:

    Ft : (3D ndarray) A 2D force array for each thread's local storage.
    s  : (2D ndarray) The vectors from one body to all others. 
    s3 : (1D ndarray) The norm of each s vector. 

    NOTE: Ft is used by parallel algorithms for thread-local
          storage. s and s3 are only used by the Python
          implementation.
    """
    def __init__(self, N, threads=1, 
                 m_min=1, m_max=30.0, r_max=50.0, v_max=4.0, dt=1e-3):
        self.threads = threads
        self.N  = N
        self.m  = np.random.uniform(m_min, m_max, N)
        self.r  = np.random.uniform(-r_max, r_max, (N, 2))
        self.v  = np.random.uniform(-v_max, v_max, (N, 2))
        self.F  = np.zeros_like(self.r)
        self.Ft = np.zeros((threads, N, 2))
        self.s  = np.zeros_like(self.r)
        self.s3 = np.zeros_like(self.m)
        self.dt = dt

When the simulation begins, N bodies are randomly assigned a mass, m, position, r, and velocity, v. For each time-step, the following computations are made:

  1. The net force, F, on each body due to all other bodies is computed.
  2. The velocity, v, of each body is modified due to the force.
  3. The position, r, of each body is modified due to its velocity.
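In pure Python with NumPy, these steps might be sketched as follows. This is a simplified, hypothetical version of the part 1 implementation, not the original code; `compute_F` and `step` are illustrative names:

```python
import numpy as np

def compute_F(m, r):
    """Net force on each body from all others (gravity with G = 1)."""
    N = len(m)
    F = np.zeros_like(r)
    for i in range(N):
        s = r - r[i]                         # vectors from body i to the others
        s3 = (s[:, 0]**2 + s[:, 1]**2)**1.5  # |s|^3 for each body
        s3[i] = 1.0                          # placeholder; self-term is zeroed below
        contrib = (m[i] * m / s3)[:, None] * s
        contrib[i] = 0.0
        F[i] = contrib.sum(axis=0)
    return F

def step(m, r, v, F, dt):
    """One time-step: update F, then v, then r, in place."""
    F[:] = compute_F(m, r)
    v += F * dt / m[:, None]
    r += v * dt
```

Because every pairwise force appears with its equal-and-opposite partner, total momentum is conserved by each step, which makes a handy correctness check.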

Accessor Macros

The extension module that we created in part 1 used macros to access the elements of NumPy arrays in C. Here's what one of those macros looks like:

#define r(x0, x1) (*(npy_float64*)((PyArray_DATA(py_r) +                \
                                    (x0) * PyArray_STRIDES(py_r)[0] +   \
                                    (x1) * PyArray_STRIDES(py_r)[1])))

Using a macro like this, we can access the elements of the py_r array using a simple notation like r(i, j). The value indexed will match what you'd see in Python, even if the array has been reshaped or sliced in some way.

For general purpose code, that's what we'd want. In the case of our simulation, we know the pedigree of our arrays: they're contiguous and never sliced or reshaped. We can use this fact to both simplify and improve the performance of our code.
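We can verify this from Python before relying on it in C. In this hypothetical check (not part of the extension), a freshly allocated (N, 2) float64 array is C-contiguous, so the macro's stride arithmetic reduces to flat indexing:

```python
import numpy as np

r = np.zeros((100, 2))            # allocated like World.r: contiguous, C-ordered
r[7, 1] = 3.5

assert r.flags['C_CONTIGUOUS']
assert r.strides == (16, 8)       # 16 bytes per row, 8 bytes per float64

# The macro computes data + i*strides[0] + j*strides[1]; for this
# layout that address is the same as flat index 2*i + j.
flat = r.ravel()
assert flat[2*7 + 1] == r[7, 1]
```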

Simple C Extension 2

In this section we'll look at a modified version of our C extension that does away with the accessor macros and works directly with the NumPy arrays' underlying data. The code for this section, src/simple2.c, is available on GitHub.

For comparison, the implementation from the previous post is available here.

The Evolution Function

Starting at the bottom of the file, we can see that the evolve function parses the Python arguments in the same manner as the previous version, but now we use the PyArray_DATA macro to obtain a pointer to each array's underlying memory. We cast this pointer to npy_float64*, where npy_float64 is an alias for double.

static PyObject *evolve(PyObject *self, PyObject *args) {
  // Declare variables. 
  npy_int64 N, threads, steps, step, i, xi, yi;
  npy_float64 dt;
  PyArrayObject *py_m, *py_r, *py_v, *py_F;
  npy_float64 *m, *r, *v, *F;

  // Parse arguments. 
  if (!PyArg_ParseTuple(args, "ldllO!O!O!O!",
                        &threads,
                        &dt,
                        &steps,
                        &N,
                        &PyArray_Type, &py_m,
                        &PyArray_Type, &py_r,
                        &PyArray_Type, &py_v,
                        &PyArray_Type, &py_F)) {
    return NULL;
  }

  // Get underlying arrays from numpy arrays. 
  m = (npy_float64*)PyArray_DATA(py_m);
  r = (npy_float64*)PyArray_DATA(py_r);
  v = (npy_float64*)PyArray_DATA(py_v);
  F = (npy_float64*)PyArray_DATA(py_F);

  // Evolve the world. 
  for(step = 0; step < steps; ++step) {
    compute_F(N, m, r, F);

    for(i = 0; i < N; ++i) {
      // Compute offsets into arrays. 
      xi = 2 * i; 
      yi = xi + 1;

      v[xi] += F[xi] * dt / m[i];
      v[yi] += F[yi] * dt / m[i];

      r[xi] += v[xi] * dt;
      r[yi] += v[yi] * dt;
    }
  }

  Py_RETURN_NONE;
}

The only change from our previous macro-based implementation is that we now have to compute our array indices explicitly. We could have cast the underlying data to two-dimensional npy_float64 arrays, but that would require some upfront cost and memory management.

Computing the Forces

The forces are computed in the same manner as in the previous implementation; the only differences are the notation and the explicit index calculations in the nested for-loops.

static inline void compute_F(npy_int64 N,
                             npy_float64 *m,
                             npy_float64 *r,
                             npy_float64 *F) {
  npy_int64 i, j, xi, yi, xj, yj;
  npy_float64 sx, sy, Fx, Fy, s3, tmp;

  // Set all forces to zero. 
  for(i = 0; i < N; ++i) {
    F[2*i] = F[2*i + 1] = 0;
  }

  // Compute forces between pairs of bodies.
  for(i = 0; i < N; ++i) {
    xi = 2*i;
    yi = xi + 1;
    for(j = i + 1; j < N; ++j) {
      xj = 2*j; 
      yj = xj + 1;

      sx = r[xj] - r[xi];
      sy = r[yj] - r[yi];

      s3 = sqrt(sx*sx + sy*sy);
      s3 *= s3*s3;

      tmp = m[i] * m[j] / s3;
      Fx = tmp * sx;
      Fy = tmp * sy;

      F[xi] += Fx;
      F[yi] += Fy;
      F[xj] -= Fx;
      F[yj] -= Fy;
    }
  }
}
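Note that the variable s3 briefly holds |s| and, after the in-place multiply, |s|³, which is what the force law divides by. The same idiom in Python, as a quick check:

```python
import math

sx, sy = 3.0, 4.0
s3 = math.sqrt(sx*sx + sy*sy)  # s3 is |s| = 5 here
s3 *= s3 * s3                  # now s3 is |s|^3 = 125
assert s3 == 125.0
```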

Performance

I was surprised to find that, compared to our original C extension, this implementation is 45% faster, performing 26108 time-steps per second versus the previous version's 17972 on the same Intel i5 desktop. This represents a factor of 101 improvement over the original Python and NumPy implementation.

Using SIMD Instructions

In the implementation above, we calculated vector components in the x and y directions explicitly. If we're willing to give up some portability, we can use x86 SIMD (Single Instruction Multiple Data) instructions to simplify our code, and hopefully speed it up as well.

The code for this section, src/simd1.c, is available on GitHub.

x86 Intrinsics

In the code below, we'll be utilizing the vector data type __m128d. The number 128 refers to the number of bits in the vector, while the letter "d" indicates the data type of the vector's components. In this case the components are of type double (64-bit floating point). Other vector widths and component types are available depending on the architecture.

A variety of intrinsic functions are available for working with vector data types. Intel has a handy reference available on their site.

Portability

I'm working on Debian Wheezy using the GNU gcc compiler. In this environment the header file defining the available intrinsic data types and functions is x86intrin.h. This may not be portable across platforms.

The Evolution Function

The evolve function shown here is more compact than the version above. The two-dimensional arrays are cast to __m128d pointers, and the use of vectors obviates the need to compute the x and y components explicitly.

static PyObject *evolve(PyObject *self, PyObject *args) {
  // Variable declarations.
  npy_int64 N, threads, steps, step, i;
  npy_float64 dt;

  PyArrayObject *py_m, *py_r, *py_v, *py_F;
  npy_float64 *m;
  __m128d *r, *v, *F;

  // Parse arguments. 
  if (!PyArg_ParseTuple(args, "ldllO!O!O!O!",
                        &threads,
                        &dt,
                        &steps,
                        &N,
                        &PyArray_Type, &py_m,
                        &PyArray_Type, &py_r,
                        &PyArray_Type, &py_v,
                        &PyArray_Type, &py_F)) {
    return NULL;
  }

  // Get underlying arrays from numpy arrays. 
  m = (npy_float64*)PyArray_DATA(py_m);
  r = (__m128d*)PyArray_DATA(py_r);
  v = (__m128d*)PyArray_DATA(py_v);
  F = (__m128d*)PyArray_DATA(py_F);

  // Evolve the world.
  for(step = 0; step < steps; ++step) {
    compute_F(N, m, r, F);

    for(i = 0; i < N; ++i) {
      v[i] += F[i] * dt / m[i];
      r[i] += v[i] * dt;
    }
  }

  Py_RETURN_NONE;
}

Computing the Forces

The force computation is also more compact than in the previous implementation.

static inline void compute_F(npy_int64 N,
                             npy_float64 *m,
                             __m128d *r,
                             __m128d *F) {
  npy_int64 i, j;
  __m128d s, s2, tmp;
  npy_float64 s3;

  // Set all forces to zero.
  for(i = 0; i < N; ++i) {
    F[i] = _mm_set1_pd(0);
  }

  // Compute forces between pairs of bodies. 
  for(i = 0; i < N; ++i) {
    for(j = i + 1; j < N; ++j) {
      s = r[j] - r[i];

      s2 = s * s;
      s3 = sqrt(s2[0] + s2[1]);
      s3 *= s3 * s3;

      tmp = s * m[i] * m[j] / s3;
      F[i] += tmp;
      F[j] -= tmp;
    }
  }
}
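To convince ourselves that the two-lane arithmetic matches the component-wise version, here's a small Python check that treats each __m128d as a length-2 array. The functions force_pair and force_pair_xy are hypothetical names for illustration, not part of either extension:

```python
import math
import numpy as np

def force_pair(mi, mj, ri, rj):
    """Force on body i from body j, as in the SIMD code: m_i m_j s / |s|^3."""
    s = np.asarray(rj, dtype=float) - np.asarray(ri, dtype=float)
    s3 = np.sqrt(s @ s) ** 3
    return mi * mj * s / s3

def force_pair_xy(mi, mj, ri, rj):
    """The same force, component by component, as in the scalar extension."""
    sx, sy = rj[0] - ri[0], rj[1] - ri[1]
    s3 = math.sqrt(sx*sx + sy*sy)
    s3 *= s3 * s3
    tmp = mi * mj / s3
    return np.array([tmp * sx, tmp * sy])
```

Both paths compute the same numbers; the SIMD version simply performs the x and y lanes in one instruction.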

Performance

While this explicitly vectorized code is clearer and more compact than the previous version, it only runs about 1% faster, performing 26427 time-steps per second. It's likely that the compiler was able to optimize the previous implementation using the same vector instructions.

Conclusion

If full generality isn't necessary, performance gains can often be achieved by accessing a NumPy array's underlying memory directly in C.

While the explicit use of vector intrinsics didn't provide much benefit in this example, the relative performance difference will increase in a multi-core setting using OpenMP.

Next Time

In the next part of this series, we'll parallelize both of the implementations shown here using OpenMP, and see how performance scales using multiple cores.

If you have any questions, comments, suggestions, or corrections, please let me know via the contact link.


The Author

J. David Lee is a programmer turned physicist turned programmer and the proprietor of Crumpington Consulting. If you feel that his expertise could benefit you or your organization, don't hesitate to get in touch.