Home About Blog Contact Software

Previous Next All

High Performance Python Extensions: Part 3

By J. David Lee on 2014-10-24.

Click here to learn about our on-demand, auto scaling, self-healing, cloud batch processing clusters.


This post is the third in a series that looks at writing a high performance C extension module for Python using the NumPy API. In this post we'll use OpenMP to parallelize the implementations developed in part 2.


World is a class that stores the state of N bodies. Our simulation will evolve that state over a series of time-steps.

class World(object):
    """World is a structure that holds the state of N bodies and
    additional variables.

    threads : (int) The number of threads to use for multithreaded
    dt      : (float) The time-step. 


    N : (int) The number of bodies in the simulation.
    m : (1D ndarray) The mass of each body.
    r : (2D ndarray) The position of each body.
    v : (2D ndarray) The velocity of each body.
    F : (2D ndarray) The force on each body.


    Ft : (3D ndarray) A 2D force array for each thread's local storage.
    s  : (2D ndarray) The vectors from one body to all others. 
    s3 : (1D ndarray) The norm of each s vector. 

    NOTE: Ft is used by parallel algorithms for thread-local
          storage. s and s3 are only used by the Python
    def __init__(self, N, threads=1, 
                 m_min=1, m_max=30.0, r_max=50.0, v_max=4.0, dt=1e-3):
        self.threads = threads
        self.N  = N
        self.m  = np.random.uniform(m_min, m_max, N)
        self.r  = np.random.uniform(-r_max, r_max, (N, 2))
        self.v  = np.random.uniform(-v_max, v_max, (N, 2))
        self.F  = np.zeros_like(self.r)
        self.Ft = np.zeros((threads, N, 2))
        self.s  = np.zeros_like(self.r)
        self.s3 = np.zeros_like(self.m)
        self.dt = dt

When the simulation begins, N bodies are randomly assigned a mass, m, position, r, and velocity, v. For each time-step, the following computations are made:

  1. The net force, F, on each body due to all other bodies is computed.
  2. The velocity, v, of each body is modified due to the force.
  3. The position, r, of each body is modified due to its velocity.

Computing the Forces: Serial Code

Below is the function compute_F from the implementation developed in the previous post (full source here). This function computes the mutual forces between each pair of bodies in the simulation, and has O(N2) complexity.

static inline void compute_F(npy_int64 N,
                             npy_float64 *m,
                             __m128d *r,
                             __m128d *F) {
  npy_int64 i, j;
  __m128d s, s2, tmp;
  npy_float64 s3;

  // Set all forces to zero.
  for(i = 0; i < N; ++i) {
    F[i] = _mm_set1_pd(0);

  // Compute forces between pairs of bodies. 
  for(i = 0; i < N; ++i) {
    for(j = i + 1; j < N; ++j) {
      s = r[j] - r[i];

      s2 = s * s;
      s3 = sqrt(s2[0] + s2[1]);
      s3 *= s3 * s3;

      tmp = s * m[i] * m[j] / s3;
      F[i] += tmp;
      F[j] -= tmp;


Using this serial implementation, our code is able to step through 26427 time-steps per second on an Intel i5 desktop.

Computing the Forces: Parallel Code

Below is a re-implementation of compute_F using OpenMP to parallelize the loops. The full source file, src/simdomp1.c, is available on github.

static inline void compute_F(npy_int64 threads,
                             npy_int64 N,
                             npy_float64 *m,
                             __m128d *r,
                             __m128d *F, 
                             __m128d *Ft) {
  npy_int64 id, i, j, Nid;
  __m128d s, s2, tmp;
  npy_float64 s3;

#pragma omp parallel private(id, i, j, s, s2, s3, tmp, Nid) 
    id = omp_get_thread_num();
    Nid = N * id; // Zero-index in thread-local array Ft. 

    // Zero out the thread-local force arrays. 
    for(i = 0; i < N; i++) {
      Ft[Nid + i] = _mm_set1_pd(0);

    // Compute forces between pairs of bodies.
#pragma omp for schedule(dynamic, 8) 
    for(i = 0; i < N; ++i) {
      F[i] = _mm_set1_pd(0);

      for(j = i + 1; j < N; ++j) {

        s = r[j] - r[i];
        s2 = s * s;
        s3 = sqrt(s2[0] + s2[1]);
        s3 *= s3 * s3;

        tmp = s * m[i] * m[j] / s3;

        Ft[Nid + i] += tmp;
        Ft[Nid + j] -= tmp;

    // Sum the thread-local forces computed above.
#pragma omp for 
    for(i = 0; i < N; ++i) {
      for(id = 0; id < threads; ++id) {
        F[i] += Ft[N*id + i];

Thread-local Storage

The most noticeable difference between the parallel and serial versions is the appearance of the Ft array. The Ft array provides thread-local storage for the forces computed by each concurrently running thread. These local arrays are then summed in a separate loop to produce the array of forces, F.

Local storage is necessary to avoid a race condition. Imagine, for example, that one thread is executing with i=0 and j=1, while another is using i=1 and j=2. In our algorithm, they may both attempt to modify F[1], causing a race.

You can read about race conditions on Wikipedia.


The OpenMP API is implemented using functions such as omp_get_thread_num, and directives such as #pragma omp parallel.

In the code above we create a number of threads with the #pragma omp parallel directive. The private directive lists variables that are private to each thread. All other variables are shared.

The #pragma omp for directive allows the for-loop to proceed in parallel, with each thread processing different indices. We use the schedule(dynamic, 8) directive to tell the compiler to use the dynamic scheduler with a chunk size of 8. The dynamic scheduler is useful when the time spent in each loop iteration can vary, as it does in this case.

Note that there is an implicit barrier at the end of each parallel for-loop. All threads must complete the loop before any can advance.

See the OpenMP website for more information on the API.


Because we've added an extra for-loop to sum the thread-local force arrays, we can expect the single-core performance to suffer slightly. In my tests the OpenMP version runs about 3% slower than the version without OpenMP when using a single thread.

With four cores and four threads on an Intel i5 desktop, the OpenMP implementation using SIMD instructions steps through 80342 time-steps per second. This is a factor of 312 times faster than our original implementation in Python using NumPy, and about 3 times faster than our fastest serial implementation.


i5 OMP Scaling

The plot above shows how the OpenMP code scales with the number of threads on a quad-core i5 desktop. The points labeled "OMP+SIMD" correspond to the implementation described in this post. A similar version, available here, without explicit vector instructions is labeled "OMP". The original version using Python and NumPy is shown for comparison (lower left).

The performance scales well up to the number of physical cores, and drops off sharply as additional threads are added. Note that this isn't true in general, so it's always a good idea to benchmark your code with different numbers of threads.

Environment Variables

There are a variety of environment variables that can change the behavior of programs using OpenMP. These can have a substantial affect on performance, particularly on machines with multiple physical processors.

See the GNU libgomp documentation for details.

Performance Summary: All Implementations

Here are the benchmarks for all of the implementations we created in this series of posts. The benchmarks were run on a quad-core Intel i5 desktop under Debian Wheezy using gcc 4.7.2.

| Implementation   | steps / second | Speed-up |
| ---------------- | -------------- | -------- |
| Python+Numpy     |            257 |        1 |
| Simple C 1       |          17974 |       70 |
| Simple C 2       |          26136 |      102 |
| SIMD             |          26423 |      103 |
| OMP 4 cores      |          76657 |      298 |
| OMP+SIMD 4 cores |          80342 |      313 |

Additional Tests

DigitalOcean 20-Core VM

I ran the same benchmarks on a DigitalOcean 20-core instance, their largest. The best performance was about factor of 3 worse than when running with four threads on the quad-core i5 desktop above. Whether the difference is due to virtualization or shared cores, I don't know.

DigitalOcean 20-Core OMP Scaling

RunAbove 176-Thread VM

RunAbove had a promotion for a free month of time on their IBM Power 8 architecture. Just for fun I ran the same benchmarks on their 176-thread instance. I'd never used the PowerPC architecture before, and was happy that the code ran without modification.

RunAbove 176-Thread OMP Scaling


Matthew Honnibal submitted an implementation in Cython, available here. This implementation appears to have performance similar to our initial C implementation from part 1.


When performance is critical, a C extension can drastically improve performance. Using SIMD instructions can benefit performance and simplify coding but at the cost of portability. OpenMP support can be added to existing code without much work, and can improve performance on multi-core systems.

If you have any questions, comments, suggestions, or corrections, please let me know via the contact link.

The Author

J. David Lee is a programmer turned physicist turned programmer and the proprietor of Crumpington Consulting. If you feel that his expertise could benefit you or your organization, don't hesitate to get in touch.