In today’s world multi-core architectures are becoming more and more popular, but with a different architectural concept comes a new programming model. Implementing parallel programs with tools that are not specifically designed for parallelism can be very tricky and usually breaks down once an application starts to grow. OpenMP is nowadays one of the leading APIs for shared-memory parallel programming.

Programmer’s Motivation

To demonstrate the difficulty of writing parallel code, the simple example shown below has been prepared.

Incrementer: Increments all members of an array.
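The referenced listing is not reproduced here. A minimal serial sketch, assuming a plain int array and the hypothetical function name increment_all, could look like this:

/* Serial incrementer: adds one to every element of the array. */
void increment_all(int *array, long length)
{
    for (long i = 0; i < length; i++) {
        array[i] = array[i] + 1;
    }
}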

Given the basic task of incrementing all elements of a large array, it is additionally required to add parallelization to make use of existing multi-core hardware. Noticing that all elements are independent of each other, a first step would be to divide the loop into segments. Rather than incrementing each element in sequence, sets of iterations should be formed and incremented in parallel. This can be achieved through threading: each thread works on a portion of the array while the operating system abstracts away further scheduling details. Distributing the task over n threads already raises its own problems. Many questions arise:

What is the most suitable size for n?

What specific task will each thread carry out (can it be generic)?

Is parallelization always desired?

How will a larger application handle the large number of threads?

OpenMP solves most of the previously mentioned problems.

What is OpenMP?

Open: Open Specification

M: Multi

P: Processing

OpenMP is an API available for C/C++ and Fortran which runs on most platforms. It targets shared-memory architectures and aims at parallelizing code to improve performance on multi-core computers.

OpenMP was defined by a group of large hardware manufacturers such as Intel, IBM and Digital. A forum was created in order to exchange ideas for a new standard in shared-memory computing, which did not exist at the time: code written for one platform was not reusable on other platforms. OpenMP was first released in 1997 and is nowadays supported on most computer platforms.

Main components of OpenMP

OpenMP Extensions: http://en.wikipedia.org

While parallel control structures are used to parallelize code, work sharing is used for distributing work (e.g. breaking up loops). The data environment defines the scope of variables within parallel regions. Synchronization is a critical element and describes the chronological ordering of threads. Runtime routines allow an application to interact with the environment, for example to retrieve platform-specific information. The OpenMP architecture has the following structure.

OpenMP Architecture: http://www.lrz-muenchen.de

OpenMP works with threads which are handled by the operating system. Programmers can put directives in their code to indicate parallelization at certain points, while the environment variables determine runtime behaviour (resource allocation, scheduling algorithm, number of threads).

Directives in OpenMP (specifically for C)

The general form of a directive in C is: #pragma omp <rest of pragma>.

Note: pragmas are ignored by compilers that do not recognize them.

Example usage of directives follows:

#pragma omp parallel {}: Parallelization of the code enclosed by the braces.

#pragma omp parallel for: Parallelization of a for-loop (the Fortran counterpart is parallel do).

Parallel Control

OpenMP & Threads: http://cs.calvin.edu

The parallel directive parallelizes a piece of code as shown in the example below. On its own it does not distribute the load among the threads; rather, it runs a copy of the code on each available thread.

Hence in the following example the expected output is a message from each thread in the form of “Thread no: x”. If the work is to be shared between the threads, work sharing constructs are needed.

Parallel directive in OpenMP
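The listing is only sketched here. Assuming the runtime routine omp_get_thread_num() is used to obtain the thread number, a comparable program might be:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Every available thread executes the enclosed block once. */
    #pragma omp parallel
    {
        printf("Thread no: %d\n", omp_get_thread_num());
    }
    return 0;
}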

Work Sharing

Incrementer: Implementation in OpenMP
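The referenced implementation is only sketched here. Assuming the serial incrementer from above, a possible OpenMP version is:

/* The loop iterations are now distributed over the available threads. */
void increment_all(int *array, long length)
{
    #pragma omp parallel for
    for (long i = 0; i < length; i++) {
        array[i] = array[i] + 1;
    }
}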

To provide an actual work sharing example, the original incrementer problem has been modified. With the omp parallel for directive, the work is distributed over several threads. This is incredibly powerful, since existing code can be easily modified to run in parallel. Given that pragmas are ignored by compilers without OpenMP support, the same code can at the same time be compiled as ordinary sequential code for a single-core computer. Work sharing can also be applied to whole blocks of code (not just loops). These blocks are referred to as sections. The following example shows a server application with three main tasks that require parallelization.

OpenMP Server Example

The server should listen to incoming connections, update the IP address on a dynamic DNS server and capture some streaming video input. Using the section keyword this can be achieved easily.
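A minimal sketch of such a structure, with the three tasks reduced to hypothetical placeholder functions, might look like this:

/* Hypothetical task functions; the real server logic is not shown. */
void listen_for_connections(void);
void update_dynamic_dns(void);
void capture_video_stream(void);

void run_server_tasks(void)
{
    /* Each section is executed once, by one of the available threads. */
    #pragma omp parallel sections
    {
        #pragma omp section
        listen_for_connections();   /* task 1: handle incoming connections */

        #pragma omp section
        update_dynamic_dns();       /* task 2: refresh the IP on the DNS server */

        #pragma omp section
        capture_video_stream();     /* task 3: capture the streaming video input */
    }
}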

Loop Scheduling in OpenMP (Load Balancing)

Load balancing is one of the most important factors in parallel computing. It ensures that all threads stay as busy as possible in order to make use of the computational power that is available. Some iterations within a loop may need significantly more time than others. By default OpenMP assumes that all iterations of a loop require the same amount of time and hence schedules approximately equally sized chunks to the threads. This approach minimizes the chances of memory conflicts (false sharing), but it may come at a high cost in terms of load balancing, since a particular set of iterations may require more time than others. The opposite approach yields the opposite advantages and disadvantages. Under normal circumstances the programmer knows which division of the iterations will optimize the execution time (or can at least benchmark it). Therefore it is often desirable to decide how the iterations should be scheduled to the available threads. OpenMP supports a set of scheduling clauses that allow a programmer to do so. The general form is: schedule(type, chunk_size). There are four types of scheduling:

STATIC divides the iterations into chunks of size chunk_size, which are scheduled to the available threads in a round-robin fashion. This has a severe drawback: some threads may finish earlier but stay idle because it is not their turn.

DYNAMIC overcomes this problem by assigning waiting chunks to the first thread that finishes its work.

GUIDED schedules dynamically, but does not divide the iterations into equally sized chunks. Instead it divides the remaining iterations by the number of threads, schedules one such chunk to a thread, divides the remaining iterations by the number of threads again, and so on, until the chunk size has decreased (roughly exponentially) to chunk_size.

RUNTIME defers the decision to run time by reading the scheduling algorithm from the OMP_SCHEDULE environment variable.
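As an illustration, the following hedged sketch requests dynamic scheduling with a chunk size of 100 for a loop whose iterations vary in cost; expensive_work() is a placeholder:

double expensive_work(int i);   /* placeholder for work of varying cost */

void process(double *result, int n)
{
    /* Chunks of 100 iterations are handed out to whichever thread becomes idle. */
    #pragma omp parallel for schedule(dynamic, 100)
    for (int i = 0; i < n; i++) {
        result[i] = expensive_work(i);
    }
}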

The data environment

In parallel regions it is important to decide on the scope of variables. It makes a significant difference whether a variable is declared as shared among all threads or as private to each thread. Shared variables can be modified and read by all threads; a change made by one thread affects the value seen by all others. In some cases this is important, for instance when a variable is used to enable communication between threads. In contrast, a variable may need to be private to a thread (e.g. a loop counter). Since private variables can be seen as new instances, further complications arise: with what value will a variable be initialized inside every thread, and what will its value be once the parallel region is exited? There are three clauses to handle this issue. FIRSTPRIVATE initializes all “copies” of a variable with its value from before the parallel region is entered. LASTPRIVATE makes sure that the value from the sequentially last iteration (or section) is written back into the variable. Sometimes it is required to express the resulting value as a function of the final values within all threads.

To do so, the REDUCTION(op:variable) clause can be used, where op is one of the operators +, *, -, &, ^, |, && or || (operators that do not raise precedence issues) and variable is the variable to which the reduction should be applied.
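As a brief sketch, the following sums an array with a reduction; the variable and function names are illustrative only:

/* Each thread accumulates into its own private copy of sum;
   the copies are combined with + when the parallel region ends. */
double sum_array(const double *array, int n)
{
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += array[i];
    }
    return sum;
}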

Synchronization

When threads are not synchronized and do not maintain a chronological order required by a program, this can have an undesired impact on other threads and hence on the entire program. Thread synchronization is a critical part of parallel programming and there are several techniques to synchronize threads properly, three of which are described here. To force only one thread to run at a time, the CRITICAL directive can be used. Imagine a problem with multiple threads and a global variable v whose value must be modified and read by the threads in a fixed order throughout the program; this makes the importance of an only-one-thread-at-a-time policy quite comprehensible. Sometimes a certain order is required but not of a global nature. Think of a loop with a flow dependency (e.g. array[i] = array[i-1] * i): here a sequencing directive with a local characteristic is needed. ORDERED takes care of such a case. It orders the execution of loop iterations in the same way they would execute in a sequential program.

To simply make all threads wait for one another before continuing with further work, the BARRIER directive can be placed within a parallel region.
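A minimal sketch combining these three directives (the shared counter, the array and the helper functions are placeholders):

void work_before_barrier(void);   /* hypothetical helper functions */
void work_after_barrier(void);

void synchronization_sketch(int *array, int n, int *shared_counter)
{
    #pragma omp parallel
    {
        /* CRITICAL: only one thread at a time updates the shared counter. */
        #pragma omp critical
        (*shared_counter)++;

        work_before_barrier();

        /* BARRIER: every thread waits here until all threads have arrived. */
        #pragma omp barrier

        work_after_barrier();

        /* ORDERED: the flow-dependent statement is executed in sequential
           iteration order, as it would be in a serial program. */
        #pragma omp for ordered
        for (int i = 1; i < n; i++) {
            #pragma omp ordered
            array[i] = array[i - 1] * i;
        }
    }
}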

Thread count

The number of threads very much depends on the application. When the application mainly performs computations, it is recommended to use as many threads as there are processor cores. This is reflected by the following graph, which shows performance versus thread count on a quad-core processor.

Computation of pi; Speedup related to CPU vs. Thread count: http://www.math.utah.edu

I/O, on the other hand, tends to behave quite differently in this respect. The following graph indicates that an increased number of threads does improve the throughput of I/O devices.

Process Count vs. throughput: http://www.open-mag.com

Conclusion: The correct thread count depends on the application.

Hybrid model

OpenMP tends to have serious issues when a very large number of threads is used. An additional limitation of OpenMP is that it only runs on shared-memory systems.

MPI & OpenMP Hybridization: http://www.erdc.hpc.mil

A model which combines OpenMP with MPI (Message Passing Interface) has evolved recently. It combines the best of both: MPI supports distributed-memory systems and is capable of handling a large number of parallel processes, while OpenMP is easier to implement and to maintain. In the hybrid model MPI forms the backbone of the program, while OpenMP is used to further subdivide the work into threads to increase performance.
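A very small sketch of the hybrid idea, assuming a standard MPI installation and leaving out the actual work distribution:

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    /* MPI forms the backbone: one process per node of the distributed-memory system. */
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Within each process, OpenMP spawns threads on the local cores. */
    #pragma omp parallel
    {
        printf("MPI rank %d, OpenMP thread %d\n", rank, omp_get_thread_num());
    }

    MPI_Finalize();
    return 0;
}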

Pros & Cons of OpenMP

Advantages:

Easy to program / easy to maintain.

Supports parallelization and sequential code execution.

Disadvantages:

Runs only on shared memory computers.

No direct GPU support.

Needs a compiler with OpenMP support.

Fails when using a large number of threads.

Performance of OpenMP vs. other models

This graph shows a benchmark that was run on an 8-core computer running Linux. When the application is coarse-grained, the performance of OpenMP is very satisfactory.

Once the same problem becomes fine-grained, OpenMP fails, which may very well be due to the fact that OpenMP is not good at handling communication between a large number of threads.

The next benchmark shows the Smith-Waterman algorithm for sequence matching. In this benchmark OpenMP consistently remains among the top performers.

Link to the original presentation: Presentation

To learn more about OpenMP, check the following links:

Official OpenMP website, also check out the links provided there.
OpenMP Tutorial, by Blaise Barney, Lawrence Livermore National Laboratory.
cOMPunity, the OpenMP community.
OpenMP 3.0 specifications
