C++17 introduced parallelized versions of many standard algorithms, such as count, count_if and copy. Some algorithms, like accumulate and inner_product, got parallelized siblings, reduce and transform_reduce, which do not process the collection in order.
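To make the in-order versus unordered distinction concrete, here is a minimal sketch that puts the old algorithms next to their new siblings. It uses the sequential overloads (no execution policy), so it only needs a standard library that ships the C++17 <numeric> additions. The function names are made up for illustration.

```cpp
#include <numeric>
#include <vector>

// Strictly left-to-right sum: ((0 + x0) + x1) + ...
int sum_in_order(const std::vector<int>& xs)
{
    return std::accumulate(xs.begin(), xs.end(), 0);
}

// std::reduce is allowed to group and reorder the additions, so the
// operation should be associative and commutative for a well-defined result.
// No execution policy is passed here, so this overload is not parallel.
int sum_any_order(const std::vector<int>& xs)
{
    return std::reduce(xs.begin(), xs.end(), 0);
}

// In-order dot product of two equally sized ranges.
int dot_in_order(const std::vector<int>& a, const std::vector<int>& b)
{
    return std::inner_product(a.begin(), a.end(), b.begin(), 0);
}

// Unordered sibling: multiply pairwise (transform), then add up (reduce).
int dot_any_order(const std::vector<int>& a, const std::vector<int>& b)
{
    return std::transform_reduce(a.begin(), a.end(), b.begin(), 0);
}
```

For integer addition (overflow aside) both variants return the same value. With floating-point numbers, or with an operation that is not associative and commutative, reduce and transform_reduce may give a different result than their in-order counterparts.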

While GCC and Clang have great support for the other parts of the C++17 standard, the parallelized algorithms are still missing from both libstdc++ and libc++ (the implementation is being worked on).

Fortunately, there are 3rd-party libraries like HPX and Intel’s PSTL library that can be used until GCC and Clang standard library developers catch up.

Setting up PSTL

The latest release of PSTL can be downloaded from intel/parallelstl. You can extract it anywhere you like; I keep it in /usr/local/intel-pstl, as I tend to keep everything that is not installed through APT in separate directories under /usr/local. You will also need Intel TBB 2018. If your distribution does not provide an up-to-date version (Debian, for example, has TBB 2017, which is not sufficient to run PSTL), you can get it from 01org/tbb.

If you downloaded TBB from GitHub, do not forget to set the LD_LIBRARY_PATH environment variable to include the directory where the .so files are located. In my case, the directory is /usr/local/intel-tbb/lib/intel64/gcc4.7 (despite the name, it works with newer versions of GCC, not only 4.x).

The last step is to add a few compiler and linker flags to your project, and you are ready to speed up your computations. For the compiler flags, you need to add the include directories and enable OpenMP SIMD. For GCC, that means the following (just change the paths to the include directories):

CXXFLAGS = -std=c++17 -I/usr/local/intel-tbb/include \
           -I/usr/local/intel-pstl/include -fopenmp-simd

The linker just needs to be told to link against the tbb library:

LDFLAGS = -L/usr/local/intel-tbb/lib/intel64/gcc4.7 -ltbb

Setting up HPX

Setting up the HPX library is a bit simpler. You can get the library by cloning git://github.com/STEllAR-GROUP/hpx.git (you are advised to do a --depth 1 clone if you do not plan to contribute to HPX). After that, just do an out-of-source build using CMake. The only prerequisites for building HPX are the Boost and hwloc libraries, so make sure you have their development packages installed. Optionally, you can install the tcmalloc or jemalloc development packages to give HPX an efficient multi-threaded malloc, or you can pass the -DHPX_WITH_MALLOC=system flag to CMake instead.

mkdir build
cd !$
cmake .. -DCMAKE_INSTALL_PREFIX=/usr/local/hpx -DHPX_WITH_MALLOC=jemalloc
make -j4

When compiling a program, you just need to link against the hpx library (and set the appropriate include and library paths, as was the case with PSTL).

Usage

Both libraries provide a similar API, based on the C++17 standard. They differ in the namespace they use and in the headers you need to include. The PSTL library is meant to be used as a part of the standard library, so it uses the std:: namespace, while HPX is a proper 3rd-party library and defines the parallel algorithms and policies in the hpx::parallel namespace. These differences can easily be remedied by defining a std_par macro that points to the right namespace for the library you want to use.

#include <iostream>
#include <vector>

#ifdef USE_HPX

    #include <hpx/hpx_init.hpp>
    #include <hpx/hpx.hpp>
    #include <hpx/include/parallel_numeric.hpp>
    #include <hpx/include/parallel_algorithm.hpp>
    #include <hpx/parallel/algorithms/fill.hpp>

    #define std_par hpx::parallel

#elif defined(USE_INTEL_PSTL)

    #include <pstl/execution>
    #include <pstl/numeric>
    #include <pstl/algorithm>

    #define std_par std

#else

    // Use the standard library implementation
    #include <execution>
    #include <numeric>
    #include <algorithm>

    #define std_par std

#endif

int main(int argc, char *argv[])
{
    using std_par::execution::par;

    std::vector<int> xs(100000000);

    std_par::fill(par, std::begin(xs), std::end(xs), 42);
    std::cout
        << std_par::reduce(par, std::begin(xs), std::end(xs), 0)
        << std::endl;
}

Speed

It is good that we have two separate implementations of parallel STL algorithms to choose from. If you use only the algorithms defined by the C++17 standard, it will be easy to switch from one implementation to the other to find the one that is most efficient for your particular use case.

On my system (not Intel-based, with GCC), the HPX library seems to be significantly faster than Intel’s PSTL, but your mileage may vary.
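If you want to check this on your own machine, a small timing helper is enough to get a first impression. The following is only a sketch: timed and sum_of are hypothetical names, and the workload uses the sequential std::reduce overload so that it builds without any of the libraries above.

```cpp
#include <chrono>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical helper: runs `work` once and returns its result together
// with the elapsed wall-clock time in milliseconds.
template <typename Work>
auto timed(Work&& work)
{
    using clock = std::chrono::steady_clock;

    const auto start = clock::now();
    auto result = std::forward<Work>(work)();
    const auto stop = clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return std::make_pair(std::move(result), elapsed.count());
}

// Example workload: sums the collection into a long long to avoid overflow.
// No execution policy is passed, so this call is not parallel.
long long sum_of(const std::vector<int>& xs)
{
    return std::reduce(xs.begin(), xs.end(), 0LL);
}
```

Calling timed([&] { return sum_of(xs); }) gives you a pair of the sum and the time it took. To compare the two libraries, replace the body of sum_of with the std_par::reduce(par, ...) call from the Usage section and build the program once per implementation. A single run is noisy, so repeat the measurement a few times before drawing conclusions.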