C++17 has introduced parallelized versions of many standard algorithms like count, count_if, copy and many others. Some algorithms, like accumulate and inner_product, got parallelized siblings, reduce and transform_reduce, which do not process the collections in order.
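To make the ordering point concrete, here is a minimal sketch that contrasts accumulate with reduce and transform_reduce. It uses the sequential overloads from the numeric header so that it builds without any parallel backend; the helper names sum_in_order, sum_any_order and dot_product are made up for this illustration. Because reduce may regroup and reorder its operands, the operation passed to it should be associative and commutative. Integer addition qualifies, while subtraction or floating-point addition may give different results from one run to the next.

```cpp
#include <numeric>
#include <vector>

// Hypothetical helper: strict left-to-right fold, ((0 + x0) + x1) + ...
int sum_in_order(const std::vector<int>& xs)
{
    return std::accumulate(xs.begin(), xs.end(), 0);
}

// Hypothetical helper: the same sum, but the operands may be grouped and
// reordered arbitrarily, which is what makes a parallel version possible.
int sum_any_order(const std::vector<int>& xs)
{
    return std::reduce(xs.begin(), xs.end(), 0);
}

// Hypothetical helper: the reduce-style sibling of std::inner_product.
// Multiplies the elements pairwise and sums the products in any order.
int dot_product(const std::vector<int>& xs, const std::vector<int>& ys)
{
    return std::transform_reduce(xs.begin(), xs.end(), ys.begin(), 0);
}
```

With the parallel overloads, only the call site changes: an execution policy such as par is passed as the first argument.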
While GCC and Clang have really great support for other parts of the C++17 standard, the implementation of parallelized algorithms is missing in both libstdc++ and libc++ (it is being worked on).
Fortunately, there are 3rd-party libraries like HPX and Intel’s PSTL library that can be used until GCC and Clang standard library developers catch up.
Setting up PSTL
The latest release of PSTL can be downloaded from intel/parallelstl. You can extract it anywhere you like; I keep it in /usr/local/intel-pstl (I tend to keep everything that is not installed through APT in separate directories in /usr/local). You will also need Intel TBB 2018. If your distribution does not provide an up-to-date version (Debian, for example, has TBB 2017, which is not sufficient to run PSTL), you can get it from p01orgu/tbb.
If you downloaded TBB from GitHub, do not forget to set the LD_LIBRARY_PATH environment variable to include the directory where the .so files are located. In my case, the directory is /usr/local/intel-tbb/lib/intel64/gcc4.7 (it works with newer versions of GCC, not only 4.x).
The last step is to add a few compiler and linker flags for your project, and you are ready to speed up your computations. For the compiler flags, you need to add the include dirs and to enable OpenMP SIMD. For GCC, it means doing this (just change the paths to the include directories):
CXXFLAGS = -std=c++17 -I/usr/local/intel-tbb/include \
-I/usr/local/intel-pstl/include -fopenmp-simd
As far as the linker goes, it only needs to be told to link against the tbb library:
LDFLAGS = -L/usr/local/intel-tbb/lib/intel64/gcc4.7 -ltbb
Setting up HPX
Setting up the HPX library is a bit simpler. You can get the library by cloning git://github.com/STEllAR-GROUP/hpx.git (you are advised to do a --depth 1 clone if you do not plan to contribute to HPX). After that, just do an out-of-source build using CMake. The only prerequisites for building HPX are the boost and hwloc libraries, so make sure you have their development packages installed. Optionally, you can install the tcmalloc or jemalloc development packages to provide an efficient multi-threaded malloc for HPX to use, or, alternatively, you can pass the -DHPX_WITH_MALLOC=system flag to CMake.
mkdir build
cd !$
cmake .. -DCMAKE_INSTALL_PREFIX=/usr/local/hpx -DHPX_WITH_MALLOC=jemalloc
make -j4
When compiling a program, you just need to link against the hpx library (and set the appropriate include and library paths, as was the case with PSTL).
Usage
Both libraries provide a similar API, based on the C++17 standard. They differ in the namespace they use and in which headers you need to include. The PSTL library is meant to be used as a part of the standard library, so it uses the std:: namespace, while HPX is a proper 3rd-party library and defines the parallel algorithms and policies in the hpx::parallel namespace. These differences can be easily remedied by defining a std_par macro to point to the right namespace for the library you want to use.
#include <iostream>
#include <vector>
#ifdef USE_HPX
#include <hpx/hpx_init.hpp>
#include <hpx/hpx.hpp>
#include <hpx/include/parallel_numeric.hpp>
#include <hpx/include/parallel_algorithm.hpp>
#include <hpx/parallel/algorithms/fill.hpp>
#define std_par hpx::parallel
#elif defined(USE_INTEL_PSTL)
#include <pstl/execution>
#include <pstl/numeric>
#include <pstl/algorithm>
#define std_par std
#else
// Use the standard library implementation
#include <execution>
#include <numeric>
#include <algorithm>
#define std_par std
#endif
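int main(int argc, char *argv[])
{
    using std_par::execution::par;

    std::vector<int> xs(100000000);

    // Fill the vector in parallel
    std_par::fill(par, std::begin(xs), std::end(xs), 42);

    // The sum (4,200,000,000) does not fit into a 32-bit int,
    // so accumulate into a long long
    std::cout
        << std_par::reduce(par, std::begin(xs), std::end(xs), 0LL)
        << std::endl;
}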
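A namespace alias can play the same role as the std_par macro while staying visible to the compiler rather than the preprocessor. The sketch below covers only the standard-library branch; the alias name par_ns and the helper count_even are invented for this example, and the HPX variant mentioned in the comment is an untested assumption based on the hpx::parallel namespace described above.

```cpp
#include <algorithm>
#include <vector>

// Alias instead of a macro. For HPX this would presumably read
// `namespace par_ns = hpx::parallel;` (assumption, not verified here).
namespace par_ns = std;

// Hypothetical helper: counts the even numbers through the alias.
// No execution policy is passed, so this is the sequential overload.
long count_even(const std::vector<int>& xs)
{
    return par_ns::count_if(xs.begin(), xs.end(),
                            [](int x) { return x % 2 == 0; });
}
```

One trade-off is that the alias cannot be redefined later in the same translation unit, whereas a macro can, so the macro may still be the more convenient choice for quick experiments.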
Speed
It is good that we have two separate implementations of parallel STL algorithms to choose from. If you use only the algorithms defined by the C++17 standard, it will be easy to switch from one implementation to another to find the one most efficient for your particular use-case.
The HPX library seems to be significantly faster on my system (not Intel-based) with GCC than Intel’s PSTL, but your mileage might vary.
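If you want to compare the implementations on your own machine, a small timing helper is enough to get a rough idea. The following is only a sketch: time_ms and fill_and_sum are hypothetical names, the workload is sequential so that it builds without either library, and a single run like this says little without repetition and warm-up.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical helper: runs f once and returns the elapsed wall-clock
// time in milliseconds.
template <typename F>
double time_ms(F&& f)
{
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(f)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Hypothetical workload: fills a vector and sums it sequentially.
// Replace the body with the std_par calls to time a parallel version.
long long fill_and_sum(std::size_t n)
{
    std::vector<long long> xs(n);
    std::fill(xs.begin(), xs.end(), 42);
    return std::accumulate(xs.begin(), xs.end(), 0LL);
}
```

Calling time_ms with a lambda that invokes fill_and_sum once per build (HPX, PSTL, or the standard library) gives numbers that can be compared directly, as long as the compiler flags and input size stay the same.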