Convert OpenMP parallelization to OneAPI::TBB #6626

dbs4261 · 2024-01-27T03:35:43Z

OpenMP acceleration has been migrated to use oneapi::TBB.

Type

Bug fix (non-breaking change which fixes an issue): Fixes #
New feature (non-breaking change which adds functionality). Resolves #
Breaking change (fix or feature that would cause existing functionality to not work as expected) Resolves #N/A

Motivation and Context

Many components of Open3D imply an eventual shift away from OpenMP to TBB. This includes some sections where TBB is only used on one platform because 2D loop unrolling isn't supported on Win32. Lastly, mixing multiple parallelization paradigms makes nested parallelism problematic: when some Open3D methods are called from a TBB context, an OpenMP thread pool is created for each TBB thread.

Checklist:

I have run python util/check_style.py --apply to apply Open3D code style
to my code.
This PR changes Open3D behavior or adds new functionality.
- Both C++ (Doxygen) and Python (Sphinx / Google style) documentation is
  updated accordingly.
- I have added or updated C++ and / or Python unit tests OR included test
  results (e.g. screenshots or numbers) here.
I will follow up and update the code if CI fails.
For fork PRs, I have selected Allow edits from maintainers.

Description

Updated parallel for sections to use tbb::parallel_for. Adapted most loops that performed reductions, either with omp reduction clauses or with critical sections, to tbb::parallel_reduce implementations. Some of these required custom reduction objects instead of lambdas. Added an atomic version of the ProgressBar for use with TBB.
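To illustrate the shape of the reduction rewrite described above, here is a minimal, dependency-free sketch. It is not the PR's code and does not call TBB: `ChunkedReduce` and `SumOfSquares` are hypothetical names, and the chunks run one after another instead of on worker threads. It only shows the split / private partial / join structure that replaces a shared accumulator guarded by a critical section.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper (not Open3D or TBB API): reduces the range [0, n) chunk
// by chunk. `body(begin, end, partial)` folds one chunk into a private partial
// result and returns it; `join(a, b)` merges two partial results. Because
// `body` never touches shared state, no critical section is needed.
template <typename T, typename Body, typename Join>
T ChunkedReduce(std::size_t n, std::size_t grain, T identity, Body body,
                Join join) {
    if (grain == 0) grain = 1;  // Guard against an infinite loop.
    std::vector<T> partials;
    for (std::size_t begin = 0; begin < n; begin += grain) {
        const std::size_t end = std::min(n, begin + grain);
        // With a parallel runtime each chunk could run on a different worker;
        // here the chunks run sequentially to keep the sketch self-contained.
        partials.push_back(body(begin, end, identity));
    }
    T result = identity;
    for (const T& partial : partials) result = join(result, partial);
    return result;
}

// Example reduction: sum of i * i for i in [0, n).
inline long long SumOfSquares(std::size_t n, std::size_t grain) {
    return ChunkedReduce<long long>(
            n, grain, 0LL,
            [](std::size_t begin, std::size_t end, long long partial) {
                for (std::size_t i = begin; i < end; ++i) {
                    partial += static_cast<long long>(i) *
                               static_cast<long long>(i);
                }
                return partial;
            },
            [](long long a, long long b) { return a + b; });
}
```

The body/join pair has the same roles as the two callables passed to a lambda-based parallel reduce. A custom reduction object becomes preferable when the partial result carries more state than a single value.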

There is still work to be done in documentation. This will break any user code that directly uses ParallelForCPU, as OpenMP critical sections will no longer work. Additionally, TBB has no counterpart to OpenMP's OMP_NUM_THREADS for setting the maximum number of threads. In C++ code a tbb::global_control object could be used, but it is unclear how to provide that sort of functionality for Python users.

update-docs · 2024-01-27T03:35:47Z

Thanks for submitting this pull request! The maintainers of this repository would appreciate if you could update the CHANGELOG.md based on your changes.

dbs4261 · 2024-01-27T18:18:30Z

Ok, I ran my tests in my development environment. Guess I should use the docker containers to replicate the CI environment and figure out those tests.

ssheorey · 2024-01-30T05:01:06Z

Hi @dbs4261 thanks for picking this up!

Possibly fixes #6544

errissa · 2024-02-06T19:23:55Z

@dbs4261 Thanks for working on this! I just tested this PR on my Mac and got numerous TBB-related compilation errors. I tried using the Homebrew version of TBB as well as the "build from source" configuration. There appear to be functions that this PR uses that are missing from the Homebrew and "build from source" versions of TBB on Mac.

I know this PR is still draft but wanted to report what I had found. Please let me know if you need any help testing/diagnosing issues on Mac.

dbs4261 · 2024-02-06T20:57:13Z

Hi @ssheorey, this PR likely won't fix that issue, as I haven't yet changed how the TBB dependency is being accessed. This is likely also why @errissa is facing issues building on Mac.

@errissa is homebrew pulling the OneAPI version of TBB? If you can provide me with the version of TBB you tried and the compiler errors I can take a look and figure out which version is required and work that into the PR.

ssheorey · 2024-02-06T23:58:21Z

@dbs4261 yes, you are right about not fixing #6544. We should update to the latest oneTBB as part of this PR to fix that though.

This is the latest version of oneTBB and is available for all platforms on github:

https://github.com/oneapi-src/oneTBB/releases/tag/v2021.11.0

The naming is off - this was released in Nov 2023.

I think this should also resolve @errissa 's issues on macOS.

dbs4261 · 2024-02-07T01:07:31Z

I agree that setting the version requirement for TBB should be part of this PR. Based on the Ubuntu failure in CI, it's the collaborative_call_once header that is missing. The TBB repo says the header hasn't been modified in 3 years, so I would think that any version that reports 2021+ should be fine. What does Open3D CI currently use for TBB?

errissa · 2024-02-07T02:16:50Z

@dbs4261 @ssheorey is correct about the oneTBB version. Homebrew's most recent version is 2021.11.0, so if this PR builds successfully against it, it would solve the macOS issue I experienced.

dbs4261 · 2024-02-07T18:48:21Z

Looks like the minimum version requirement for collaborative_call_once.h is v2021.4.0. It also looks like we aren't putting version requirements in the find package scripts in 3rdparty/find_dependencies.cmake, which means that errors like @errissa's can still happen when using the system library. This raises the question of whether I should set the system version requirement to the same version that I am providing in the ExternalProject_add call, or add the newest version but set the system requirement to the minimal version.

ssheorey · 2024-02-10T00:02:32Z

Hi @dbs4261, our usual policy is to upgrade to the latest version available, but set the minimum version to what is required to make everything work. This helps to "future-proof" the updated code as much as possible by incorporating the latest bugfixes. Official binaries will be built with the latest version, while users can still build the library against older versions.

ssheorey

[Initial look]

ssheorey · 2024-02-24T04:21:37Z

3rdparty/mkl/tbb.cmake

@@ -26,13 +26,10 @@ find_package(Git QUIET REQUIRED)
 ExternalProject_Add(
    ext_tbb
    PREFIX tbb
-    URL https://github.com/wjakob/tbb/archive/141b0e310e1fb552bdca887542c9c1a8544d6503.tar.gz # Sept 2020
-    URL_HASH SHA256=bb29b76eabf7549660e3dba2feb86ab501469432a15fb0bf2c21e24d6fbc4c72
+    URL https://github.com/oneapi-src/oneTBB/archive/refs/tags/v2021.4.0.tar.gz


Can we upgrade to the latest? v2021.11.0

No reason why not. I just put in the older version that had all the features I used.

ssheorey · 2024-02-24T04:22:42Z

cpp/open3d/geometry/PointCloudSegmentation.cpp

There's a merge conflict here. The CI can run only after it's fixed.

ssheorey · 2024-02-24T04:25:34Z

cpp/open3d/core/ParallelFor.h

-        func(i);
-    }
+    tbb::parallel_for(tbb::blocked_range<int64_t>(0, n, 32),
+                      [&func](const tbb::blocked_range<int64_t>& range) {


How many threads will be used here? Currently, it's estimated with utility::EstimateMaxThreads() which gives us one thread per core (excluding hyperthreading).

Also, avoid using "magic numbers" (32). I think you have a GetDefaultChunkSize() function.

It will use up to the number of threads in the task arena that called it. As for the chunk size, see my other comment.

ssheorey · 2024-02-27T15:51:05Z

cpp/open3d/utility/Parallel.cpp

-        return "";
-    }
-}
+int EstimateMaxThreads() { return tbb::this_task_arena::max_concurrency(); }


Can we use the number of cores (not number of HW threads)?

No, the number of tasks is determined by the caller. A caller could be using a small task arena to deal with IO, while a larger arena deals with processing something else. This actually brings up an issue that I don't yet know how to solve. TBB sets the maximum concurrency with a C++ variable that follows scope rules but doesn't need to be passed to functions. So I don't know how a Python user would set the concurrency limit yet. I think it might need to be done with some sort of context manager. But I guess this changes behavior in environments where the number of threads was limited with the OpenMP environment variable.

ssheorey · 2024-02-27T15:58:59Z

cpp/open3d/utility/Parallel.cpp

-    return 1;
-#endif
+std::size_t& DefaultGrainSizeTBB() noexcept {
+    static std::size_t GrainSize = 256;


Can you comment on how this value was selected? Did you see any performance differences for this value versus other values?

Honestly, I was guessing at grain size from this, but it really should be picked based on profiling. My understanding is that the grain size provides loose guidance to TBB's automatic chunking mechanism. It works similarly to omp schedule(guided). Overall the goal is to give each thread enough work that the overhead of chunking is minimized, while keeping chunks small enough that the scheduler can go back in and steal some if one of the threads gets held up. It might be worth taking another pass through the grain sizes that I put in and setting them as a magic number times the DefaultGrainSizeTBB (which is mutable). That way the chunk size could be higher for doing a single operation with tensors, and smaller when looping through complex sections like in RANSAC.
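A minimal sketch of the "factor times a mutable default" idea from the reply above. `DefaultGrainSize` mirrors the accessor shape shown in the diff, while `ScaledGrainSize` and the two factor constants are hypothetical names with placeholder values; in practice they would come from profiling.

```cpp
#include <algorithm>
#include <cstddef>

// Mutable process-wide default, returned by reference so callers can tune it.
// The initial value 256 is the one shown in the diff above.
inline std::size_t& DefaultGrainSize() noexcept {
    static std::size_t grain_size = 256;
    return grain_size;
}

// Hypothetical per-call-site factors (placeholders, not measured values):
// larger chunks for cheap element-wise work, smaller chunks for expensive
// iterations such as a RANSAC loop.
constexpr double kCheapElementwiseFactor = 4.0;
constexpr double kExpensiveIterationFactor = 0.25;

// Scales the current default by `factor`, never returning less than 1.
inline std::size_t ScaledGrainSize(double factor) {
    if (!(factor > 0.0)) return 1;  // Also rejects NaN.
    const double scaled = static_cast<double>(DefaultGrainSize()) * factor;
    return std::max<std::size_t>(1, static_cast<std::size_t>(scaled));
}
```

A call site would pass `ScaledGrainSize(kCheapElementwiseFactor)` as the grain size of its range, so changing the default once rescales every loop while keeping their relative sizes.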

ssheorey · 2024-03-07T22:00:34Z

[Notes about linking and binary distribution]

For linking TBB, the recommendation is to link dynamically. For C++ binaries and applications, we will distribute the TBB DLL along with the Open3D DLL.
uxlfoundation/oneTBB#646

For Python, TBB libraries are available through PyPI, so we can add these as dependencies to requirements.txt
https://community.intel.com/t5/Intel-oneAPI-Threading-Building/How-to-ship-a-package-using-TBB-on-PyPI-manylinux/m-p/1227574

benjaminum · 2024-03-15T18:18:10Z

cpp/open3d/utility/ProgressBar.h

@@ -15,30 +18,57 @@ namespace utility {
 class ProgressBar {
 public:
    ProgressBar(size_t expected_count,
-                const std::string &progress_info,
+                std::string progress_info,


Why has the const been removed here?

It has to be copied into the object, so it is passed by value into the constructor and then moved into the member variable.
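The pass-by-value-then-move idiom from the reply above, as a minimal sketch. `NamedCounter` is a hypothetical stand-in, not the actual ProgressBar class.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical class illustrating a "sink" constructor parameter.
class NamedCounter {
public:
    // `info` is taken by value: an lvalue argument is copied once into the
    // parameter and an rvalue argument is moved into it. The parameter is
    // then moved into the member, so no second copy is made.
    NamedCounter(std::size_t expected_count, std::string info)
        : expected_count_(expected_count), info_(std::move(info)) {}

    std::size_t expected_count() const { return expected_count_; }
    const std::string& info() const { return info_; }

private:
    std::size_t expected_count_;
    std::string info_;
};
```

With a `const std::string&` parameter the member initialization always copies, even when the caller passes a temporary. Taking the parameter by value lets temporaries be moved all the way into the member.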

…he progress bar into its own function and a bulk inplace add function operator+=. Also added TBBProgressBar. It does not inherit from ProgressBar as it uses an atomic for counting and has slightly different internals to use that atomicity.
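The atomic counting and bulk `operator+=` mentioned above could look roughly like the following. This is a sketch under assumptions, since the internals of TBBProgressBar are not shown in this thread: `AtomicProgressCounter` and its members are hypothetical names, and only the counting is modelled, not the terminal output.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical counter: many threads may report progress concurrently
// without taking a mutex.
class AtomicProgressCounter {
public:
    explicit AtomicProgressCounter(std::size_t expected_count)
        : expected_count_(expected_count), current_count_(0) {}

    // Adds `n` finished items in one atomic step and returns the new total.
    // Reporting in bulk keeps contention on the counter low.
    std::size_t Add(std::size_t n) noexcept {
        return current_count_.fetch_add(n, std::memory_order_relaxed) + n;
    }

    AtomicProgressCounter& operator+=(std::size_t n) noexcept {
        Add(n);
        return *this;
    }

    AtomicProgressCounter& operator++() noexcept {
        Add(1);
        return *this;
    }

    std::size_t Current() const noexcept {
        return current_count_.load(std::memory_order_relaxed);
    }

    // Progress in whole percent, clamped to 100.
    std::size_t Percent() const noexcept {
        if (expected_count_ == 0) return 100;
        const std::size_t current = Current();
        if (current >= expected_count_) return 100;
        return current * 100 / expected_count_;
    }

private:
    const std::size_t expected_count_;
    std::atomic<std::size_t> current_count_;
};
```

A worker that processes a chunk of items can call `counter += chunk_size` once at the end of the chunk instead of incrementing once per item.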


PKizzle · 2024-10-24T09:29:04Z

Is there anything that can be done to fix the two failing tests?

_______________________ test_get_surface_area[device0] ________________________

device = CPU:0

    @pytest.mark.parametrize("device", list_devices())
    def test_get_surface_area(device):
        # Test with custom parameters.
        cube = o3d.t.geometry.TriangleMesh.create_box(float_dtype=o3c.float64,
                                                      int_dtype=o3c.int32,
                                                      device=device)
        np.testing.assert_equal(cube.get_surface_area(), 6)
    
        empty = o3d.t.geometry.TriangleMesh(device=device)
        empty.get_surface_area()
        np.testing.assert_equal(empty.get_surface_area(), 0)
    
        # test noncontiguous
        sphere = o3d.t.geometry.TriangleMesh.create_sphere(device=device)
        area1 = sphere.get_surface_area()
        sphere.vertex.positions = sphere.vertex.positions.T().contiguous().T()
        sphere.triangle.indices = sphere.triangle.indices.T().contiguous().T()
        area2 = sphere.get_surface_area()
>       np.testing.assert_almost_equal(area1, area2)

python\test\t\geometry\test_trianglemesh.py:859: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (12.501888275146484, 12.501884460449219), kwds = {}

    @wraps(func)
    def inner(*args, **kwds):
        with self._recreate_cm():
>           return func(*args, **kwds)
E           AssertionError: 
E           Arrays are not almost equal to 7 decimals
E            ACTUAL: 12.501888275146484
E            DESIRED: 12.501884460449219

C:\hostedtoolcache\windows\Python\3.11.9\x64\Lib\contextlib.py:81: AssertionError

_______________________________ test_color_map ________________________________

    def test_color_map():
        """
        Hard-coded values are from the 0.12 release. We expect the values to match
        exactly when OMP_NUM_THREADS=1. If more threads are used, there could be
        some small numerical differences.
        """
        o3d.utility.set_verbosity_level(o3d.utility.VerbosityLevel.Debug)
    
        # Load dataset
        mesh, rgbd_images, camera_trajectory = load_fountain_dataset()
    
        # Computes averaged color without optimization, for debugging
        mesh, camera_trajectory = o3d.pipelines.color_map.run_rigid_optimizer(
            mesh, rgbd_images, camera_trajectory,
            o3d.pipelines.color_map.RigidOptimizerOption(maximum_iteration=0))
        vertex_mean = np.mean(np.asarray(mesh.vertex_colors), axis=0)
        extrinsic_mean = np.array(
            [c.extrinsic for c in camera_trajectory.parameters]).mean(axis=0)
>       np.testing.assert_allclose(vertex_mean,
                                   np.array([0.40322907, 0.37276872, 0.54375919]),
                                   rtol=1e-5)

python\test\test_color_map_optimization.py:49: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (<function assert_allclose.<locals>.compare at 0x0000027AB5A5F240>, array([0.42966498, 0.39627099, 0.57662147]), array([0.40322907, 0.37276872, 0.54375919]))
kwds = {'equal_nan': True, 'err_msg': '', 'header': 'Not equal to tolerance rtol=1e-05, atol=0', 'verbose': True}

    @wraps(func)
    def inner(*args, **kwds):
        with self._recreate_cm():
>           return func(*args, **kwds)
E           AssertionError: 
E           Not equal to tolerance rtol=1e-05, atol=0
E           
E           Mismatched elements: 3 / 3 (100%)
E           Max absolute difference: 0.03286228
E           Max relative difference: 0.06556053
E            x: array([0.429665, 0.396271, 0.576621])
E            y: array([0.403229, 0.372769, 0.543759])

C:\hostedtoolcache\windows\Python\3.11.9\x64\Lib\contextlib.py:81: AssertionError

dbs4261 · 2024-10-30T18:39:04Z

I can take a look. They are both in the Python library, correct?

PKizzle · 2024-10-31T09:01:53Z

Yes, I guess so. You can find them here:

python\test\t\geometry\test_trianglemesh.py:859
python\test\test_color_map_optimization.py:49
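One possible contributor to the first failure is that a parallel reduction adds floating-point partial sums in a different order than a sequential loop. This is an assumption and is not confirmed anywhere in this thread. The test_color_map docstring above already notes that thread count can cause small numerical differences. Reordering is unlikely to explain the much larger mismatch in test_color_map on its own. The sketch below uses hypothetical helper names and contrived inputs to show that single-precision addition is not associative.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Adds the values left to right into a single float accumulator.
inline float SumSequential(const std::vector<float>& values) {
    float sum = 0.0f;
    for (float v : values) sum += v;
    return sum;
}

// Adds the values chunk by chunk: each chunk gets its own partial sum, and
// the partial sums are then added together. This mimics the summation order
// of a chunked reduction, although it runs on a single thread here.
inline float SumChunked(const std::vector<float>& values, std::size_t chunk) {
    if (chunk == 0) chunk = 1;  // Guard against an infinite loop.
    float total = 0.0f;
    for (std::size_t begin = 0; begin < values.size(); begin += chunk) {
        const std::size_t end = std::min(values.size(), begin + chunk);
        float partial = 0.0f;
        for (std::size_t i = begin; i < end; ++i) partial += values[i];
        total += partial;
    }
    return total;
}
```

For the input `{1e8f, 1.0f, -1e8f, 1.0f}` on an IEEE 754 platform without fast-math optimizations, the sequential sum is 1 while the sum with chunks of two is 0, because each 1.0f is absorbed when added to a value of magnitude 1e8. Real meshes produce far smaller discrepancies, but the mechanism is the same.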

dbs4261 changed the title ~~Omp2tbb~~ concert Jan 27, 2024

dbs4261 changed the title ~~concert~~ Convert OpenMP parallelization to OneAPI::TBB Jan 27, 2024

ssheorey requested review from errissa and ssheorey February 10, 2024 00:03

ssheorey requested a review from benjaminum February 20, 2024 15:45

ssheorey reviewed Feb 27, 2024


ssheorey mentioned this pull request Mar 15, 2024

Error when installing open3d for conda environment, missing libomp, seg fault when installed #6196

Open

3 tasks

ssheorey linked an issue Mar 15, 2024 that may be closed by this pull request

Error when installing open3d for conda environment, missing libomp, seg fault when installed #6196

Open

3 tasks

benjaminum reviewed Mar 15, 2024


ssheorey added this to the v0.20 milestone Apr 29, 2024

ssheorey added the build/install Build or installation issue label Apr 30, 2024

dbs4261 force-pushed the omp2tbb branch from c897a5a to ec98cae Compare September 16, 2024 21:26

dbs4261 added 6 commits September 16, 2024 21:22

Switched from using OpenMP for parallelism to using oneAPI::TBB (though not using the oneapi scope). Untested but building.

46e7d63

Swichted usage of std::mutex to tbb::mutex for consistency when used in conjunction with tbb parallel constructs.

0c36784

Fixed bug in CPU reduction

1514fc1

Shift atomic to outside of RW mutex in PointCloudSegmentation.cpp

31f1e25

Switch from using a mutex to a concurrent vector for parallelization of self intersecting triangles.

d7a82f1

Get maximum threads from TBB instead of OpenMP

181a431

dbs4261 and others added 16 commits September 16, 2024 21:23

Updated ClusterDBSCAN in PointCloudCluster.cpp to bulk update the progress bar to limit spinning on the mutex.

bd258e3

Applied Open3D style

2789817

Marked single argument constructors for reductions as explicit to please codacy

6399001

Explicitly load atomics in calls to utility::Log*(...) because newer versions of format wont automatically convert it to its underlying type.

11c546d

style fix

f48abcc

Fixed incorrect include in Parallel.h

c04077f

Updated ProgressBar.h to use size_t in std namespace for consistency

7d6c707

Switched std::lock_guard<std::mutex> for DiscreteGenerator to use tbb::spin_mutex::scoped_lock.

bba01ef

Switch to using tbb::parallel_reduce on all platforms.

ec98cae

Fixed progress bar bug that allow for multiple threads writing to the terminal.

dbe4133

Fixed issue with CPU reduction. Need to exit early if no work is to be done to prevent assignment to the output pointer.

25d8e7f

Updated MemoryManagerStatistic to use a tbb::spin_mutex instead of a std::mutex

929f010

Updated point cloud segmentation to generate random samples using the global mutex from utilities::random.

31484e5

Updated example usage of global mutex and engine access to reflect the TBB types.

81b9bb3

Merge branch 'main' into omp2tbb

23ce717

benjaminum mentioned this pull request Sep 28, 2024

Faster CPU (Arg-)Reductions #6989

Merged

9 tasks

Turn TBB into publicly linked library

86a474d
