
Issue with built-in multi-threading #158

Open
arunbpf opened this issue Sep 23, 2024 · 6 comments
Labels: question (Further information is requested)

Comments

arunbpf commented Sep 23, 2024

Hi,

The README mentions that Pipeless uses multi-threading by default. When I use variables/lists in the post-process hook that are updated across frames (for example, a frame count or a moving average of predictions), I run into race conditions, and the calculations that accumulate these values over time get corrupted.

What is the best way to handle this situation and ensure thread safety?
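
For illustration, a minimal sketch of the pattern described above; the hook signature follows the Pipeless Python examples, and the helper and variable names are hypothetical:

```python
# post-process.py (illustrative): module-level state shared by every worker thread
frame_count = 0
recent_scores = []  # rolling window used for a moving average

def get_frame_score(frame_data):
    # Hypothetical helper: extract whatever per-frame value you accumulate
    # (e.g. a detection confidence).
    return float(frame_data.get("confidence", 0.0))

def hook(frame_data, context):
    global frame_count
    # Pipeless runs this hook from multiple worker threads by default, so the
    # unsynchronized read-modify-write below can interleave and lose updates.
    frame_count += 1
    recent_scores.append(get_frame_score(frame_data))
    if len(recent_scores) > 30:
        recent_scores.pop(0)
    moving_average = sum(recent_scores) / len(recent_scores)  # may be computed from a torn list
    return frame_data
```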

miguelaeh (Collaborator)

You can use the built-in key-value store: https://www.pipeless.ai/docs/docs/v1/kvstore
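
For illustration, a minimal sketch of keeping a running count in the built-in key-value store instead of in Python globals. The set/get helpers below (pipeless_kvs_set / pipeless_kvs_get) are the ones described in the linked docs; confirm the exact names and value types there (this sketch assumes values are handled as strings):

```python
# post-process.py (illustrative): accumulate state in the built-in KV store
# rather than in module-level Python variables.

def hook(frame_data, context):
    # Read the previous count; the sketch assumes the store returns a string
    # (empty when the key has never been written).
    stored = pipeless_kvs_get("frame_count")
    count = int(stored) if stored else 0
    pipeless_kvs_set("frame_count", str(count + 1))
    return frame_data
```

Note that even with thread-safe set and get, a read-then-write sequence like the one above is still not atomic across threads, so ordering-sensitive accumulations may additionally need the stateful-hook approach below.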

miguelaeh (Collaborator)

You can also make the hooks where you process the data stateful, so that Pipeless sorts the frames before running the hook: https://www.pipeless.ai/docs/docs/v1/getting-started#stateless-vs-stateful-hooks
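
For illustration, a minimal sketch of a stateful post-process hook. Per the linked docs (and the marker shown later in this thread), the "# make stateful" comment on the first line tells Pipeless to treat the hook as stateful, so frames are processed in order rather than concurrently; the accumulator names are hypothetical:

```python
# make stateful
# post-process.py (illustrative): because the hook is stateful, frames pass
# through it one at a time and in order, so plain module-level state is safe.
frame_count = 0
recent_scores = []

def hook(frame_data, context):
    global frame_count
    frame_count += 1
    recent_scores.append(1.0)  # hypothetical per-frame value
    if len(recent_scores) > 30:
        recent_scores.pop(0)
    return frame_data
```

The trade-off is that a stateful hook processes frames sequentially, so that stage gives up per-frame parallelism.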

miguelaeh added the question label on Sep 24, 2024
arunbpf (Author) commented Oct 4, 2024

Are reads and writes to the built-in key-value store thread-safe?

miguelaeh (Collaborator)

Yes, it is

arunbpf (Author) commented Oct 14, 2024

When I tried to make post-process.py stateful by adding the comment below as the first line of the script,

# make stateful

I get the error below from ONNX Runtime. It runs fine in stateless mode. (I am using a YOLOv8 object detection model (ONNX) in process.py to detect the object, and a Transformer-based PyTorch model (.pth) in post-process.py to run OCR on the object region detected by YOLO.)

*************** EP Error ***************
EP Error /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:123 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:116 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] CUDA failure 900: operation not permitted when stream is capturing ; GPU=0 ; hostname=ip-172-31-14-86 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; line=283 ; expr=cudaDeviceSynchronize(); 

 when using ['CUDAExecutionProvider', 'CPUExecutionProvider']
Falling back to ['CUDAExecutionProvider', 'CPUExecutionProvider'] and retrying.
****************************************
*************** EP Error ***************
EP Error /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:123 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:116 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] CUDA failure 900: operation not permitted when stream is capturing ; GPU=0 ; hostname=ip-172-31-14-86 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; line=283 ; expr=cudaDeviceSynchronize(); 

 when using ['CUDAExecutionProvider', 'CPUExecutionProvider']
Falling back to ['CUDAExecutionProvider', 'CPUExecutionProvider'] and retrying.
****************************************
[2024-10-04T11:09:47Z ERROR pipeless_ai::stages::languages::python] Error executing hook: RuntimeError: /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:123 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:116 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] CUDA failure 900: operation not permitted when stream is capturing ; GPU=0 ; hostname=ip-172-31-14-86 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; line=283 ; expr=cudaDeviceSynchronize(); 
    
    
[2024-10-04T11:09:47Z WARN  pipeless_ai::pipeline] No frame returned from path execution, skipping frame forwarding to the output (if any).

0: 640x640 (no detections), 858.1ms

0: 640x640 (no detections), 10.0ms

Overall Post Processing Time:  2.193450927734375e-05
0: 640x640 (no detections), 10.3ms

0: 640x640 (no detections), 9.6ms

2024-10-04 11:09:49.118122639 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running Conv node. Name:'/model.0/conv/Conv' Status Message: CUDA error cudaErrorStreamCaptureUnsupported:operation not permitted when stream is capturing
[2024-10-04T11:09:49Z ERROR pipeless_ai::stages::languages::python] Error executing hook: Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Conv node. Name:'/model.0/conv/Conv' Status Message: CUDA error cudaErrorStreamCaptureUnsupported:operation not permitted when stream is capturing
[2024-10-04T11:09:49Z WARN  pipeless_ai::pipeline] No frame returned from path execution, skipping frame forwarding to the output (if any).

0: 640x640 (no detections), 9.6ms

0: 640x640 (no detections), 10.1ms

2024-10-04 11:09:49.152105642 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running Conv node. Name:'/model.0/conv/Conv' Status Message: CUDA error cudaErrorStreamCaptureUnsupported:operation not permitted when stream is capturing
[2024-10-04T11:09:49Z ERROR pipeless_ai::stages::languages::python] Error executing hook: Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Conv node. Name:'/model.0/conv/Conv' Status Message: CUDA error cudaErrorStreamCaptureUnsupported:operation not permitted when stream is capturing
[2024-10-04T11:09:49Z WARN  pipeless_ai::pipeline] No frame returned from path execution, skipping frame forwarding to the output (if any).

0: 640x640 (no detections), 8.6ms

0: 640x640 (no detections), 10.0ms

0: 640x640 (no detections), 9.9ms

0: 640x640 (no detections), 10.0ms
0: 640x640 (no detections), 7.1ms
0: 640x640 (no detections), 9.9ms
0: 640x640 (no detections), 10.0ms
0: 640x640 (no detections), 7.0ms
0: 640x640 (no detections), 8.1ms
0: 640x640 (no detections), 7.0ms
0: 640x640 (no detections), 12.7ms
[2024-10-04T11:11:35Z INFO  pipeless_ai::input::pipeline] Received received EOS from source pipeline0.
                    Pipeline id: 8958540b-24be-4d75-98da-5f670f93caf5 ended

Any idea why this would happen?

miguelaeh (Collaborator)

If you are using an ONNX model for the process hook, I would recommend using a process.json file instead of a Python file, since performance will be better.
Also, if you are using two models, you should create two stages instead of using both in the same stage.

Could you give that setup a try?
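
For reference, a minimal sketch of what such a process.json might contain. The field names below are assumptions based on the Pipeless inference-runtime docs and examples; confirm the exact schema there before using it:

```json
{
    "runtime": "onnx",
    "model_uri": "file:///path/to/yolov8.onnx",
    "inference_params": {
        "execution_provider": "cuda"
    }
}
```

The OCR model would then go in a second stage with its own hooks, and the stream's frame path would list both stages in order.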
