Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mlx5: fix a use-after-free error in mlx5_next_poll #1526

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 10 additions & 0 deletions providers/mlx5/cq.c
Original file line number Diff line number Diff line change
Expand Up @@ -1070,6 +1070,8 @@ static inline void _mlx5_end_poll(struct ibv_cq_ex *ibcq,
if (lock)
mlx5_spin_unlock(&cq->lock);

cq->cur_rsc = NULL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have that code (i.e. cq->cur_rsc = NULL) as part of mlx5_start_poll(), so it seems redundant.

Note:
In any case, it had to be done before line 1070 to be done under the lock, if the lock was applicable.


if (stall) {
if (stall == POLLING_MODE_STALL_ADAPTIVE) {
if (!(cq->flags & MLX5_CQ_FLAGS_FOUND_CQES)) {
Expand Down Expand Up @@ -1822,6 +1824,14 @@ void __mlx5_cq_clean(struct mlx5_cq *cq, uint32_t rsn, struct mlx5_srq *srq)
if (!cq || cq->flags & MLX5_CQ_FLAGS_DV_OWNED)
return;

/*
* Reset the cq->cur_rsc if it is associated with the QP to be
* destroyed in order to prevent use-after-free errors in the
* next ibv_next_poll().
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the scenario that you are referring to here ? a CQ which servers more than a single QP ?

In addition, if the 'lock' mode was used, this code will run only after mlx5_end_poll(), so the next mlx5_start_poll() should do the work by setting the pointer to NULL upon its entrance.

Note:
The code should not protect against incorrect application behavior (e.g., destroying a QP while still polling for its completions), especially in areas that might impact the data path.

Copy link
Author

@FujiZ FujiZ Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the scenario that you are referring to here ? a CQ which servers more than a single QP ?

Yeah in this case multiple QPs are attached to the same CQ.
The code sequence to reproduce this problem would look like this:

ibv_start_poll(); // get a work completion associated with the QP A
ibv_destroy_qp(); // destroy QP A here since we get a completion error
ibv_next_poll(); // try to get the next work completion for other QPs from the same CQ. UAF error is triggered here

destroying a QP while still polling for its completions

Do you mean that destroying a QP between ibv_start_poll() and ibv_end_poll() is not permitted? However, I haven't find any manual which describes this behaviour.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's permitted but doesn't look like a good practice, a few notes below.

You are referring only to a single-threaded application, correct? In a multi-threaded application, the call to ibv_destroy_qp() will remain blocked until ibv_end_poll() is invoked, ensuring that ibv_next_poll() is safe to use.

Additionally, we are only discussing a scenario where the CQ serves multiple QPs. Otherwise, it would not make sense to destroy a QP and continue polling its CQ, as this would clearly indicate incorrect application behavior.

The commit log should be rephrased to clarify the exact use case that we are talking about.

So, in the specific scenario that you are talking about, it can make sense to set the NULL upon __mlx5_cq_clean() while narrowing the comment near this line to be more specific as was mentioned above.

In any case, no need for a change in mlx5_end_poll() as I already mentioned.

if (unlikely(cq->cur_rsc && rsn == cq->cur_rsc->rsn))
cq->cur_rsc = NULL;

/*
* First we need to find the current producer index, so we
* know where to start cleaning from. It doesn't matter if HW
Expand Down