Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mlx5: fix a use-after-free error in mlx5_next_poll #1526

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

FujiZ
Copy link

@FujiZ FujiZ commented Nov 27, 2024

When using cq_ex inteface, if the user destroys the QP associated with the current work completion, the next ibv_next_poll() call will cause a use-after-free error since it needs to access the QP that has already been destroyed through cq->cur_rsc inside get_req_context().

Fix this error by resetting the cq->cur_rsc in __mlx5_cq_clean if it is associated with the QP to be destroyed.

@FujiZ FujiZ force-pushed the fz/fix-cq-cur-rsc branch 2 times, most recently from fbf359f to 1110f71 Compare November 27, 2024 03:46
When using cq_ex inteface, if the user destroys the QP associated with
the current work completion, the next ibv_next_poll() call will cause a
use-after-free error since it needs to access the QP that has already
been destroyed through cq->cur_rsc inside get_req_context().

Fix this error by resetting the cq->cur_rsc in __mlx5_cq_clean if it is
associated with the QP to be destroyed.

Signed-off-by: ZHOU Huaping <[email protected]>
@@ -1070,6 +1070,8 @@ static inline void _mlx5_end_poll(struct ibv_cq_ex *ibcq,
if (lock)
mlx5_spin_unlock(&cq->lock);

cq->cur_rsc = NULL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have that code (i.e. cq->cur_rsc = NULL) as part of mlx5_start_poll(), so it seems redundant.

Note:
In any case, it had to be done before line 1070 to be done under the lock, if the lock was applicable.

* Reset the cq->cur_rsc if it is associated with the QP to be
* destroyed in order to prevent use-after-free errors in the
* next ibv_next_poll().
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the scenario that you are referring to here ? a CQ which servers more than a single QP ?

In addition, if the 'lock' mode was used, this code will run only after mlx5_end_poll(), so the next mlx5_start_poll() should do the work by setting the pointer to NULL upon its entrance.

Note:
The code should not protect against incorrect application behavior (e.g., destroying a QP while still polling for its completions), especially in areas that might impact the data path.

Copy link
Author

@FujiZ FujiZ Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the scenario that you are referring to here ? a CQ which servers more than a single QP ?

Yeah in this case multiple QPs are attached to the same CQ.
The code sequence to reproduce this problem would look like this:

ibv_start_poll(); // get a work completion associated with the QP A
ibv_destroy_qp(); // destroy QP A here since we get a completion error
ibv_next_poll(); // try to get the next work completion for other QPs from the same CQ. UAF error is triggered here

destroying a QP while still polling for its completions

Do you mean that destroying a QP between ibv_start_poll() and ibv_end_poll() is not permitted? However, I haven't find any manual which describes this behaviour.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's permitted but doesn't look like a good practice, a few notes below.

You are referring only to a single-threaded application, correct? In a multi-threaded application, the call to ibv_destroy_qp() will remain blocked until ibv_end_poll() is invoked, ensuring that ibv_next_poll() is safe to use.

Additionally, we are only discussing a scenario where the CQ serves multiple QPs. Otherwise, it would not make sense to destroy a QP and continue polling its CQ, as this would clearly indicate incorrect application behavior.

The commit log should be rephrased to clarify the exact use case that we are talking about.

So, in the specific scenario that you are talking about, it can make sense to set the NULL upon __mlx5_cq_clean() while narrowing the comment near this line to be more specific as was mentioned above.

In any case, no need for a change in mlx5_end_poll() as I already mentioned.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants