
Narrow bounds on kernel malloc allocations. #2261

Open · wants to merge 9 commits into base: dev

Conversation

@qwattash (Contributor) commented Dec 5, 2024

This is required to prevent memory leaks through the extra space in the allocations.
In general, kernel malloc uses a number of fixed-size UMA zones to back small allocations (<= 64K) and a dedicated malloc_large path, which grabs memory directly from kmem, for large allocations.
In both cases, the requested allocation size may be smaller than the actual committed size if the requested size is not representable. Representability is platform-dependent and we cannot make assumptions about the minimum representable size (although I think it will always be >= PAGE_SIZE).
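
A quick illustration (not code from the patch): a CHERI-aware compiler exposes builtins to query how a length is rounded for representability; how much rounding occurs, if any, is architecture-specific.

#include <stddef.h>

/*
 * Sketch only: the smallest length >= size that compressed capability
 * bounds can encode exactly. Small sizes are always representable;
 * sufficiently large sizes are rounded up by an architecture-dependent
 * amount. Requires a CHERI-aware compiler for the builtin.
 */
static inline size_t
representable_length(size_t size)
{
	return (__builtin_cheri_round_representable_length(size));
}

static inline int
is_representable(size_t size)
{
	return (representable_length(size) == size);
}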

With CHERI this is problematic because the caller receives a capability that may be larger than the requested size, and some of the additional committed space is reachable but uninitialized (unless M_ZERO is given).
This patch does the following:

  1. Add an interface function to UMA to recover the original item bounds (provisionally named uma_zgrow_bounds, I'm open to better names).
  2. Narrow bounds to the closest representable size in kernel malloc and realloc variants.
  3. Recover the original bounds for both small and large allocations on free, so that we maintain the invariant that the lower-level allocator always receives on free() the same capability it returned on alloc. In order to do this, we (continue to) abuse the vtoslab infrastructure for malloc_large to stash the original pointer in place of the zone pointer in the vm_page_t descriptor. I really think we should have a cleaner interface for this, but it is beyond the scope of this patch.
  4. Explicitly zero the committed allocation space that lies between the requested allocation size and the actual capability top. In other words, we zero the representability padding of the allocation (see the sketch after this list).
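
Roughly, steps 2 and 4 combine as in the sketch below. This is a hypothetical helper, not the patch code: it uses the CHERI builtins directly and assumes the usual kernel headers for M_ZERO and memset.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/malloc.h>

#ifdef __CHERI_PURE_CAPABILITY__
/*
 * Hypothetical helper (not the patch code): `item` is the pointer returned
 * by the backing UMA zone or kmem, `size` is the caller's requested size,
 * `flags` are the malloc(9) M_* flags.
 */
static void *
kmalloc_narrow_and_zero_pad(void *item, size_t size, int flags)
{
	size_t rsize;

	/* Step 2: narrow bounds to the closest representable size >= size. */
	rsize = __builtin_cheri_round_representable_length(size);
	item = __builtin_cheri_bounds_set(item, rsize);

	/* Step 4: zero the representability padding unless M_ZERO already did. */
	if ((flags & M_ZERO) == 0 && rsize > size)
		memset((char *)item + size, 0, rsize - size);

	return (item);
}
#endif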

qwattash and others added 9 commits December 5, 2024 13:44
The uma_zgrow_bounds function allows nested allocators that rely on UMA to recover
the full bounds of an item.
This prevents possible overflows into the slab header.
 - Narrow bounds on kmalloc allocations backed by UMA zones; this prevents the
 extra space in items from being accessible.
 - When the requested size is not representable, imply M_ZERO for UMA
 allocations. This ensures that no data is accidentally leaked through extra
 space in the item. Note that the current malloc KPI does not promise to protect
 against memory leaks due to improper initialization if M_ZERO is not set;
 therefore it might be sensible to only zero-initialize the padding space beyond
 the requested allocation size, which will likely not be initialized by the
 client code.

Patch co-authored by: Alfredo Mazzinghi <[email protected]>
This handles malloc_large allocations in the same way as UMA-backed allocations.
When the capability would not be representable, imply M_ZERO to ensure no data
leaks through the padding.

Patch co-authored by: Alfredo Mazzinghi <[email protected]>
This only affects memstat, which uses the private UMA structure
definitions.
Only zero the padding due to representable length rounding
for the requested allocation size. This should generally be cheaper
than zeroing the whole allocation.
Only zero the padding due to representable length rounding
for the requested allocation size. This should generally be cheaper
than zeroing the whole allocation.
When growing an existing allocation with an unrepresentable length,
we need to initialize any padding space to ensure there are no data leaks.
When copying data to a new allocation, we must ensure that no data
past the original capability length is copied.
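
The copy clamp described in the last commit message can be pictured as follows; the names and signature here are assumptions for illustration, not the patch's identifiers.

/*
 * Illustrative only: clamp the number of bytes copied into a new allocation
 * so that we never read past the original capability's length.
 */
static size_t
realloc_copy_size(const void *old, size_t osize, size_t newsize)
{
	size_t copysize = (osize < newsize) ? osize : newsize;
#ifdef __CHERI_PURE_CAPABILITY__
	size_t caplen = __builtin_cheri_length_get(old) -
	    __builtin_cheri_offset_get(old);

	if (copysize > caplen)
		copysize = caplen;
#endif
	return (copysize);
}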
@@ -680,7 +680,7 @@ void *
 	int indx;
 	caddr_t va;
 	uma_zone_t zone;
-#if defined(DEBUG_REDZONE) || defined(KASAN)
+#if defined(DEBUG_REDZONE) || defined(KASAN) || defined(__CHERI_PURE_CAPABILITY__)
 	unsigned long osize = size;
A Member commented on this change:
I admit a temptation to slap a __diagused on at least the local variables and make them unconditional.

@bsdjhb (Collaborator) commented Dec 6, 2024

Hmm, note that in userspace we have recently switched to not forcing tight bounds but in effect setting the bounds to what malloc_usable_size() would return for non-CHERI, that is, whatever the backing allocation is. This gives better performance for realloc() loops that grow an allocation. I think we still have a knob to enforce tight bounds for demo purposes, but for practical purposes the wider bounds are safe (there is no danger of memory reuse). From the description it sounds like you are using exact bounds when e.g. a 17-byte allocation is served from a 32-byte UMA bucket rather than just leaving the bounds at 32? If that is the case, I'd be inclined to just leave the bounds at 32.

@qwattash (Contributor, Author) commented Dec 10, 2024

> [quoting @bsdjhb's comment of Dec 6, 2024, above]

This was my original reasoning as well; however, there is an interesting interaction with the caller's responsibility for initialization. This patch is not really about bounds, it is about preventing memory leaks through uninitialized memory.

Let me explain in more detail:
In general it is never a spatial-safety violation to give out more space than requested, but the caller will not know (nor should it be required to know) how much extra memory is committed to the allocation.
In the absence of M_ZERO, this creates a situation in which the caller can be expected to take responsibility for initializing the memory it requests. If there are leaks through padding within the requested region, for instance, that is not the allocator's fault.

However, in the CHERI world (and arguably also in the non-CHERI world), the caller can access memory outside of the requested bounds if the allocator committed a larger chunk (e.g. a 64-byte UMA block for a 48-byte requested size).
Again, in the absence of M_ZERO, the allocator will not initialize this memory, so it is practically possible to leak data and capabilities from a previous kmalloc allocation that used the whole 64-byte block. Interestingly, this is picked up by KASAN.

This becomes increasingly tricky with unrepresentable sizes, because you cannot guarantee that bounds can be narrowed exactly in all cases, so you still need zero-initialization to protect against data/capability leaks across kmalloc callers.

This patch's use of bounds is really about reducing the need for zero-initialization. I think there are essentially 3 options (a small worked comparison follows the list):

  1. Imply M_ZERO for every allocation where real_size > requested_size.
  2. When not M_ZERO, clear the extra committed space between alloc + requested_size and alloc + real_size. Note that this can end up zeroing a significant chunk of memory (e.g. requested_size = 32K + 1 will use a 64K block and clear 32K - 1 bytes).
  3. When not M_ZERO, set the bounds of the capability up to the next representable boundary and only clear the region [alloc + requested_size, alloc + representable_size). For small allocations < PAGE_SIZE we never have to zero anything, and for larger ones we may have to clear some bytes, but generally far fewer than the whole padding region.
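
To make the trade-off concrete, here is a small userspace demo of the three options for the 32K + 1 example (hypothetical names, compiles only with a CHERI-aware compiler; the exact amount of padding in option 3 is architecture-dependent).

#include <stdio.h>

int
main(void)
{
	size_t req  = 32 * 1024 + 1;	/* requested size */
	size_t item = 64 * 1024;	/* committed UMA item size */
	size_t rep  = __builtin_cheri_round_representable_length(req);

	printf("option 1 zeroes %zu bytes (the whole committed item)\n", item);
	printf("option 2 zeroes %zu bytes (item size - requested size)\n",
	    item - req);
	printf("option 3 zeroes %zu bytes (representability padding only)\n",
	    rep - req);
	return (0);
}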

The root cause is using shared UMA zones across different C types, but this is a deeper issue.

Anyway, I agree that having a knob to turn it off is useful, and I will add one.

@bsdjhb (Collaborator) commented Dec 10, 2024

Hmmm, I guess part of what threw me off is that in the original PR comment, bullet point 4 seems to be option 2, so it read like we were doing both 1 and 2. Of the available options, I think I prefer something along the lines of option 2 for allocations <= PAGE_SIZE (which avoids the extra complication of re-deriving bounds for allocations using the small UMA buckets, which I think are the most likely potential hot paths) and option 3 for allocations > PAGE_SIZE. In the case of option 3, one thing you might consider is seeing if you can arrange for the layer that allocates VM pages to know about the padding area and try to allocate zeroed pages for the padding in the !M_ZERO case, to avoid having to zero when the pages are already zeroed.
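
One way to picture that suggestion (a sketch under the assumption that the large-allocation path can see the vm_page_t backing the padding; the helper name is hypothetical, not patch code):

#include <sys/param.h>
#include <sys/systm.h>
#include <vm/vm.h>
#include <vm/vm_page.h>

/*
 * Hypothetical helper: if the page backing the padding was handed out
 * pre-zeroed (PG_ZERO, e.g. obtained with VM_ALLOC_ZERO), skip the
 * explicit zeroing of the padding region.
 */
static void
zero_padding(vm_page_t m, void *pad, size_t padlen)
{
	if ((m->flags & PG_ZERO) == 0)
		memset(pad, 0, padlen);
}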

@qwattash (Contributor, Author) commented Dec 10, 2024

> [quoting @bsdjhb's comment of Dec 10, 2024, above]

I could have phrased it a bit better, probably. I think allocating zeroed pages through malloc_large is feasible, although I wonder whether it is just better to imply M_ZERO at that point and let/teach the underlying allocator pick zeroed pages. Plumbing it through UMA zones up to 64K is probably complicated because items never return to the VM allocator (I need to re-check what UMA actually does, but I think it at least tries to cache small buckets of items larger than PAGE_SIZE). I would need to teach UMA to swap the physical pages underneath the allocation for zeroed ones, which would have implications for ctors/dtors.
