In this demo we will show how the no_new_privs bit can be used to block privilege escalation through SUID binaries and binaries with file capabilities.
NOTE: The tests below were run on a Fedora 38 machine with Podman v4.6.2; results may vary with other OS or Podman versions.
First, we will see how no_new_privs can be used to neutralize SUID binaries.
- Create a small container image that has the `whoami` binary configured with the SETUID bit:

  ```
  cat <<EOF > /tmp/whoami-setuid.dockerfile
  FROM fedora:38
  RUN chmod +s /usr/bin/whoami
  ENTRYPOINT /usr/bin/whoami
  EOF
  podman build -f /tmp/whoami-setuid.dockerfile -t whoami-setuid
  ```
- If we run the image as user `1024` without setting the no_new_privs bit, this is what we get:

  ```
  podman run -it --rm --user=1024 whoami-setuid
  root
  ```

  NOTE: As you can see, the privilege escalation happened.
- If we run the image as user `1024` with the no_new_privs bit set:

  ```
  podman run -it --rm --user=1024 --security-opt=no-new-privileges whoami-setuid
  1024
  ```

  NOTE: In this case, the privilege escalation was blocked.
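A quick way to confirm whether the bit is active for the current process, inside or outside a container, is to read it from procfs; a minimal sketch (Linux only):

```shell
# NoNewPrivs is 1 when the no_new_privs bit is set for this thread,
# 0 otherwise; the bit is inherited across fork and execve.
grep NoNewPrivs /proc/self/status
```

Running this as the container command with and without `--security-opt=no-new-privileges` shows the flag flipping between 0 and 1.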
In this demo we will see how no_new_privs can be used to prevent users from running binaries with file capabilities when those capabilities are not in the thread's permitted and effective capability sets.
- Run the container as user 1024 and without setting the no_new_privs bit:

  ```
  podman run --rm -it --entrypoint /bin/bash --user 1024 -e APP_PORT=80 --name reversewords-test quay.io/mavazque/reversewords-captest:latest
  ```

  NOTE: The image we are using for this test ships a small web service with the NET_BIND_SERVICE file capability configured on the binary.
- Get the file capabilities of the web service binary:

  ```
  getcap /usr/bin/reverse-words
  /usr/bin/reverse-words = cap_net_bind_service+eip
  ```
- Get the container's thread capabilities:

  ```
  grep Cap /proc/1/status
  CapInh: 0000000000000000
  CapPrm: 0000000000000000
  CapEff: 0000000000000000
  CapBnd: 00000000800405fb
  CapAmb: 0000000000000000
  ```
- Decode the thread's bounding set capabilities:

  ```
  capsh --decode=00000000800405fb
  0x00000000800405fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_sys_chroot,cap_setfcap
  ```
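The capsh decoding can also be checked by hand: each capability is one bit in the mask, indexed by its number in `<linux/capability.h>`. A small sketch, using the CapBnd value from above and CAP_NET_BIND_SERVICE (capability number 10):

```shell
mask=$((0x00000000800405fb))   # CapBnd value read from /proc/1/status
capnum=10                      # CAP_NET_BIND_SERVICE in <linux/capability.h>
# Shift the mask right by the capability number and test the lowest bit
if [ $(( (mask >> capnum) & 1 )) -eq 1 ]; then
  echo "cap_net_bind_service is present in the mask"
fi
```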
- Execute the binary:

  ```
  /usr/bin/reverse-words
  2023/10/17 07:26:06 Starting Reverse Api v0.0.21 Release: NotSet
  2023/10/17 07:26:06 Listening on port 80
  ```

  NOTE: As expected, the binary was able to gain the NET_BIND_SERVICE capability and bind to port 80.
- Now we run the container with the no_new_privs bit set:

  ```
  podman run --rm -it --entrypoint /bin/bash --security-opt no-new-privileges --user 1024 -e APP_PORT=80 --name reversewords-test quay.io/mavazque/reversewords-captest:latest
  ```
- File caps and thread caps remain the same as in the previous run; let's run the binary:

  ```
  /usr/bin/reverse-words
  2023/10/17 07:26:19 Starting Reverse Api v0.0.21 Release: NotSet
  2023/10/17 07:26:19 Listening on port 80
  2023/10/17 07:26:19 listen tcp :80: bind: permission denied
  ```

  NOTE: This time the binary couldn't raise the capability into the thread's effective set due to the no_new_privs bit.
- This time we run as root (uid 0) and with no_new_privs set:

  ```
  podman run --rm -it --entrypoint /bin/bash --security-opt no-new-privileges --user 0 -e APP_PORT=80 --name reversewords-test quay.io/mavazque/reversewords-captest:latest
  ```
- Get the container's thread capabilities:

  ```
  grep Cap /proc/1/status
  CapInh: 0000000000000000
  CapPrm: 00000000800405fb
  CapEff: 00000000800405fb
  CapBnd: 00000000800405fb
  CapAmb: 0000000000000000
  ```
- Decode the thread's effective capabilities:

  ```
  capsh --decode=00000000800405fb
  0x00000000800405fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_sys_chroot,cap_setfcap
  ```
- This time NET_BIND_SERVICE is already in the thread's effective set, which means we can use it: no_new_privs only blocks gaining new privileges, not using ones we already have.

  ```
  /usr/bin/reverse-words
  2023/10/17 07:27:26 Starting Reverse Api v0.0.21 Release: NotSet
  2023/10/17 07:27:26 Listening on port 80
  ```
In this demo we're going to show how to audit containers abusing setuid/sudo on an OpenShift cluster.

Even though file capabilities can lead to privilege escalation, we have good tools today to keep pods from getting such capabilities via SCCs: the restricted-v2 SCC does a pretty good job of limiting the capabilities available to pods by default. On the other hand, we have the "becoming root / executing as root" problem, which the v2 SCCs introduced back in OCP 4.11 help to mitigate, since v2 SCCs set allowPrivilegeEscalation to false.

To showcase a scenario where an application abuses setuid, kudos to r/linuxquestions for providing the testing app (quay.io/fherrman/my-ubi-setuid:0.4):
```c
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    setuid(0);  /* become root; works because of the setuid bit on the binary */
    execl("/bin/bash", "bash", "-p", (char *)NULL);  /* -p: keep the elevated euid */
}
```
As you can see, the container image ships this binary with the setuid bit set at /usr/bin/bashwrap:

```
$ ls -l /usr/bin/bashwrap
-rwsr-xr-x. 1 root root 17488 Nov 12 22:42 /usr/bin/bashwrap
```
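The lowercase `s` in `-rwsr-xr-x` is the setuid bit stacked on top of the owner's execute bit; numerically this is mode 4755. A quick illustration on a scratch file (the path /tmp/demo-suid is hypothetical, and GNU coreutils is assumed for `stat -c`):

```shell
touch /tmp/demo-suid
chmod 4755 /tmp/demo-suid           # 4 = setuid bit, 755 = rwxr-xr-x
ls -l /tmp/demo-suid | cut -c1-10   # prints -rwsr-xr-x
stat -c '%a' /tmp/demo-suid         # prints 4755
```

Note that setting the bit is just a mode change; the escalation only happens when the file is owned by root and then executed by another user.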
Now that we have introduced the test app, let's see how a user can build a container image with a binary like the one above in order to become root under an SCC that doesn't allow running as root but does allow privilege escalation.
- Create a namespace for our tests:

  ```
  oc create ns test-priv-esc
  ```
- Deploy our application:

  - Since v2 SCCs restrict privilege escalation, we grant the default SA in this project access to the old restricted SCC to showcase this issue:

    ```
    oc -n test-priv-esc adm policy add-scc-to-user restricted -z default
    ```

  - Deploy the application:

    ```
    cat <<EOF | oc -n test-priv-esc create -f -
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      creationTimestamp: null
      labels:
        app: privescalation
      name: privescalation
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: privescalation
      strategy: {}
      template:
        metadata:
          creationTimestamp: null
          labels:
            app: privescalation
        spec:
          containers:
          - image: quay.io/fherrman/my-ubi-setuid:0.4
            command: ["sleep","9999"]
            name: my-ubi-setuid
            resources: {}
            securityContext:
              allowPrivilegeEscalation: true
    status: {}
    EOF
    ```
- If we check the SCC assigned to our pod, we will see that the restricted SCC was assigned to our workload:

  ```
  oc -n test-priv-esc get pod -l app=privescalation -o yaml | grep scc
  openshift.io/scc: restricted
  ```
- The restricted SCC does not allow containers to run with a root uid (0). Let's review how our setuid binary is configured and try to execute it:

  - Connect to the pod:

    ```
    oc -n test-priv-esc rsh deployment/privescalation
    sh-4.4$
    ```

  - Check the app binary configuration:

    ```
    ls -l /usr/bin/bashwrap
    -rwsr-xr-x. 1 root root 17488 Nov 12 22:42 /usr/bin/bashwrap
    ```

  - Execute the app:

    ```
    /usr/bin/bashwrap
    bash-4.4#
    ```

  - Check our effective uid:

    ```
    id
    uid=1000680000(1000680000) gid=0(root) euid=0(root) groups=0(root),1000680000
    ```

    NOTE: As you can see, our effective uid is 0.
- This means that at this point the container process is running as root on the node. Let's verify it:

  - Get the node where the pod is running:

    ```
    OCP_NODE=$(oc -n test-priv-esc get pod -l app=privescalation -o jsonpath='{.items[*].spec.nodeName}')
    ```

  - Open a debug session into that node:

    ```
    oc debug node/${OCP_NODE}
    chroot /host
    ```

  - In a different terminal, connect to the pod, execute our app and run a "sleep 288":

    ```
    oc -n test-priv-esc rsh deployment/privescalation
    /usr/bin/bashwrap
    sleep 288
    ```

  - Back in the node shell, check the process owner:

    ```
    ps -ef | grep "sleep 288" | grep -v grep
    root 1352231 1352181 0 07:31 pts/0 00:00:00 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 288
    ```

    NOTE: As you can see, the process is running as root.
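An alternative to ps when checking a process's identity is procfs: the Uid line in /proc/&lt;pid&gt;/status lists the real, effective, saved and filesystem uids. A minimal sketch using our own shell's pid as a stand-in for the sleep process:

```shell
# For the escalated process on the node, all four uid fields read 0
grep '^Uid:' /proc/$$/status
```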
- At this point we have demonstrated how setuid binaries can be abused. Now let's see what we can do on the node to detect when this happens.

- We will use auditd rules to monitor when a process performs privilege escalation by changing from a non-root uid to root. The audit rule we will use is the following one:

  NOTE: The rule below only matches 64-bit syscalls; to catch the 32-bit ones as well, add an equivalent rule with arch=b32.

  ```
  -a always,exit -F arch=b64 -S execve -C uid!=euid -F euid=0 -k setuid-abuse
  ```
- To deliver this rule to our worker nodes we will create the following MachineConfig:

  ```
  cat <<EOF | oc create -f -
  apiVersion: machineconfiguration.openshift.io/v1
  kind: MachineConfig
  metadata:
    labels:
      machineconfiguration.openshift.io/role: worker
    name: 99-auditd-setuid-rule
  spec:
    config:
      ignition:
        version: 3.1.0
      storage:
        files:
        - contents:
            source: data:text/plain;charset=utf-8;base64,LWEgYWx3YXlzLGV4aXQgLUYgYXJjaD1iNjQgLVMgZXhlY3ZlIC1DIHVpZCE9ZXVpZCAtRiBldWlkPTAgLWsgc2V0dWlkLWFidXNlCg==
          mode: 420
          overwrite: true
          path: /etc/audit/rules.d/setuid-abuse.rules
  EOF
  ```
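The MachineConfig carries the audit rule base64-encoded in the Ignition data URL; it's worth sanity-checking that the payload matches the rule before applying it:

```shell
# Decode the Ignition payload and confirm it is the expected audit rule
echo 'LWEgYWx3YXlzLGV4aXQgLUYgYXJjaD1iNjQgLVMgZXhlY3ZlIC1DIHVpZCE9ZXVpZCAtRiBldWlkPTAgLWsgc2V0dWlkLWFidXNlCg==' | base64 -d
# -a always,exit -F arch=b64 -S execve -C uid!=euid -F euid=0 -k setuid-abuse
```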
- After our nodes restart, the audit rules will be in place. If we repeat the same operation in our application pod, we will see entries like this in the audit log:

  NOTE: The command below should be run on the node (you can use oc debug node as we did before).

  ```
  grep setuid-abuse /var/log/audit/audit.log
  type=SYSCALL msg=audit(1637601395.568:124): arch=c000003e syscall=59 success=yes exit=0 a0=4006b0 a1=7ffee2255690 a2=7ffee2255818 a3=1 items=2 ppid=25611 pid=28357 auid=4294967295 uid=1000680000 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=4294967295 comm="bash" exe="/usr/bin/bash" subj=system_u:system_r:container_t:s0:c15,c26 key="setuid-abuse" ARCH=x86_64 SYSCALL=execve AUID="unset" UID="unknown(1000680000)" GID="root" EUID="root" SUID="root" FSUID="root" EGID="root" SGID="root" FSGID="root"
  ```
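With entries like that in the log, a little shell is enough to pull out the interesting fields, for example the uid that performed the escalation. A sketch over a shortened copy of the sample record above:

```shell
# Shortened SYSCALL record taken from the demo output
line='type=SYSCALL msg=audit(1637601395.568:124): ppid=25611 pid=28357 auid=4294967295 uid=1000680000 gid=0 euid=0 comm="bash" exe="/usr/bin/bash" key="setuid-abuse"'
# Extract the real uid; " uid=" (with a leading space) avoids matching auid=
echo "$line" | sed -n 's/.* uid=\([0-9]*\).*/\1/p'
# 1000680000
```

In a real pipeline, `ausearch -k setuid-abuse -i` gives you the same records with uids and syscall numbers already interpreted.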
- The audit log gives us several fields (uid, euid, comm, exe, SELinux context) that we can use to identify the container performing the privilege escalation.

- We could forward this audit log to a SIEM system and create alerts based on these audit rules.