diff --git a/Gemfile.lock b/Gemfile.lock
index 5e4dac5..fd137bb 100644
--- a/Gemfile.lock
+++ b/Gemfile.lock
@@ -67,6 +67,7 @@ GEM
     webrick (1.7.0)

 PLATFORMS
+  arm64-darwin-22
   x86_64-darwin-19
   x86_64-darwin-21

diff --git a/_config.yml b/_config.yml
index fc253ab..4f0402d 100644
--- a/_config.yml
+++ b/_config.yml
@@ -8,7 +8,7 @@
 # For technical reasons, this file is *NOT* reloaded automatically when you use
 # 'bundle exec jekyll serve'. If you change this file, please restart the server process.
 #
-# If you need help with YAML syntax, here are some quick references for you: 
+# If you need help with YAML syntax, here are some quick references for you:
 #   https://learn-the-web.algonquindesign.ca/topics/markdown-yaml-cheat-sheet/#yaml
 #   https://learnxinyminutes.com/docs/yaml/
 #
@@ -37,6 +37,8 @@ plugins:
 # Hide future dated posts from live blog
 future: false

+markdown: kramdown
+
 # Exclude from processing.
 # The following items will not be processed, by default.
 # Any item listed under the `exclude:` key here will be automatically added to
diff --git a/_includes/head.html b/_includes/head.html
new file mode 100644
index 0000000..680c806
--- /dev/null
+++ b/_includes/head.html
@@ -0,0 +1,15 @@
+
+
+
+
+  {%- seo -%}
+
+  {%- feed_meta -%}
+  {%- if jekyll.environment == 'production' and site.google_analytics -%}
+    {%- include google-analytics.html -%}
+  {%- endif -%}
+
+  {% include mathjax.html %}
+
+
+
diff --git a/_includes/mathjax.html b/_includes/mathjax.html
new file mode 100644
index 0000000..c72c539
--- /dev/null
+++ b/_includes/mathjax.html
@@ -0,0 +1,8 @@
+
diff --git a/_posts/2023-08-31-using-kani-to-validate-security-boundaries-in-aws-firecracker.md b/_posts/2023-08-31-using-kani-to-validate-security-boundaries-in-aws-firecracker.md
new file mode 100644
index 0000000..67c8399
--- /dev/null
+++ b/_posts/2023-08-31-using-kani-to-validate-security-boundaries-in-aws-firecracker.md
@@ -0,0 +1,288 @@
+---
+layout: post
+title: Using Kani to Validate Security Boundaries in AWS Firecracker
+---
+
+Security assurance is paramount for any system running in the cloud. To achieve the highest levels of security, we have applied the [Kani model checker](https://github.com/model-checking/kani) to verify safety-critical properties in core components of the [Firecracker Virtual Machine Monitor](https://firecracker-microvm.github.io/) using mathematical logic.
+
+Firecracker is an open source project written in Rust that uses the Linux Kernel-based Virtual Machine (KVM) to create and manage microVMs. Firecracker has a minimalist design that enables fast (~150ms) microVM start-up times, secure multi-tenancy of microVMs on the same host, and memory/CPU over-subscription. Firecracker is currently used in production by AWS Lambda, AWS Fargate, and parts of AWS Analytics to build their service platforms.
+
+For the past 7 months, [Felipe Monteiro](https://feliperodri.github.io/), an Applied Scientist on the Kani team, and [Patrick Roy](https://uk.linkedin.com/in/patrick-roy-31929323a), a Software Development Engineer on the AWS Firecracker team, collaborated to develop Kani harnesses for Firecracker.
+As a result of this collaboration, the Firecracker team is now running 27 Kani harnesses across 3 verification suites in their continuous integration pipelines (taking approximately 15 minutes to complete), ensuring that all checked properties of critical systems are upheld on every code change.
+
+In this blog post, we show how Kani helped Firecracker harden two core components, namely our I/O rate limiter and our I/O transport layer (VirtIO), and present the issues we were able to identify and fix. In particular, the second part of this post picks up from a [previous Kani/Firecracker blog post](https://model-checking.github.io/kani-verifier-blog/2022/07/13/using-the-kani-rust-verifier-on-a-firecracker-example.html) and shows how improvements to Kani over the last year made verifying conformance with a section of the VirtIO specification feasible.
+
+## Noisy-Neighbor Mitigations via Rate Limiting
+
+In multi-tenant systems, microVMs from different customers simultaneously co-exist on the same physical host. We thus need to ensure that access to host resources, such as disk and network, is shared fairly. We should not allow a single “greedy” microVM to transfer excessive amounts of data from disk to the point where other microVMs’ disk access gets starved (a “noisy neighbor” scenario). Firecracker offers a mitigation for this via *I/O rate limiting*. From the [documentation](https://github.com/firecracker-microvm/firecracker/blob/4a3e9bd76d9fc57a3538c1aeb7e5687de43a0efa/docs/design.md#io-storage-networking-and-rate-limiting):
+
+> Firecracker provides VirtIO/block and VirtIO/net emulated devices, along with the application of rate limiters to each volume and network interface to make sure host hardware resources are used fairly by multiple microVMs. These are implemented using a token bucket algorithm [...]
+
+In a token-bucket-based rate limiter, each microVM has a budget of “tokens” that can be exchanged for permission to do one byte of I/O.
+These tokens regenerate at a fixed rate, and if the microVM runs out of tokens, it gets I/O-throttled. This process of draining and replenishing is best visualized by an actual bucket into which water drips at a fixed rate, and from which water can be extracted at some limited rate:
+
+
+
+The property we want to verify is that a microVM is not allowed to exceed the configured maximum I/O throughput rate. For a virtual block device rate-limited at 1GB/s, we want to prove that in any one-second interval, at most 1GB of data is allowed to pass through the device.
+
+What sounds simple in theory is actually fairly difficult to implement. For example, due to a [rounding error](https://github.com/firecracker-microvm/firecracker/pull/3706), a guest could, in some scenarios, do up to 0.01% more I/O than configured. We discovered this bug thanks to a Kani harness for the throughput property stated above, and this harness is the main focus of the rest of this section.
+
+### Teaching Kani about Time
+
+The core component of our rate-limiting implementation is a `TokenBucket`. In Firecracker, we define it as
+
+```rs
+pub struct TokenBucket {
+    // Maximal number of tokens this bucket can hold.
+    size: u64,
+
+    // Complete refill time in milliseconds.
+    refill_time: u64,
+
+    // Current token budget.
+    budget: u64,
+
+    // Last time this token bucket was replenished.
+    last_update: Instant,
+
+    // -- snip --
+}
+```
+
+It offers an `auto_replenish` function which computes how many tokens the leaky bucket algorithm should have generated since `last_update` (and then updates `last_update` accordingly). This function will be the target of our verification.
+
+A `TokenBucket` is inherently tied to time-related APIs such as `std::time::Instant`, for which Kani does not have built-in support. This means Kani is not able to reason about `TokenBucket`s out of the box.
+To solve this problem, we use Kani’s [stubbing](https://model-checking.github.io/kani-verifier-blog/2023/02/28/kani-internship-projects-2022-stubbing.html) feature to provide a model for the `Instant::now` function. Since Firecracker uses a monotonic clock for its rate limiting, this stub needs to return non-deterministic, monotonically non-decreasing instants.
+
+However, when trying to stub `now`, one quickly notices that `Instant` does not offer any constructors for creating an instance from, say, a Unix timestamp. In fact, it is impossible to construct an `Instant` outside of the standard library, as its fields are private. In such a situation, the solution is often to go down the call stack of the function that you want to stub, to see if any of the functions further down can be stubbed out instead to achieve the desired effect. In our case, `now` calls functions in (private) OS-specific time modules, until it bottoms out at [`libc::clock_gettime`](https://www.gnu.org/software/libc/manual/html_node/Getting-the-Time.html#index-clock_005fgettime).
+
+The `clock_gettime` function is passed a pointer to a `libc::timespec` structure, and the `tv_sec` and `tv_nsec` members of this structure are later used to construct the `Instant` returned by `Instant::now`. Therefore, we can use the following stub to achieve our goal of getting non-deterministic, monotonically non-decreasing `Instant`s:
+
+```rs
+mod stubs {
+    static mut LAST_SECONDS: i64 = 0;
+    static mut LAST_NANOS: i64 = 0;
+
+    const NANOS_PER_SECOND: i64 = 1_000_000_000;
+
+    pub unsafe extern "C" fn clock_gettime(_clock_id: libc::clockid_t, tp: *mut libc::timespec) -> libc::c_int {
+        unsafe {
+            // kani::any_where provides us with a non-deterministic number of seconds
+            // that is at least equal to LAST_SECONDS (to ensure that time only
+            // progresses forward).
+            let next_seconds = kani::any_where(|&n| n >= unsafe { LAST_SECONDS });
+            let next_nanos = kani::any_where(|&n| n >= 0 && n < NANOS_PER_SECOND);
+
+            if next_seconds == LAST_SECONDS {
+                kani::assume(next_nanos >= LAST_NANOS);
+            }
+
+            (*tp).tv_sec = LAST_SECONDS;
+            (*tp).tv_nsec = LAST_NANOS;
+
+            LAST_SECONDS = next_seconds;
+            LAST_NANOS = next_nanos;
+        }
+
+        0
+    }
+}
+```
+
+Note how the first invocation of this stub will always set `tv_sec = tv_nsec = 0`, as this is what the statics are initialized to. This is an optimization: the rate limiter only cares about the delta between two instants, which will be non-deterministic as long as one of the two instants is non-deterministic. **In order to keep Kani performant, it is important to minimize the number of non-deterministic values, especially if multiplication and division are involved.**
+
+Using this stub, we can start writing a harness for `auto_replenish` such as
+
+```rs
+#[kani::proof]
+#[kani::unwind(1)] // Enough to unwind the recursion at `Timespec::sub_timespec`.
+#[kani::stub(libc::clock_gettime, stubs::clock_gettime)]
+fn verify_token_bucket_auto_replenish() {
+    // Initialize a non-deterministic `TokenBucket` object.
+    let mut bucket: TokenBucket = kani::any();
+
+    bucket.auto_replenish();
+
+    // is_valid() performs sanity checks such as "budget <= size".
+    // It is the data structure invariant of `TokenBucket`.
+    assert!(bucket.is_valid());
+}
+```
+
+Let us now see how we can extend this harness to verify that our rate limiter replenishes tokens at exactly the requested rate.
+
+### Verifying our Noisy-Neighbor Mitigation
+
+Our noisy-neighbor mitigation is correct if every call to `auto_replenish` generates the “correct” number of tokens, meaning it is impossible for a guest to do more I/O than configured.
+Formally, this means
+
+$$0 \leq \left(now - last{\_}update\right) - \left( new{\_}tokens \times \left(\frac{refill{\_}time}{size}\right) \right) < \left(\frac{refill{\_}time}{size}\right)$$
+
+Here, $new\\_tokens$ is the number of tokens that `auto_replenish` generated. The fraction $\left(\frac{refill\\_time}{size}\right)$ is simply the time it takes to generate a single token. Thus, the property states that if we compute the time it should have taken to generate $new\\_tokens$ and subtract it from the time that actually passed, we are left with an amount of time less than what it would take to generate an additional token: we replenished the maximal number of tokens possible.
+
+The difficulty of implementing a correct rate limiter lies in dealing with “leftover” time: if enough time passed to generate “1.8 tokens”, what does Firecracker do with the “0.8” tokens it cannot (as everything is integer-valued) add to the budget? Originally, the rate limiter simply dropped them: if you called `auto_replenish` at an inopportune time, the “0.8” would not be carried forward, and the guest essentially “lost” part of its I/O allowance to rounding. Then, with [#3370](https://github.com/firecracker-microvm/firecracker/pull/3370), we decided to fix this by only advancing $last\\_update$ by $new\\_tokens \times \left(\frac{refill\\_time}{size}\right)$ instead of setting it to `now`. This way, the fractional tokens are carried forward, and we even hand-wrote a [proof](https://github.com/firecracker-microvm/firecracker/pull/3370#pullrequestreview-1252110534) to check that $last\\_update$ and the actual system time will not diverge, boldly concluding
+
+> This means that $last\\_updated$ indeed does not fall behind more than the execution time of `auto_replenish` plus a constant dependent on the bucket configuration.
+
+Here, the “constant dependent on the bucket configuration” was $\left(\frac{refill\\_time}{size}\right)$, rounded down.
+This indeed implies the property specified above, so when we revisited `auto_replenish` a few months later and added the following two `debug_assert!`s derived from our formal property,
+
+```rs
+// time_adjustment = tokens * (refill_time / size)
+debug_assert!((now - last_update) >= time_adjustment);
+// inequality slightly rewritten to avoid division
+debug_assert!((now - last_update - time_adjustment) * size < refill_time);
+```
+
+we expected the verification to succeed. However, Kani presented us with the "**VERIFICATION FAILED**" message, which was unexpected, to say the least.
+
+So what went wrong? In the hand-written proof, the error was assuming that $-\lfloor -x \rfloor = \lfloor x \rfloor$ (had this step been done correctly, the bound would have been $\left(\frac{refill\\_time}{size}\right)$ rounded *up*, which obviously allows for violations). To see how our code actually violates the property, we need to look at how the relevant part of `auto_replenish` was implemented:
+
+```rs
+let time_delta = self.last_update.elapsed().as_nanos() as u64;
+
+// tokens = time_delta / (refill_time / size) rewritten to not run into
+// integer division issues.
+let tokens = (time_delta * self.size) / self.refill_time;
+let time_adjustment = (tokens * self.refill_time) / self.size;
+
+self.last_update += Duration::from_nanos(time_adjustment);
+```
+
+The issue lies in the way we compute `time_adjustment`: consider a bucket of size 2 with a refill time of 3ns, and assume a time delta of 11ns. We compute $\lfloor 11 \times 2/3 \rfloor = 7$ tokens, and then a time adjustment of $\lfloor 7 \times 3/2 \rfloor = 10ns$. However, 10ns is only enough to replenish $\lfloor 10 \times 2/3 \rfloor = 6$ tokens! The problem is that 7 tokens do not take an integer number of nanoseconds to replenish: they take 10.5ns. The integer division rounds this down, and thus the guest essentially gets to use those 0.5ns twice.
+Assuming the guest can time its requests with nanosecond precision, and the rate limiter is configured such that $\left(\frac{refill\\_time}{size}\right)$ is not an integer, the guest could theoretically cause these fractional nanoseconds to accumulate to get an extra token every $10^{6} \times \left(\frac{refill\\_time}{size}\right) \times max\left(1, \left(\frac{refill\\_time}{size}\right)\right)$ nanoseconds. **For a rate limiter configured at 1GB/s, this would be an excess of 1KB/s.**
+
+The fix was to round up instead of down in our computation of `time_adjustment`. For the complete code listing of the rate limiter harnesses, see [here](https://github.com/firecracker-microvm/firecracker/blob/1a2c6ada116b52df891857d3e82503ad1ef845e5/src/vmm/src/rate_limiter/mod.rs#L525).
+
+## Conformance to the VirtIO Specification
+
+Firecracker is a para-virtualization solution, meaning the guest is aware that it is running inside a virtual machine. This allows host and guest to collaborate on I/O, as opposed to the host having to do all the heavy lifting of emulating physical devices. Firecracker uses [VirtIO](https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.pdf) as the transport-layer protocol of its paravirtualized device stack. It allows the guest and host to exchange messages via pairs of ring buffers called a *queue*. At a high level, the guest puts requests into a shared array (the “descriptor table”) and writes the descriptor table index at which the host can find the new request into the request ring (the “avail ring” in VirtIO lingo). It then notifies the host via interrupt that a new request is available for processing. The host processes the request, updating the descriptor table entry with its response, and, upon finishing, writes the descriptor table index into a response ring (the “used ring”). It then notifies the guest that processing of the request has finished.
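The request/response flow described above can be sketched with a simplified, hypothetical model of such a queue. This is an illustration only: the names (`Descriptor`, `Queue`, `process_avail`) are ours and deliberately ignore details like shared-memory access, wrapping ring indices, and descriptor chaining that the real implementation has to handle.

```rust
/// One entry in the shared "descriptor table": where a request buffer lives
/// in guest memory and how large it is. (Simplified model, not
/// Firecracker's actual types.)
#[derive(Clone, Copy)]
struct Descriptor {
    addr: u64, // guest-physical address of the request buffer
    len: u32,  // length of the buffer in bytes
}

struct Queue {
    descriptors: Vec<Descriptor>, // the shared descriptor table
    avail: Vec<u16>,              // guest -> host: indices of ready requests
    used: Vec<(u16, u32)>,        // host -> guest: (index, bytes processed)
}

impl Queue {
    /// Host-side processing: drain the avail ring, handle each request, and
    /// publish the results in the used ring. Since the guest controls the
    /// ring contents, every index must be validated before use.
    fn process_avail(&mut self, mut handle: impl FnMut(Descriptor) -> u32) {
        for idx in std::mem::take(&mut self.avail) {
            // An out-of-bounds index from a malicious guest is ignored
            // rather than allowed to cause an out-of-bounds access.
            if let Some(&desc) = self.descriptors.get(idx as usize) {
                let bytes_processed = handle(desc);
                self.used.push((idx, bytes_processed));
            }
        }
    }
}
```

Because the descriptor table and both rings live in guest-writable shared memory, the validation step in the loop is security-critical rather than mere defensive programming, and this is exactly the kind of property a Kani harness can check against arbitrary memory contents.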
+The Firecracker side of this queue implementation sits right at the intersection between guest and host. According to Firecracker’s [threat model](https://github.com/firecracker-microvm/firecracker/blob/main/docs/design.md#threat-containment):
+
+> From a security perspective, all vCPU threads are considered to be running untrusted code as soon as they have been started; these untrusted threads need to be contained.
+
+The entirety of the VirtIO queue lives in shared memory and can thus be written to by the vCPU threads. Therefore, Firecracker cannot make any assumptions about its contents. In particular, it needs to operate securely no matter the memory content. For anyone who has worked with Kani before, this calls for a generous application of `kani::vec::exact_vec`, which generates a fixed-size vector filled with arbitrary values. We can set up an area of non-deterministic guest memory as follows:
+
+```rs
+fn arbitrary_guest_memory() -> GuestMemoryMmap {
+    // We need ManuallyDrop to "leak" the memory area to ensure it lives for
+    // the entire duration of the proof.
+    let memory = ManuallyDrop::new(kani::vec::exact_vec::