Refactoring: Move partition snapshotting to PPM #2303
Conversation
Force-pushed from 17e437f to a39b4bd
Force-pushed from 0a64a61 to 494bccc
    partition_id: PartitionId,
    archived_lsn: Lsn,
) -> anyhow::Result<()> {
    let mut guard = self.lookup.lock().await;
As above.
Force-pushed from 494bccc to c3f448b
Thanks for creating this PR @pcholakov. I think it already looks quite good. I've left a question about how blocking the export of a column family snapshot is and whether it needs to run outside of the Tokio threads. I also left a suggestion for how you could handle the error cases and respond to the caller in a more streamlined way.
crates/worker/src/partition/mod.rs
Outdated
- PartitionProcessorControlCommand::RunForLeader(leader_epoch) => {
+ RunForLeader(leader_epoch) => {
Personal taste: I do prefer the enum type being visible here because it tells me on this line where it is coming from.
Ack, I won't do this :-)
let snapshot_id = SnapshotId::new();
let snapshot = self
    .partition_store_manager
    .export_partition_snapshot(self.partition_id, snapshot_id, self.snapshot_base_path)
How much of a blocking operation is this one? Like how much does snapshotting a CF in RocksDB block?
Answered above - the only part that's blocking is already offloaded as a low-priority StorageTask further down in the RocksDb impl layer.
Force-pushed from e6ffde8 to 693c174
Thanks for updating this PR @pcholakov. I think it looks really good. The one thing that wasn't fully clear to me is the need for the explicit archived LSN watch. I think we can remove it and get the information from the returned PartitionSnapshotMetadata.
let mut partition_store = self
    .lookup
    .lock()
    .await
    .live
    .get_mut(&partition_id)
Could this be simplified via calling self.get_partition_store()?
Definitely! I completely missed that.
Thanks for your feedback, @tillrohrmann! I believe all the concerns you raised are addressed in the latest revision.
warn!(
    partition_id = %self.partition_id,
    "Failed to create partition snapshot: {}",
    err
);
Generally, this should be IO errors related to exporting the snapshot itself (something that bubbles up from RocksDB's export_column_family) or ancillaries (writing the metadata JSON header to disk, or in the future, uploading to the object store). Do you think we could do something more besides logging it? We're also responding to the caller, so I think it would generally be their responsibility to redrive and/or raise a flare.
I can also see introducing some metrics around snapshotting in the future - successes/errors/bytes uploaded. Maybe save that for when I introduce the object store integration?
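As a rough sketch of what such metrics could look like - the metric names and the use of the metrics crate are assumptions for illustration, not something this PR implements:

use metrics::counter;

// Hypothetical helper: count snapshot outcomes per partition so failures can be alerted on.
fn record_snapshot_outcome(partition_id: u64, success: bool) {
    let status = if success { "success" } else { "failure" };
    counter!(
        "partition_snapshots_total",
        "partition_id" => partition_id.to_string(),
        "status" => status
    )
    .increment(1);
}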
crates/worker/src/partition_processor_manager/processor_state.rs
Outdated
@@ -656,7 +696,7 @@ impl PartitionProcessorManager {
        }
    }

-    fn request_partition_snapshots(&mut self) {
+    fn trigger_periodic_partition_snapshots(&mut self) {
I had missed this in the initial revision, added now! (Line 748.)
Thanks for updating this PR @pcholakov. I think we are very close to merging it :-) I left a few comments/questions.
if self.pending_snapshots.contains_key(&partition_id) {
    warn!(%partition_id, "Partition processor stopped while snapshot task is still pending.");
}
The partition processor didn't stop yet. It is stopping. Once it has stopped, on_asynchronous_event will be called with EventKind::Stopped.
Updated the wording to be more precise. Just to test my understanding: while we've already called processor.cancel() at this point, that doesn't necessarily mean the cancellation token's effect has propagated yet?
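To illustrate the semantics in question - Restate uses its own task-center cancellation, but the behaviour mirrors tokio_util's CancellationToken shown in this sketch: returning from cancel() only requests cancellation; the task observes it at its next await on the token.

use std::time::Duration;
use tokio_util::sync::CancellationToken;

#[tokio::main]
async fn main() {
    let token = CancellationToken::new();
    let child = token.child_token();

    let task = tokio::spawn(async move {
        tokio::select! {
            // Cancellation is only observed at this await point...
            _ = child.cancelled() => println!("observed cancellation"),
            _ = tokio::time::sleep(Duration::from_secs(60)) => println!("finished work"),
        }
    });

    // ...so the task may still be running for a short while after cancel() returns.
    token.cancel();
    task.await.unwrap();
}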
if let Some(pending) = self.pending_snapshots.remove(&partition_id) {
    let _ = pending.sender.send(response);
} else {
    error!("Snapshot task result received, but there was no pending sender found!")
error is usually used for situations where the cluster is at risk of failing. Maybe a lower log level is ok for this situation (even though I currently can't see how senders would disappear from pending_snapshots, so it probably shouldn't happen).
I went back and forth on this. It's definitely not critical to system stability, so I'll update this back to warn, but it does indicate a potential bug.
TaskKind::PartitionSnapshotProducer,
"create-snapshot",
Some(partition_id),
async move { create_snapshot_task.run().await }.instrument(snapshot_span),
nit: Instrumenting run via #[instrument()] would have the benefit that it's clearer what's being emitted via this span (snapshot_id and partition_id). Right now, if one touches the SnapshotPartitionTask, one needs to be aware of this detail (someone else attaches a span) that lives in a different file.
Huge improvement, thank you!
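A sketch of what this looks like with tracing's #[instrument] attribute; the struct fields are simplified placeholders (the real task uses SnapshotId and PartitionId).

use tracing::instrument;

struct SnapshotPartitionTask {
    snapshot_id: String, // simplified placeholder types
    partition_id: u64,
}

impl SnapshotPartitionTask {
    // The span is now defined next to the code it describes, and the fields it
    // emits (snapshot_id, partition_id) are visible at the definition site.
    #[instrument(level = "debug", skip(self), fields(snapshot_id = %self.snapshot_id, partition_id = %self.partition_id))]
    async fn run(self) {
        // ... export the snapshot ...
    }
}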
/// The Partition Processor is in a state in which it is acceptable to create and publish
/// snapshots. Since we don't want newer snapshots to move backwards in applied LSN, the current
/// implementation checks whether the processor is fully caught up with the log.
From a correctness POV, it wouldn't be a problem if newer snapshots went backwards in LSN, right?
No, definitely not! It's more a question of how aggressively we want to trim - if we wanted to leave precisely one snapshot in the repository and trim the log to its LSN, then we should make a much greater effort to ensure we don't move backwards. But I suspect that once you put a snapshot in the repository, it will probably stay there on the order of days to months before it gets pruned.
With some upcoming changes it will become cheap to check the latest snapshot LSN in the repository, so we'll easily be able to add a condition on the producer not to publish lower-LSN snapshots in the very near future, if we want this to be strictly monotonic.
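The guard described here could be as simple as the following sketch; the function name is hypothetical and LSNs are shown as plain u64 values.

// Skip publishing a snapshot whose applied LSN is not ahead of the newest
// snapshot already present in the repository.
fn should_publish(new_snapshot_lsn: u64, latest_repository_lsn: Option<u64>) -> bool {
    match latest_repository_lsn {
        Some(latest) => new_snapshot_lsn > latest,
        None => true, // empty repository: always publish
    }
}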
@@ -656,7 +696,7 @@ impl PartitionProcessorManager {
        }
    }

-    fn request_partition_snapshots(&mut self) {
+    fn trigger_periodic_partition_snapshots(&mut self) {
Is there a reason why trigger_periodic_partition_snapshots sends a ProcessorsManagerCommand::CreateSnapshot instead of directly calling spawn_create_snapshot_task?
let _ = self.result_sender.send(match result {
    Ok(metadata) => {
        debug!(
            archived_lsn = %metadata.min_applied_lsn,
            "Partition snapshot created"
        );
        Ok(metadata)
    }
    Err(err) => {
        warn!("Failed to create partition snapshot: {}", err);
        Err(err)
    }
});
Is it possible to directly return result? Then we don't have to introduce the oneshot that one needs to keep track of.
Yes! This is much cleaner; I tried this in an earlier iteration and hit some channel ownership issues which are now gone by virtue of other simplifications.
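Roughly, the resulting shape - types are simplified here to keep the sketch self-contained: the task logs the outcome and returns the Result directly, and the caller decides what to do with it.

async fn run_snapshot_task() -> Result<u64, String> {
    let result = do_export().await;
    match &result {
        Ok(lsn) => tracing::debug!(archived_lsn = %lsn, "Partition snapshot created"),
        Err(err) => tracing::warn!("Failed to create partition snapshot: {}", err),
    }
    // No oneshot needed inside the task: the result is simply returned.
    result
}

async fn do_export() -> Result<u64, String> {
    // placeholder for the actual export; returns the snapshot's applied LSN
    Ok(42)
}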
Some(result) = self.snapshot_export_tasks.next() => {
    if let Ok(result) = result {
        self.on_create_snapshot_task_completed(result);
    } else {
        debug!("Create snapshot task failed: {}", result.unwrap_err()); // shutting down
    }
}
I like polling TaskHandle a bit more than introducing a new indirection via the oneshots. One advantage of polling the TaskHandle is that one would also see panics that might crash the task; the oneshot would be closed in this case.
I can see how that is an improvement! I've switched to polling a FuturesUnordered<TaskHandle<_>> now :-)
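For readers unfamiliar with the pattern, a minimal sketch using tokio's JoinHandle in place of Restate's internal TaskHandle: a panicking task surfaces as an error from the handle, whereas a oneshot would simply be dropped.

use futures::stream::{FuturesUnordered, StreamExt};
use tokio::task::JoinHandle;

async fn do_snapshot(partition_id: u64) -> Result<u64, String> {
    // placeholder export returning the snapshot's applied LSN
    Ok(partition_id * 100)
}

#[tokio::main]
async fn main() {
    let mut tasks: FuturesUnordered<JoinHandle<Result<u64, String>>> = FuturesUnordered::new();
    tasks.push(tokio::spawn(do_snapshot(1)));
    tasks.push(tokio::spawn(do_snapshot(2)));

    // Completed handles are yielded as they finish; a panic shows up as Err(JoinError).
    while let Some(joined) = tasks.next().await {
        match joined {
            Ok(result) => println!("snapshot task finished: {result:?}"),
            Err(join_err) => println!("snapshot task panicked or was aborted: {join_err}"),
        }
    }
}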
This change moves the responsibility for orchestrating snapshot creation to the PartitionProcessorManager, allowing the PartitionProcessor to be more focused on its core task of processing journal operations.
Force-pushed from 9f6d162 to d686e7e
Thanks for all the input, @tillrohrmann! :-) Ready for another pass whenever you get a chance.
Thanks for updating the PR @pcholakov. It looks really good. I have one question left, which is about the behavior of spawn_create_snapshot_task when we pass in None as the sender. Did I understand correctly that we won't remember a spawned snapshot task in this case? So if another snapshot request comes in while the previous task is still running, would we create another task?
pending_snapshots: HashMap<PartitionId, oneshot::Sender<SnapshotResult>>,
snapshot_export_tasks:
    FuturesUnordered<TaskHandle<Result<PartitionSnapshotMetadata, SnapshotError>>>,
nit: You could use SnapshotResultInternal here.
if let Ok(result) = result {
    self.on_create_snapshot_task_completed(result);
} else {
    debug!("Create snapshot task failed: {}", result.unwrap_err()); // shutting down
Suggested change:
- debug!("Create snapshot task failed: {}", result.unwrap_err()); // shutting down
+ debug!("Create snapshot task failed: {}", result.unwrap_err());
if let Some(sender) = sender {
    entry.insert(sender);
}
What if sender == None? Will this mean that we create a snapshot task but don't remember that it is in progress?
I now explicitly call with sender = None when requesting a snapshot from the PartitionProcessorManager. We still have a handle to the task in snapshot_export_tasks, but there's no one outside of the PPM to notify about the result. A oneshot channel is now only used to respond to a CreateSnapshot RPC.
Can't it then happen that we start multiple snapshot tasks for the same partition if snapshotting takes a long time, for example? I was under the impression that we wanted to allow only a single in-flight snapshot task per partition at any point in time to control the resource usage.
Doh, this was a silly bug to introduce at the last minute! Thanks for catching this.
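The invariant being re-established is sketched below; the names are hypothetical (the real code keys off pending_snapshots and snapshot_export_tasks), but the point is the same: at most one in-flight snapshot task per partition, regardless of whether a caller is waiting on the result.

use std::collections::HashSet;

struct SnapshotTracker {
    // stands in for tracking by PartitionId in the PartitionProcessorManager
    in_flight: HashSet<u64>,
}

impl SnapshotTracker {
    /// Returns true if a new snapshot task may be spawned for this partition.
    fn try_start(&mut self, partition_id: u64) -> bool {
        // insert() returns false when a snapshot for this partition is already running,
        // even if no external sender is registered for it.
        self.in_flight.insert(partition_id)
    }

    fn finish(&mut self, partition_id: u64) {
        self.in_flight.remove(&partition_id);
    }
}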
async fn create_snapshot_inner(
    snapshot_id: SnapshotId,
    partition_id: PartitionId,
    partition_store_manager: PartitionStoreManager,
    snapshot_base_path: PathBuf,
    cluster_name: String,
    node_name: String,
) -> Result<PartitionSnapshotMetadata, SnapshotError> {
Was there a reason to make this function not a method of SnapshotPartitionTask? If it were, then you wouldn't have to pass in all the parameters explicitly. Instead, it could accept self.
I think this is a leftover from the earlier channels-based communication, will revisit.
Moved this and the metadata-writing function to methods, and also got rid of some clones in the process! 👍
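The resulting shape is roughly the following; the field and helper names are assumptions based on the discussion above, not the merged code.

use std::path::PathBuf;

struct SnapshotPartitionTask {
    snapshot_id: String, // simplified placeholders for SnapshotId / PartitionId
    partition_id: u64,
    snapshot_base_path: PathBuf,
    cluster_name: String,
    node_name: String,
}

impl SnapshotPartitionTask {
    // A method borrowing self replaces the free function and its long parameter list.
    async fn create_snapshot_inner(&self) -> Result<(), String> {
        // ... export the column family, then write the metadata header next to it ...
        self.write_metadata().await
    }

    async fn write_metadata(&self) -> Result<(), String> {
        // placeholder for serializing the snapshot metadata to a JSON file on disk
        Ok(())
    }
}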
Thanks for putting up with all my requests @pcholakov. I think it looks really nice now. +1 for merging :-)
Not at all, @tillrohrmann - thanks for keeping the bar high and highlighting so many problems and improvements! 🔥
This change moves the responsibility for orchestrating snapshot creation to the PartitionProcessorManager, allowing the PartitionProcessor to be more focused on its core task of processing journal operations.
Based on: #2253