HDDS-11773. Prevent frequent DataNode Ratis snapshotting. #7473

Merged on Nov 26, 2024 (3 commits)

jojochuang

What changes were proposed in this pull request?

HDDS-11773. Bump hdds.ratis.snapshot.threshold and hdds.container.ratis.statemachine.max.pending.apply-transactions to 100k

Please describe your PR in detail:

  • The DataNode is configured to take a Ratis snapshot every 10k transactions, which is too frequent for HBase workloads, where a DataNode can process thousands, and eventually tens of thousands, of transactions per second.
  • Bump hdds.ratis.snapshot.threshold to 100k for now. Update hdds.container.ratis.statemachine.max.pending.apply-transactions to the same value to keep the two consistent.
  • We should revisit this and eventually raise it to 1M.
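
For a rough sense of scale, the time between snapshots is approximately the threshold divided by the transaction apply rate. A minimal Python sketch, assuming a steady rate; the function name and the sample rates are illustrative assumptions, not Ozone APIs or measurements:

```python
def snapshot_interval_seconds(threshold: int, tx_per_second: float) -> float:
    """Seconds between snapshots when one is taken every `threshold` applied transactions."""
    if tx_per_second <= 0:
        raise ValueError("tx_per_second must be positive")
    return threshold / tx_per_second


# Illustrative rates only (assumed, not measured on a real cluster).
for rate in (2_000, 20_000):
    for threshold in (10_000, 100_000, 1_000_000):
        interval = snapshot_interval_seconds(threshold, rate)
        print(f"{rate:>6} tx/s, threshold {threshold:>9}: {interval:7.1f} s")
```

At an assumed 2,000 transactions per second this gives 5 s at a 10k threshold and 50 s at 100k.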

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11773

How was this patch tested?

Applied the change to an HBase cluster. Previously it was snapshotting every 4-5 seconds; now it snapshots about once a minute.

Change-Id: I2baf863c537cc3f5b0e2905c2fb1ca88d05c0ff2
@jojochuang changed the title from "HDDS-11773. Frequent DataNode Ratis snapshotting." to "HDDS-11773. Prevent frequent DataNode Ratis snapshotting." on Nov 22, 2024

@smengcl smengcl left a comment

Thanks @jojochuang.

But this also changes the OM's default. Could that be a concern if the Ratis snapshot interval is set too high for followers to catch up, causing OM bootstrapping to fail? Is there an existing mechanism to tune this for DataNodes only?

What do you think? @szetszwo

@jojochuang

Isn't it DataNode-only?
SCM uses ozone.scm.ha.ratis.snapshot.threshold.
I guess OM uses the default value, which is 400000.

@smengcl
smengcl commented Nov 23, 2024

> isn't it DataNode only? SCM uses ozone.scm.ha.ratis.snapshot.threshold. I guess OM uses the default value which is 400000.

Right. Please amend the config tag.

@jojochuang

Taking the snapshot itself takes about as long as before, 1-3 ms.

2024-11-22 22:31:12,432 INFO [e693615a-d484-4165-8446-dff08cac5978@group-D595E2D0A206-StateMachineUpdater]-org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine: group-D595E2D0A206: Taking a snapshot at:(t:179, i:7007978) file /var/lib/hadoop-ozone/datanode/ratis/data/c420af11-2786-4f5a-9b5a-d595e2d0a206/sm/snapshot.179_7007978
2024-11-22 22:31:12,434 INFO [e693615a-d484-4165-8446-dff08cac5978@group-D595E2D0A206-StateMachineUpdater]-org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine: group-D595E2D0A206: Finished taking a snapshot at:(t:179, i:7007978) file:/var/lib/hadoop-ozone/datanode/ratis/data/c420af11-2786-4f5a-9b5a-d595e2d0a206/sm/snapshot.179_7007978 took: 1 ms
2024-11-22 22:32:46,285 INFO [e693615a-d484-4165-8446-dff08cac5978@group-D595E2D0A206-StateMachineUpdater]-org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine: group-D595E2D0A206: Taking a snapshot at:(t:179, i:7107980) file /var/lib/hadoop-ozone/datanode/ratis/data/c420af11-2786-4f5a-9b5a-d595e2d0a206/sm/snapshot.179_7107980
2024-11-22 22:32:46,287 INFO [e693615a-d484-4165-8446-dff08cac5978@group-D595E2D0A206-StateMachineUpdater]-org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine: group-D595E2D0A206: Finished taking a snapshot at:(t:179, i:7107980) file:/var/lib/hadoop-ozone/datanode/ratis/data/c420af11-2786-4f5a-9b5a-d595e2d0a206/sm/snapshot.179_7107980 took: 2 ms
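
As a sanity check, the two "Taking a snapshot" entries above can be used to estimate the snapshot interval and the apply rate between them. A small Python sketch; the log lines are abbreviated to the fields it needs, and the helper names are made up for illustration:

```python
import re
from datetime import datetime

# Abbreviated copies of the two "Taking a snapshot" entries from the log above;
# "..." stands in for the thread and class names, which are not needed here.
LOG = """\
2024-11-22 22:31:12,432 INFO ... Taking a snapshot at:(t:179, i:7007978)
2024-11-22 22:32:46,285 INFO ... Taking a snapshot at:(t:179, i:7107980)
"""

# Capital "T" keeps "Finished taking a snapshot" lines from matching.
PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*"
    r"Taking a snapshot at:\(t:(\d+), i:(\d+)\)"
)


def parse_snapshots(text):
    """Return (timestamp, log index) pairs for each snapshot start in the text."""
    events = []
    for line in text.splitlines():
        match = PATTERN.match(line)
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            events.append((ts, int(match.group(3))))
    return events


(first_ts, first_idx), (second_ts, second_idx) = parse_snapshots(LOG)
elapsed = (second_ts - first_ts).total_seconds()
applied = second_idx - first_idx
print(f"{applied} transactions in {elapsed:.3f} s, about {applied / elapsed:.0f} tx/s")
```

For these two entries the index advances by 100002 over about 94 seconds, roughly 1,000 transactions per second.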

@smengcl smengcl left a comment

Thanks @jojochuang

@jojochuang jojochuang merged commit f0a2c87 into apache:master Nov 26, 2024
34 checks passed
@jojochuang
Thanks for the review, @smengcl
