HDDS-11627. Support getBlock operation on short-circuit channel. #7456

Draft
wants to merge 1 commit into
base: HDDS-10685

ChenSammi
Contributor

@ChenSammi ChenSammi commented Nov 19, 2024

What changes were proposed in this pull request?

  1. support DomainSocket server in Datanode
  2. support new client for DomainSocket
  3. provide the getBlock operation on DomainSocket channel
  4. read data from InputStream passed through DomainSocket
  5. add DomainSocket related metrics in Datanode

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11627

How was this patch tested?

new unit tests and integration tests.

@ChenSammi ChenSammi marked this pull request as draft November 19, 2024 15:07
@ChenSammi
Contributor Author

The current CI build will fail because the new DomainSocket#close(boolean) API is not available in the Hadoop Common 3.3.6 jar.
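One way to confirm whether a given Hadoop jar on the classpath carries the method is a small reflection probe. This is only a sketch: `ApiProbe` and `hasMethod` are hypothetical names, and the fully qualified class name passed in `main` is assumed to be `org.apache.hadoop.net.unix.DomainSocket`.

```java
final class ApiProbe {
  private ApiProbe() {
  }

  /**
   * Returns true if the named class can be loaded and declares a method
   * with the given name and parameter types (any visibility).
   */
  static boolean hasMethod(String className, String methodName,
      Class<?>... parameterTypes) {
    try {
      // Load without running static initializers; we only inspect the class.
      Class<?> type =
          Class.forName(className, false, ApiProbe.class.getClassLoader());
      type.getDeclaredMethod(methodName, parameterTypes);
      return true;
    } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
      // Class missing, method missing, or the class failed to link.
      return false;
    }
  }

  public static void main(String[] args) {
    // Assumed class name; prints whether close(boolean) is declared there.
    System.out.println(hasMethod(
        "org.apache.hadoop.net.unix.DomainSocket", "close", boolean.class));
  }
}
```

Running `main` against the 3.3.6 jar and against a build that includes the new API should show the difference without compiling any Ozone code.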

Contributor

Is it a good idea to include a binary like this?

Does the binary come from a released Hadoop version? If so, should we write a script snippet to download the tarball and extract the binary from that instead?

Contributor Author

@ChenSammi ChenSammi Nov 20, 2024

This is only for integration tests, like the ozone-site.xml files under many test/resources folders. It will not be used in the final dist package or in deployment.

Contributor

I get it. Still, this is a bit hacky IMO :)

Can we at least document where you get/how you build the binaries?

Contributor

@smengcl smengcl left a comment

Thanks @ChenSammi for the patch.

I've only looked at test cases and some proto changes at this point. Here are my comments so far.

long bcsid = 3L;
String datanodeID = UUID.randomUUID().toString();
ContainerProtos.DatanodeBlockID.Builder blkIDBuilder =
ContainerProtos.DatanodeBlockID.newBuilder().setContainerID(containerID)
Contributor

nit

Suggested change
ContainerProtos.DatanodeBlockID.newBuilder().setContainerID(containerID)
ContainerProtos.DatanodeBlockID.newBuilder()
.setContainerID(containerID)

Contributor

containerID is repeated in the base ContainerCommandRequest below as well.

This may be unrelated to this JIRA, but can we get rid of this duplication?

request = ContainerProtos.ContainerCommandRequestProto.parseFrom(requestInBytes);
assertTrue(request.hasGetBlock());
assertEquals(ContainerProtos.Type.GetBlock, request.getCmdType());
assertEquals(containerID, request.getContainerID());
Contributor

Since we still have this duplicated field, we should check both:

Suggested change
assertEquals(containerID, request.getContainerID());
assertEquals(containerID, request.getContainerID());
assertEquals(containerID, request.getGetBlock().getBlockID().getContainerID());

@@ -287,6 +287,7 @@ enum ReplicationType {
STAND_ALONE = 2;
CHAINED = 3;
EC = 4;
SHORT_CIRCUIT = 5;
Contributor

Hmm, why is SHORT_CIRCUIT a ReplicationType? This looks hacky to me

Comment on lines 99 to +100
XceiverClientSpi client1 = clientManager
.acquireClient(container1.getPipeline());
.acquireClientForReadData(container1.getPipeline(), true);
Contributor

Shall we parameterize this? Both allowShortCircuit = {true, false} should be tested IMO

Comment on lines +90 to +91
XceiverClientSpi client2 = clientManager.acquireClientForReadData(container1.getPipeline(), true);
assertTrue(client2 instanceof XceiverClientShortCircuit);
Contributor

Is it possible to get a non-SCR client (client3) on the container1 pipeline at this point?
If so, can we also add a few lines in this test case for that? e.g.:

      XceiverClientSpi client3 = clientManager.acquireClientForReadData(container1.getPipeline(), false);
      assertTrue(client3 instanceof XceiverClientRatis);

Comment on lines +271 to +272
@Test
public void testReadWrite1() throws IOException {
Contributor

Does this test case need a timeout? Use class-wide timeout if so.

Comment on lines +268 to +292
/**
* Test a successful connection and then read/write.
*/
@Test
public void testReadWrite1() throws IOException {
testReadWrite(false, false);
}

/**
* On Linux, when there is still open file handle of a deleted file, the file handle remains open and can still
* be used to read and write the file.
*/
@Test
@Timeout(30)
public void testReadWrite2() throws IOException {
testReadWrite(true, false);
}

@Test
@Timeout(30)
public void testReadWrite3() throws IOException {
testReadWrite(false, true);
}

private void testReadWrite(boolean deleteFileBeforeRead, boolean deleteFileDuringRead) throws IOException {
Contributor

Consider using @ParameterizedTest instead like:

@ParameterizedTest
@CsvSource({
"true, true",
"true, false",
"false, true",
"false, false",
})
public void testUpgrade(boolean haEnabledBefore,
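
As an aside on the Linux behaviour described in the testReadWrite2 Javadoc (an open handle to a deleted file stays usable), the following is a minimal stand-alone sketch using only the JDK. `DeletedFileReadDemo` and `readAfterDelete` are hypothetical names and are not part of this patch; behaviour on other platforms may differ.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class DeletedFileReadDemo {
  private DeletedFileReadDemo() {
  }

  /**
   * Writes content to a temp file, opens it, deletes the path while the
   * stream is still open, then reads the content back through the stream.
   */
  static String readAfterDelete(String content) {
    try {
      Path file = Files.createTempFile("scr-demo", ".dat");
      Files.write(file, content.getBytes(StandardCharsets.UTF_8));
      try (InputStream in = Files.newInputStream(file)) {
        // Remove the directory entry; on Linux the data stays reachable
        // through the already-open handle until that handle is closed.
        Files.delete(file);
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(readAfterDelete("block data"));
  }
}
```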

@Timeout(30)
public void testReadTimeout() throws InterruptedException {
conf.set(OzoneClientConfig.OZONE_DOMAIN_SOCKET_PATH, new File(dir, "ozone-socket").getAbsolutePath());
conf.set(OzoneConfigKeys.OZONE_CLIENT_READ_TIMEOUT, "5s");
Contributor

Can we decrease this to 2s to reduce sleep() time without introducing flakiness?

Comment on lines +472 to +473
public void testMaxXceiverCount() throws IOException, InterruptedException {
conf.set(OzoneClientConfig.OZONE_DOMAIN_SOCKET_PATH, new File(dir, "ozone-socket").getAbsolutePath());
Contributor

OzoneConfigKeys.OZONE_CLIENT_READ_TIMEOUT is not explicitly set in this test case.

Should we set it here as well?

try {
// temporary disable short-circuit read
long pathExpireDuration = factory.getPathExpireMills();
factory.disableShortCircuit();
Contributor

disableShortCircuit() disables short circuit for a short while, and schedules it to be enabled after a set period of time.

Can you help me understand why this is useful?
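
For reference, the "disable now, become enabled again after a set period" behaviour described above can be sketched as follows. This is not the factory code from the patch: `TemporaryDisableSwitch` is a hypothetical class, and instead of scheduling a re-enable task it compares the current time against a deadline, with an injectable clock so the expiry can be exercised without sleeping.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

final class TemporaryDisableSwitch {
  private final long expireMillis;
  private final LongSupplier clockMillis;
  // Time (in clock millis) until which the feature stays disabled.
  private final AtomicLong disabledUntil = new AtomicLong(Long.MIN_VALUE);

  TemporaryDisableSwitch(long expireMillis, LongSupplier clockMillis) {
    this.expireMillis = expireMillis;
    this.clockMillis = clockMillis;
  }

  /** Disables the feature for expireMillis starting from now. */
  void disable() {
    disabledUntil.set(clockMillis.getAsLong() + expireMillis);
  }

  /** Returns true once the disable window (if any) has elapsed. */
  boolean isEnabled() {
    return clockMillis.getAsLong() >= disabledUntil.get();
  }

  public static void main(String[] args) throws InterruptedException {
    TemporaryDisableSwitch sw =
        new TemporaryDisableSwitch(50L, System::currentTimeMillis);
    sw.disable();
    System.out.println("enabled right after disable: " + sw.isEnabled());
    Thread.sleep(100L);
    System.out.println("enabled after expiry: " + sw.isEnabled());
  }
}
```

With a deadline check, a test can advance a fake clock instead of waiting for the expiry period, which may also help with the sleep() time raised in the timeout comment.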

@@ -34,6 +34,8 @@ public enum DatanodeVersion implements ComponentVersion {
COMBINED_PUTBLOCK_WRITECHUNK_RPC(2, "WriteChunk can optionally support " +
"a PutBlock request"),

SHORT_CIRCUIT_READS(3, "Support short-circuit reads."),
Contributor

Suggested change
SHORT_CIRCUIT_READS(3, "Support short-circuit reads."),
SHORT_CIRCUIT_READS(3, "Version with short-circuit read support."),
