Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support protobuf struct type #206

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

tscottcoombes1
Copy link
Contributor

@tscottcoombes1 tscottcoombes1 commented Nov 26, 2024

Rationale for this change

extra support for protobuf common types, converting to parquet

What changes are included in this PR?

support for protobuf struct and tests

Are these changes tested?

yes

Are there any user-facing changes?

no, just support for new type

@tscottcoombes1 tscottcoombes1 marked this pull request as draft November 26, 2024 15:12
@tscottcoombes1 tscottcoombes1 changed the title break the tests with struct type support protobuf struct type Nov 26, 2024
@tscottcoombes1
Copy link
Contributor Author

@zeroshade can I get your thoughts on this?

spark follows this pattern: https://spark.apache.org/docs/latest/sql-data-sources-protobuf.html#handling-circular-references-protobuf-fields

syntax = "proto3"
message Person {
  string name = 1;
  Person bff = 2
}

// The protobuf schema defined above, would be converted into a Spark SQL columns with the following
// structure based on `recursive.fields.max.depth` value.

0: struct<name: string, bff: null>
1: struct<name string, bff: <name: string, bff: null>>
2: struct<name string, bff: <name: string, bff: struct<name: string, bff: null>>> ...

but parquet can't do null types

my idea, is if we get to depth = x, then we drop the field? e.g. for 2

2: struct<name string, bff: <name: string, bff: struct<name: string>>>

@tscottcoombes1
Copy link
Contributor Author

@zeroshade otherwise, as this is used for json type fields, we could convert to a json string. and not need to think about recursive depths

@zeroshade
Copy link
Member

if the intent here is to use it for specifically JSON fields, then i would say we just use a string and utilize the Canonical JSON extension type so that it gets stored in parquet as a JSON typed field.

@tscottcoombes1 tscottcoombes1 marked this pull request as ready for review November 27, 2024 18:04
Copy link
Member

@zeroshade zeroshade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, just a few nit picks

Comment on lines +21 to +22
"google.golang.org/protobuf/encoding/protojson"
"google.golang.org/protobuf/types/known/structpb"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you shift these includes to be grouped with the others of the same prefix below?

Comment on lines 21 to +22
"fmt"
"google.golang.org/protobuf/types/known/structpb"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above, please place this with the sorted imports below

@zeroshade
Copy link
Member

@tscottcoombes1 Running pre-commit should be sufficient to fix the linting issues

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants