Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

go/adbc/driver/flightsql: support generic ingest #1107

Open
joellubi opened this issue Sep 26, 2023 · 24 comments
Open

go/adbc/driver/flightsql: support generic ingest #1107

joellubi opened this issue Sep 26, 2023 · 24 comments

Comments

@joellubi
Copy link
Member

I wanted to pose the question of what it would take to be able to support ingest with the flightsql driver. I understand that each driver is meant to supply its own specific implementation for ingestion, which makes doing so for a flightsql backend challenging because the driver wouldn't necessarily know the specifics of it's underlying representation or syntax.

I had a few thoughts on how this might be achieved:

  • One option would be to extend the flightsql spec to support ingest natively. This would enable any backend to implement its own ingest logic, and the adbc ingest would simply map to the flightsql ingest. This could be done by adding a new command message type to include in the FlightDescriptor of a DoPut call.
  • Another option, though perhaps longer term and less clear, would be to defer this to substrait. The problem is that a generic flightsql adbc driver would not know the UPDATE or INSERT or COPY syntax to submit as a query to the backend, but perhaps a substrait plan could abstract the "ingestion plan". Now I'm not very familiar with the details of the substrait spec and I've only seen it used for "SELECT" style queries, so this may not even align with its stated goals. I think the first option is likely a better fit.

Elaborating on how the first option might be implemented, here's an example of how the new message type might look:

message CommandStatementIngest {
  option (experimental) = true;

  enum IngestMode {
    INGEST_MODE_CREATE = 0;
    INGEST_MODE_APPEND = 1;
    INGEST_MODE_REPLACE = 2;
    INGEST_MODE_CREATE_APPEND = 3;
  }

  string target_table = 1;

  IngestMode mode = 2;
}

After receiving this in the FlightDescriptor, the flightsql server may then handle the subsequent stream with whichever means provide the desired throughput to the requested target.

I would appreciate any feedback on this approach, or links to prior context I may have missed. Thanks!

@lidavidm
Copy link
Member

Substrait can handle UPDATE etc. but I am not sure if any implementations support it.

Adding support to Flight SQL may be useful, a better place to discuss would be the Arrow dev@ mailing list

@lidavidm
Copy link
Member

Thanks for the post. I think this would be useful to support in Flight SQL. I'd propose some slight changes:

message CommandStatementIngest {
  option (experimental) = true;

  enum IngestMode {
    INGEST_MODE_UNSPECIFIED = 0;
    INGEST_MODE_CREATE = 1;
    INGEST_MODE_APPEND = 2;
    INGEST_MODE_REPLACE = 3;
    INGEST_MODE_CREATE_APPEND = 4;
  }

  IngestMode mode = 1;
  string target_table = 1;
  optional string target_schema = 2;
  optional string target_catalog = 3;
  // do we want db-specific parameters?
}

@lidavidm lidavidm changed the title driver/flightsql: support generic ingest go/adbc/driver/flightsql: support generic ingest Sep 27, 2023
@joellubi
Copy link
Member Author

Thanks for the suggestion, it makes sense to include those fields.

Regarding your question about db-specific parameters, I can see the following options:

  1. Add each option explicitly to the spec, but this would probably only work for parameters that generalize across many backends (i.e. not db-specific).
  2. Add a single field capable of storing arbitrary key-value pairs, like app_metadata.
  3. Use the existing app_metadata field of the initial FlightData sent as part of DoPut to include db-specific kv pairs.

I would appreciate your perspective on these options (or any others) @lidavidm.

@alamb
Copy link

alamb commented Sep 28, 2023

Thank you @lidavidm and @joellubi -- I also think this basic idea makes sense.

In terms of the modes, SQL systems I am familiar with typically support two types, which are typically dialect specific and fairly complex:

  1. INSERT/COPY: appends the new rows to the target table
  2. UPSERT/MERGE potentially updates existing rows if present, and inserts new rows if not (for example, snowflake)

In order to implement UPSERT / MERGE you typically need to specify the criteria of what qualifies as an update. For some systems this is done via a declaration of PRIAMARY KEY but many others you can also potentially specify a custom matching condition

Thus I suggest supporting bulk insert in a generic way via a SQL query rather than an enum and table names, which would constrain how this feature gets used.

Perhaps we could add something like the follow (to mirror Update). It would likely make sense to have a prepared statement version of this as well

/*
 * Represents a SQL bulk insert / upsert query. Used in the command member of FlightDescriptor
 * for the the RPC call DoPut to cause the server to execute the included SQL INSERT/COPY/UPSERT/MERGE or similar
 * command, with the data in the batches in the DoPut call.
 */
message CommandStatementInsert {
  option (experimental) = true;

  // The SQL syntax.
  string query = 1;
  // Include the query as part of this transaction (if unset, the query is auto-committed).
  optional bytes transaction_id = 2;
}

@lidavidm
Copy link
Member

I think the whole idea is to avoid having to know the specific SQL query (and I believe this is less for upsert and more for plain insert/copy; ADBC doesn't handle updates for its version of this). You are right we should include the transaction field, though.

@alamb
Copy link

alamb commented Sep 28, 2023

🤔 since Insert / merge are pretty specific to the system and not standardized the SQL does vary quite a bit. Maybe in this case there is no better option than to support the lowest common denominator in terms of functionality 🤔

What is the intended semantics of CREATE by the way? Is it meant to create a new table it it doesn't already exist?

If so, I wonder how that would work with SQL types, which are not typically the same as Arrow types (e.g. a SQL VARCHAR might be a Utf8Array or a LargeUtf8Array). Is the idea that each FlightSQL implementation would automatically provide a mapping to the types, and if users wanted more control (like defining specific types or keys, etc) the would use the native DDL commands (CREATE TABLE ...)?

@lidavidm
Copy link
Member

Yup, CREATE is 'create a new table, fail if already exists'.

Yes, that's pretty much what ADBC does here - the driver/server does its best to map the Arrow types to reasonable database types. If you need full control you'll have to use SQL yourself.

Substrait could provide some generic functionality but it's not well supported yet.

@lidavidm
Copy link
Member

Add a single field capable of storing arbitrary key-value pairs, like app_metadata.

@joellubi this would be my vote (just a bytes field, or possibly a bytes field + map<string, string>)

@joellubi
Copy link
Member Author

just a bytes field, or possibly a bytes field + map<string, string>

@lidavidm Could you please elaborate on what you mean by the latter option? Do you mean a bytes field encoding a map<string, string> or to actually use the native proto map field?

@lidavidm
Copy link
Member

the latter, because it would presumably be easier for most things; but we've stuck to bytes fields elsewhere in Flight

@joellubi
Copy link
Member Author

Got it. I agree it would be nice to use map though I'm unfamiliar if there are any prior decisions or precedent that might make bytes preferable.

If we went with map then this would be the current draft:

message CommandStatementIngest {
  option (experimental) = true;

  enum IngestMode {
    INGEST_MODE_UNSPECIFIED = 0;
    INGEST_MODE_CREATE = 1;
    INGEST_MODE_APPEND = 2;
    INGEST_MODE_REPLACE = 3;
    INGEST_MODE_CREATE_APPEND = 4;
  }

  IngestMode mode = 1;
  string target_table = 2;
  optional string target_schema = 3;
  optional string target_catalog = 4;
  map<string, string> options = 5;
}

Open to suggestions on the name for that field. app_metadata is used in other messages but it might be unintuitive to call it the same thing if it's not a bytes type.

@lidavidm
Copy link
Member

@zeroshade @ywc88 comments here?

@judahrand
Copy link
Contributor

judahrand commented Oct 2, 2023

Should we also incorporate the temporary flag which ADBC uses?

message CommandStatementIngest {
  option (experimental) = true;

  enum IngestMode {
    INGEST_MODE_UNSPECIFIED = 0;
    INGEST_MODE_CREATE = 1;
    INGEST_MODE_APPEND = 2;
    INGEST_MODE_REPLACE = 3;
    INGEST_MODE_CREATE_APPEND = 4;
  }

  IngestMode mode = 1;
  string target_table = 2;
  optional string target_schema = 3;
  optional string target_catalog = 4;
  optional bool temporary = 5;
  map<string, string> options = 1000;
}

@lidavidm
Copy link
Member

lidavidm commented Oct 2, 2023

Ah yes, good catch.

@zeroshade
Copy link
Member

In general this seems very reasonable to me and I like the idea. Though I think historically we've preferred enum defined option names rather than allowing arbitrary options?

Also, in proto3 all fields are optional so the optional tag on those fields is extraneous and unnecessary, and also invalid syntax for proto3 (technically we already end up having to use --experimental_allow_proto3_optional for FlightSql but I'd prefer to avoid adding more cases of optional in the proto file if we can avoid it)

@lidavidm
Copy link
Member

lidavidm commented Oct 5, 2023

The optional is to signify which parameters aren't actually required (that said it isn't a big deal here, I suppose). Newer versions of Protobuf will allow that by default and then it is just part of Protobuf.

The options here are arbitrary and backend-specific; there is no enum we can define.

@zeroshade
Copy link
Member

The options here are arbitrary and backend-specific; there is no enum we can define.

I suspected as much, just wanted to confirm. 😄

The optional is to signify which parameters aren't actually required (that said it isn't a big deal here, I suppose). Newer versions of Protobuf will allow that by default and then it is just part of Protobuf.

I didn't realize that they changed it so we can get rid of the --experimental_allow_proto3_optional, that's good. Though it looks like using it is discouraged and it's preferred to use the google.api.field_behavior field annotation instead?

@lidavidm
Copy link
Member

lidavidm commented Oct 5, 2023

It's official: protocolbuffers/protobuf#10463

I think that annotation is more google API fluff and isn't really relevant here (nor would it affect codegen like the optional keyword)

@joellubi
Copy link
Member Author

Thanks for the feedback on this. I've started to put together some of these changes in preparation for a pull request but have had a few further questions come up.

First, here is the current draft of the proto definition I have:

  /*
   * Represents a bulk ingestion request. Used in the command member of FlightDescriptor
   * for the the RPC call DoPut to cause the server load the contents of the stream's
   * FlightData into the target destination.
   */
message CommandStatementIngest {
  option (experimental) = true;

  // Describes the behavior for loading bulk data.
  enum IngestMode {
    // Ingestion behavior unspecified.
    INGEST_MODE_UNSPECIFIED = 0;
    // Create the target table. Fail if the target table already exists.
    INGEST_MODE_CREATE = 1;
    // Append to an existing target table. Fail if the target table does not exist.
    INGEST_MODE_APPEND = 2;
    // Drop the target table if it exists. Then follow INGEST_MODE_CREATE behavior.
    INGEST_MODE_REPLACE = 3;
    // Create the target table if it does not exist. Then follow INGEST_MODE_APPEND behavior.
    INGEST_MODE_CREATE_APPEND = 4;
  }

  // The ingestion behavior.
  IngestMode mode = 1;
  // The table to load data into.
  string target_table = 2;
  // The db_schema of the target_table to load data into. If unset, ... (TODO)
  optional string target_schema = 3;
  // The catalog of the target_table to load data into. If unset, ... (TODO)
  optional string target_catalog = 4;
  // Use a temporary table for target_table.
  optional bool temporary = 5;
  // Backend-specific options.
  map<string, string> options = 1000;
}

A few open questions:

  1. How should the behavior of an unset target_schema or target_catalog be described? I suppose it would have to use the backend-specific default, if one exists. Would that be a reasonable specification of the behavior?
  2. @alamb had a comment about including optional bytes transaction_id as a field. I think this makes sense but it's not clear to me how a server should answer SqlInfo requests about SQL_TRANSACTIONS_SUPPORTED or FLIGHT_SQL_SERVER_TRANSACTION if it supports transactions for queries but not necessarily bulk ingestion, which could often times be the case. We could add more options to FLIGHT_SQL_SERVER_TRANSACTION to capture the distinction, but perhaps it might be simpler to leave it out of the spec and wait until transaction support for bulk ingestion is explicitly needed in the spec before making those extensions.
  3. A more general development/contributing question: Are there specific tool versions specified to run when generating files during development. For example there are several newer versions of protoc-gen-go than the one the go pb files are currently generated with (v1.28.1) and I was wondering if it would be an issue to bump these versions.

@lidavidm
Copy link
Member

  1. Yes, it's up to the backend. This will perhaps become clearer as the proposal for session state in Flight SQL gets worked out (since 'current' catalog/schema are part of that)
  2. I think we can include it up-front. It might be better to use a separate SqlInfo value for it, though.
  3. Is there a version in the go.mod?

@joellubi
Copy link
Member Author

@lidavidm

  1. Got it, thanks.
  2. Ok I'll make that change.
  3. Yes the version is currently v1.31.0, so it probably makes sense to use that version. I was a little confused because the file has still been generated with v1.28.1 despite go.mod indicating otherwise. TLDR it turns out that because protoc-gen-go is a plugin for protoc and not directly runnable via go run ..., there's not a simple way to tell //go:generate to run the version specified in go.mod. That's unfortunate but at least I can align my local environment with go.mod manually.

@joellubi
Copy link
Member Author

I've opened a draft PR in the arrow repository with the proposed changes and the go implementation. I'd appreciate any feedback on the specifics.

I thought that the following other changes might be required, but I wasn't sure:

  • Documentation
  • 2nd Implementation
  • Integration Tests

If any or all of these are required, I'm happy to add them to the PR.

@lidavidm
Copy link
Member

Thanks! All are required, but please ping the mailing list for opinions; once we have all the parts we can then hold a formal vote.

@joellubi
Copy link
Member Author

I've opened a second PR into which I've split the go implementation + integration tests, which should help add some color to the actual usage of these changes. I'd appreciate any feedback on that or the original format PR, as there have been some minor changes there as well. Thanks!

zeroshade pushed a commit to apache/arrow that referenced this issue Apr 17, 2024
### Rationale for this change

It was suggested in the discussion around apache/arrow-adbc#1107 for the Flight SQL ADBC driver that an "Ingest" command would be a helpful addition to the Flight SQL specification. This command would enable a Flight SQL client to provide a FlightData stream to the server without needing to know its SQL syntax, and have that stream loaded into a target table by whichever means the server deems appropriate.

### What changes are included in this PR?

- Format:
  - Add CommandStatementIngest message type to Flight SQL proto definition
  - Add FLIGHT_SQL_SERVER_BULK_INGESTION and FLIGHT_SQL_SERVER_INGEST_TRANSACTIONS_SUPPORTED options for SqlInfo
- Go:
  - Generate pb
  - Server-side implementation
  - Client-side implementation
  - Unit + integration tests
- C++:
  - Server-side implementation
  - Client-side implementation
  - Integration tests

### Are these changes tested?

Yes, see `server_test.go`, `scenario.go`, and `test_integration.cc`.

### Are there any user-facing changes?

Yes, new Flight SQL client and server functionality is being added. Changes are not expected to break existing users.

* Closes: #38255

Lead-authored-by: Joel Lubinitsky <[email protected]>
Co-authored-by: Joel Lubinitsky <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
tolleybot pushed a commit to tmct/arrow that referenced this issue May 2, 2024
…#38385)

### Rationale for this change

It was suggested in the discussion around apache/arrow-adbc#1107 for the Flight SQL ADBC driver that an "Ingest" command would be a helpful addition to the Flight SQL specification. This command would enable a Flight SQL client to provide a FlightData stream to the server without needing to know its SQL syntax, and have that stream loaded into a target table by whichever means the server deems appropriate.

### What changes are included in this PR?

- Format:
  - Add CommandStatementIngest message type to Flight SQL proto definition
  - Add FLIGHT_SQL_SERVER_BULK_INGESTION and FLIGHT_SQL_SERVER_INGEST_TRANSACTIONS_SUPPORTED options for SqlInfo
- Go:
  - Generate pb
  - Server-side implementation
  - Client-side implementation
  - Unit + integration tests
- C++:
  - Server-side implementation
  - Client-side implementation
  - Integration tests

### Are these changes tested?

Yes, see `server_test.go`, `scenario.go`, and `test_integration.cc`.

### Are there any user-facing changes?

Yes, new Flight SQL client and server functionality is being added. Changes are not expected to break existing users.

* Closes: apache#38255

Lead-authored-by: Joel Lubinitsky <[email protected]>
Co-authored-by: Joel Lubinitsky <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
tolleybot pushed a commit to tmct/arrow that referenced this issue May 4, 2024
…#38385)

### Rationale for this change

It was suggested in the discussion around apache/arrow-adbc#1107 for the Flight SQL ADBC driver that an "Ingest" command would be a helpful addition to the Flight SQL specification. This command would enable a Flight SQL client to provide a FlightData stream to the server without needing to know its SQL syntax, and have that stream loaded into a target table by whichever means the server deems appropriate.

### What changes are included in this PR?

- Format:
  - Add CommandStatementIngest message type to Flight SQL proto definition
  - Add FLIGHT_SQL_SERVER_BULK_INGESTION and FLIGHT_SQL_SERVER_INGEST_TRANSACTIONS_SUPPORTED options for SqlInfo
- Go:
  - Generate pb
  - Server-side implementation
  - Client-side implementation
  - Unit + integration tests
- C++:
  - Server-side implementation
  - Client-side implementation
  - Integration tests

### Are these changes tested?

Yes, see `server_test.go`, `scenario.go`, and `test_integration.cc`.

### Are there any user-facing changes?

Yes, new Flight SQL client and server functionality is being added. Changes are not expected to break existing users.

* Closes: apache#38255

Lead-authored-by: Joel Lubinitsky <[email protected]>
Co-authored-by: Joel Lubinitsky <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
rok pushed a commit to tmct/arrow that referenced this issue May 8, 2024
…#38385)

### Rationale for this change

It was suggested in the discussion around apache/arrow-adbc#1107 for the Flight SQL ADBC driver that an "Ingest" command would be a helpful addition to the Flight SQL specification. This command would enable a Flight SQL client to provide a FlightData stream to the server without needing to know its SQL syntax, and have that stream loaded into a target table by whichever means the server deems appropriate.

### What changes are included in this PR?

- Format:
  - Add CommandStatementIngest message type to Flight SQL proto definition
  - Add FLIGHT_SQL_SERVER_BULK_INGESTION and FLIGHT_SQL_SERVER_INGEST_TRANSACTIONS_SUPPORTED options for SqlInfo
- Go:
  - Generate pb
  - Server-side implementation
  - Client-side implementation
  - Unit + integration tests
- C++:
  - Server-side implementation
  - Client-side implementation
  - Integration tests

### Are these changes tested?

Yes, see `server_test.go`, `scenario.go`, and `test_integration.cc`.

### Are there any user-facing changes?

Yes, new Flight SQL client and server functionality is being added. Changes are not expected to break existing users.

* Closes: apache#38255

Lead-authored-by: Joel Lubinitsky <[email protected]>
Co-authored-by: Joel Lubinitsky <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
rok pushed a commit to tmct/arrow that referenced this issue May 8, 2024
…#38385)

### Rationale for this change

It was suggested in the discussion around apache/arrow-adbc#1107 for the Flight SQL ADBC driver that an "Ingest" command would be a helpful addition to the Flight SQL specification. This command would enable a Flight SQL client to provide a FlightData stream to the server without needing to know its SQL syntax, and have that stream loaded into a target table by whichever means the server deems appropriate.

### What changes are included in this PR?

- Format:
  - Add CommandStatementIngest message type to Flight SQL proto definition
  - Add FLIGHT_SQL_SERVER_BULK_INGESTION and FLIGHT_SQL_SERVER_INGEST_TRANSACTIONS_SUPPORTED options for SqlInfo
- Go:
  - Generate pb
  - Server-side implementation
  - Client-side implementation
  - Unit + integration tests
- C++:
  - Server-side implementation
  - Client-side implementation
  - Integration tests

### Are these changes tested?

Yes, see `server_test.go`, `scenario.go`, and `test_integration.cc`.

### Are there any user-facing changes?

Yes, new Flight SQL client and server functionality is being added. Changes are not expected to break existing users.

* Closes: apache#38255

Lead-authored-by: Joel Lubinitsky <[email protected]>
Co-authored-by: Joel Lubinitsky <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
@lidavidm lidavidm added this to the ADBC Libraries 13 milestone May 10, 2024
vibhatha pushed a commit to vibhatha/arrow that referenced this issue May 25, 2024
…#38385)

### Rationale for this change

It was suggested in the discussion around apache/arrow-adbc#1107 for the Flight SQL ADBC driver that an "Ingest" command would be a helpful addition to the Flight SQL specification. This command would enable a Flight SQL client to provide a FlightData stream to the server without needing to know its SQL syntax, and have that stream loaded into a target table by whichever means the server deems appropriate.

### What changes are included in this PR?

- Format:
  - Add CommandStatementIngest message type to Flight SQL proto definition
  - Add FLIGHT_SQL_SERVER_BULK_INGESTION and FLIGHT_SQL_SERVER_INGEST_TRANSACTIONS_SUPPORTED options for SqlInfo
- Go:
  - Generate pb
  - Server-side implementation
  - Client-side implementation
  - Unit + integration tests
- C++:
  - Server-side implementation
  - Client-side implementation
  - Integration tests

### Are these changes tested?

Yes, see `server_test.go`, `scenario.go`, and `test_integration.cc`.

### Are there any user-facing changes?

Yes, new Flight SQL client and server functionality is being added. Changes are not expected to break existing users.

* Closes: apache#38255

Lead-authored-by: Joel Lubinitsky <[email protected]>
Co-authored-by: Joel Lubinitsky <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
kou pushed a commit to apache/arrow-go that referenced this issue Aug 30, 2024
### Rationale for this change

It was suggested in the discussion around apache/arrow-adbc#1107 for the Flight SQL ADBC driver that an "Ingest" command would be a helpful addition to the Flight SQL specification. This command would enable a Flight SQL client to provide a FlightData stream to the server without needing to know its SQL syntax, and have that stream loaded into a target table by whichever means the server deems appropriate.

### What changes are included in this PR?

- Format:
  - Add CommandStatementIngest message type to Flight SQL proto definition
  - Add FLIGHT_SQL_SERVER_BULK_INGESTION and FLIGHT_SQL_SERVER_INGEST_TRANSACTIONS_SUPPORTED options for SqlInfo
- Go:
  - Generate pb
  - Server-side implementation
  - Client-side implementation
  - Unit + integration tests
- C++:
  - Server-side implementation
  - Client-side implementation
  - Integration tests

### Are these changes tested?

Yes, see `server_test.go`, `scenario.go`, and `test_integration.cc`.

### Are there any user-facing changes?

Yes, new Flight SQL client and server functionality is being added. Changes are not expected to break existing users.

* Closes: #38255

Lead-authored-by: Joel Lubinitsky <[email protected]>
Co-authored-by: Joel Lubinitsky <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants