go/adbc/driver/flightsql: support generic ingest #1107
Substrait can handle UPDATE etc., but I am not sure if any implementations support it. Adding support to Flight SQL may be useful; a better place to discuss it would be the Arrow dev@ mailing list. |
Thanks for the post. I think this would be useful to support in Flight SQL. I'd propose some slight changes:
message CommandStatementIngest {
  option (experimental) = true;
  enum IngestMode {
    INGEST_MODE_UNSPECIFIED = 0;
    INGEST_MODE_CREATE = 1;
    INGEST_MODE_APPEND = 2;
    INGEST_MODE_REPLACE = 3;
    INGEST_MODE_CREATE_APPEND = 4;
  }
  IngestMode mode = 1;
  string target_table = 2;
  optional string target_schema = 3;
  optional string target_catalog = 4;
  // do we want db-specific parameters?
} |
Thanks for the suggestion, it makes sense to include those fields. Regarding your question about db-specific parameters, I can see the following options:
I would appreciate your perspective on these options (or any others) @lidavidm. |
Thank you @lidavidm and @joellubi -- I also think this basic idea makes sense. In terms of the modes, SQL systems I am familiar with typically support two types, which are typically dialect specific and fairly complex:
In order to implement Thus I suggest supporting bulk insert in a generic way via a SQL query rather than an enum and table names, which would constrain how this feature gets used. Perhaps we could add something like the following (to mirror Update). It would likely make sense to have a prepared statement version of this as well.
/*
 * Represents a SQL bulk insert / upsert query. Used in the command member of FlightDescriptor
 * for the RPC call DoPut to cause the server to execute the included SQL INSERT/COPY/UPSERT/MERGE or similar
 * command, with the data in the batches in the DoPut call.
 */
message CommandStatementInsert {
  option (experimental) = true;
  // The SQL syntax.
  string query = 1;
  // Include the query as part of this transaction (if unset, the query is auto-committed).
  optional bytes transaction_id = 2;
} |
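To make the SQL-carrying variant concrete, here is a minimal Python sketch of what a server could do with such a command: run the client-supplied statement once per row of each incoming batch. It uses the standard-library `sqlite3` module as a stand-in backend. The function name `handle_statement_insert` and the list-of-tuples batch representation are hypothetical and not part of Flight SQL or any Arrow library.

```python
import sqlite3

def handle_statement_insert(conn, query, batches):
    """Run the client-supplied SQL for every row in the stream.

    `batches` is an iterable of lists of row tuples (a stand-in for
    Arrow record batches). Returns the number of rows affected.
    """
    total = 0
    # The connection context manager commits on success and rolls back
    # on error, which mirrors the "auto-committed" case of the proposal.
    with conn:
        for batch in batches:
            cursor = conn.executemany(query, batch)
            total += cursor.rowcount
    return total

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
batches = [[(1, "a"), (2, "b")], [(3, "c")]]
written = handle_statement_insert(
    conn, "INSERT INTO t (id, name) VALUES (?, ?)", batches
)
```

Note that the client still has to know the backend's INSERT dialect and placeholder style here, which is the cost of this approach.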
I think the whole idea is to avoid having to know the specific SQL query (and I believe this is less for upsert and more for plain insert/copy; ADBC doesn't handle updates for its version of this). You are right we should include the transaction field, though. |
🤔 since Insert / merge are pretty specific to the system and not standardized the SQL does vary quite a bit. Maybe in this case there is no better option than to support the lowest common denominator in terms of functionality 🤔 What is the intended semantics of If so, I wonder how that would work with SQL types, which are not typically the same as Arrow types (e.g. a SQL |
Yup, CREATE is 'create a new table, fail if already exists'. Yes, that's pretty much what ADBC does here - the driver/server does its best to map the Arrow types to reasonable database types. If you need full control you'll have to use SQL yourself. Substrait could provide some generic functionality but it's not well supported yet. |
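As a rough illustration of "does its best to map the Arrow types", the Python sketch below builds a CREATE TABLE statement from a list of (column name, Arrow type name) pairs. The mapping table, the type-name strings, and the helper names are all assumptions made for this example; real drivers pick their own mappings per database.

```python
# Hypothetical mapping from Arrow type names to SQL column types.
# A real driver or server would choose types suited to its database.
ARROW_TO_SQL = {
    "int32": "INTEGER",
    "int64": "BIGINT",
    "float64": "DOUBLE PRECISION",
    "utf8": "TEXT",
    "bool": "BOOLEAN",
}

def quote_ident(name):
    """Quote an identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'

def create_table_sql(table, fields):
    """Build CREATE TABLE from (column_name, arrow_type_name) pairs."""
    columns = []
    for name, arrow_type in fields:
        if arrow_type not in ARROW_TO_SQL:
            # No reasonable mapping: the caller has to fall back to SQL.
            raise ValueError(f"no SQL mapping for Arrow type {arrow_type!r}")
        columns.append(quote_ident(name) + " " + ARROW_TO_SQL[arrow_type])
    column_list = ", ".join(columns)
    return "CREATE TABLE " + quote_ident(table) + " (" + column_list + ")"

statement = create_table_sql("events", [("id", "int64"), ("label", "utf8")])
```

Types without an entry raise an error rather than guessing, which matches the point that full control requires writing the SQL yourself.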
@joellubi this would be my vote (just a bytes field, or possibly a bytes field + |
the latter, because it would presumably be easier for most things; but we've stuck to bytes fields elsewhere in Flight |
Got it. I agree it would be nice to use If we went with
message CommandStatementIngest {
  option (experimental) = true;
  enum IngestMode {
    INGEST_MODE_UNSPECIFIED = 0;
    INGEST_MODE_CREATE = 1;
    INGEST_MODE_APPEND = 2;
    INGEST_MODE_REPLACE = 3;
    INGEST_MODE_CREATE_APPEND = 4;
  }
  IngestMode mode = 1;
  string target_table = 2;
  optional string target_schema = 3;
  optional string target_catalog = 4;
  map<string, string> options = 5;
}
Open to suggestions on the name for that field. |
@zeroshade @ywc88 comments here? |
Should we also incorporate the
message CommandStatementIngest {
  option (experimental) = true;
  enum IngestMode {
    INGEST_MODE_UNSPECIFIED = 0;
    INGEST_MODE_CREATE = 1;
    INGEST_MODE_APPEND = 2;
    INGEST_MODE_REPLACE = 3;
    INGEST_MODE_CREATE_APPEND = 4;
  }
  IngestMode mode = 1;
  string target_table = 2;
  optional string target_schema = 3;
  optional string target_catalog = 4;
  optional bool temporary = 5;
  map<string, string> options = 1000;
} |
Ah yes, good catch. |
In general this seems very reasonable to me and I like the idea. Though I think historically we've preferred enum defined option names rather than allowing arbitrary options? Also, in proto3 all fields are |
The The options here are arbitrary and backend-specific; there is no enum we can define. |
I suspected as much, just wanted to confirm. 😄
I didn't realize that they changed it so we can get rid of the |
It's official: protocolbuffers/protobuf#10463 I think that annotation is more google API fluff and isn't really relevant here (nor would it affect codegen like the |
Thanks for the feedback on this. I've started to put together some of these changes in preparation for a pull request but have had a few further questions come up. First, here is the current draft of the proto definition I have:
/*
 * Represents a bulk ingestion request. Used in the command member of FlightDescriptor
 * for the RPC call DoPut to cause the server to load the contents of the stream's
 * FlightData into the target destination.
 */
message CommandStatementIngest {
  option (experimental) = true;
  // Describes the behavior for loading bulk data.
  enum IngestMode {
    // Ingestion behavior unspecified.
    INGEST_MODE_UNSPECIFIED = 0;
    // Create the target table. Fail if the target table already exists.
    INGEST_MODE_CREATE = 1;
    // Append to an existing target table. Fail if the target table does not exist.
    INGEST_MODE_APPEND = 2;
    // Drop the target table if it exists. Then follow INGEST_MODE_CREATE behavior.
    INGEST_MODE_REPLACE = 3;
    // Create the target table if it does not exist. Then follow INGEST_MODE_APPEND behavior.
    INGEST_MODE_CREATE_APPEND = 4;
  }
  // The ingestion behavior.
  IngestMode mode = 1;
  // The table to load data into.
  string target_table = 2;
  // The db_schema of the target_table to load data into. If unset, ... (TODO)
  optional string target_schema = 3;
  // The catalog of the target_table to load data into. If unset, ... (TODO)
  optional string target_catalog = 4;
  // Use a temporary table for target_table.
  optional bool temporary = 5;
  // Backend-specific options.
  map<string, string> options = 1000;
}
A few open questions:
|
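The four mode comments in the draft above can be read as a small decision table. Below is a minimal Python sketch of that table against the standard-library `sqlite3` module, purely for illustration. The `ingest` function, its arguments, and the choice to reject the unspecified mode are assumptions of this sketch; the draft does not say how a server should treat `INGEST_MODE_UNSPECIFIED`.

```python
import enum
import sqlite3

class IngestMode(enum.Enum):
    UNSPECIFIED = 0
    CREATE = 1
    APPEND = 2
    REPLACE = 3
    CREATE_APPEND = 4

def table_exists(conn, table):
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    return row is not None

def ingest(conn, mode, table, columns, rows):
    """Load `rows` into `table`; `columns` is a list of (name, sql_type)."""
    if mode is IngestMode.UNSPECIFIED:
        # Assumption: this sketch rejects the unspecified mode.
        raise ValueError("ingest mode must be specified")
    ident = '"' + table.replace('"', '""') + '"'
    exists = table_exists(conn, table)
    if mode is IngestMode.CREATE and exists:
        raise RuntimeError(f"table {table!r} already exists")
    if mode is IngestMode.APPEND and not exists:
        raise RuntimeError(f"table {table!r} does not exist")
    with conn:  # commit on success, roll back on error
        if mode is IngestMode.REPLACE and exists:
            conn.execute("DROP TABLE " + ident)
            exists = False
        if not exists:
            # Reached by CREATE, REPLACE, and CREATE_APPEND on a missing table.
            cols = ", ".join('"' + n + '" ' + t for n, t in columns)
            conn.execute("CREATE TABLE " + ident + " (" + cols + ")")
        placeholders = ", ".join("?" for _ in columns)
        conn.executemany(
            "INSERT INTO " + ident + " VALUES (" + placeholders + ")", rows
        )
    return len(rows)

conn = sqlite3.connect(":memory:")
cols = [("id", "INTEGER")]
ingest(conn, IngestMode.CREATE, "t", cols, [(1,), (2,)])
ingest(conn, IngestMode.APPEND, "t", cols, [(3,)])
```

After these two calls the table `t` holds three rows; a second `CREATE` on `t` would fail, and `REPLACE` would drop and rebuild it.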
I've opened a draft PR in the arrow repository with the proposed changes and the go implementation. I'd appreciate any feedback on the specifics. I thought that the following other changes might be required, but I wasn't sure:
If any or all of these are required, I'm happy to add them to the PR. |
Thanks! All are required, but please ping the mailing list for opinions; once we have all the parts we can then hold a formal vote. |
I've opened a second PR into which I've split the go implementation + integration tests, which should help add some color to the actual usage of these changes. I'd appreciate any feedback on that or the original format PR, as there have been some minor changes there as well. Thanks! |
### Rationale for this change

It was suggested in the discussion around apache/arrow-adbc#1107 for the Flight SQL ADBC driver that an "Ingest" command would be a helpful addition to the Flight SQL specification. This command would enable a Flight SQL client to provide a FlightData stream to the server without needing to know its SQL syntax, and have that stream loaded into a target table by whichever means the server deems appropriate.

### What changes are included in this PR?

- Format:
  - Add CommandStatementIngest message type to Flight SQL proto definition
  - Add FLIGHT_SQL_SERVER_BULK_INGESTION and FLIGHT_SQL_SERVER_INGEST_TRANSACTIONS_SUPPORTED options for SqlInfo
- Go:
  - Generate pb
  - Server-side implementation
  - Client-side implementation
  - Unit + integration tests
- C++:
  - Server-side implementation
  - Client-side implementation
  - Integration tests

### Are these changes tested?

Yes, see `server_test.go`, `scenario.go`, and `test_integration.cc`.

### Are there any user-facing changes?

Yes, new Flight SQL client and server functionality is being added. Changes are not expected to break existing users.

* Closes: #38255

Lead-authored-by: Joel Lubinitsky <[email protected]>
Co-authored-by: Joel Lubinitsky <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
I wanted to pose the question of what it would take to be able to support ingest with the flightsql driver. I understand that each driver is meant to supply its own specific implementation for ingestion, which makes doing so for a flightsql backend challenging because the driver wouldn't necessarily know the specifics of its underlying representation or syntax.
I had a few thoughts on how this might be achieved:
`UPDATE` or `INSERT` or `COPY` syntax to submit as a query to the backend, but perhaps a Substrait plan could abstract the "ingestion plan". Now I'm not very familiar with the details of the Substrait spec and I've only seen it used for "SELECT" style queries, so this may not even align with its stated goals. I think the first option is likely a better fit.

Elaborating on how the first option might be implemented, here's an example of how the new message type might look:
After receiving this in the FlightDescriptor, the flightsql server may then handle the subsequent stream with whichever means provide the desired throughput to the requested target.
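As a rough sketch of that flow, the Python below routes an incoming stream based on the command found in the descriptor and hands it to the backend's loader. Everything here is a hypothetical stand-in: `IngestCommand`, `InMemoryBackend`, and `do_put` only mimic the shape of the idea and are not Flight SQL APIs, and batches are plain lists of tuples rather than Arrow data.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Row = Tuple[object, ...]

@dataclass(frozen=True)
class IngestCommand:
    """Hypothetical stand-in for the proposed ingest message."""
    target_table: str

class InMemoryBackend:
    """Toy backend where each table is just a list of rows."""

    def __init__(self):
        self.tables: Dict[str, List[Row]] = {}

    def load(self, table, batches):
        # A real server would pick its fastest bulk-load path here.
        rows = self.tables.setdefault(table, [])
        before = len(rows)
        for batch in batches:
            rows.extend(batch)
        return len(rows) - before

def do_put(backend, command, batches):
    """Route the stream according to the command in the descriptor."""
    if isinstance(command, IngestCommand):
        return backend.load(command.target_table, batches)
    raise NotImplementedError(
        "unsupported command: " + type(command).__name__
    )

backend = InMemoryBackend()
loaded = do_put(backend, IngestCommand("t"), [[(1,), (2,)], [(3,)]])
```

The client only names the target; how the rows actually reach storage stays entirely on the server side.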
I would appreciate any feedback on this approach, or links to prior context I may have missed. Thanks!