Managed + Iceberg IO #5494
base: main
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Coverage Diff (main vs. #5494):
  Coverage   61.32% -> 61.32%  (-0.01%)
  Files      312    -> 312
  Lines      11080  -> 11082   (+2)
  Branches   770    -> 728     (-42)
  Hits       6795   -> 6796    (+1)
  Misses     4285   -> 4286    (+1)

View full report in Codecov by Sentry.
Can you add the new module to the README?
new Schema(
  NestedField.required(0, "a", IntegerType.get()),
  NestedField.required(1, "b", StringType.get())
),
Can't this be given by the RowType?
No, this is an Iceberg schema rather than the Beam schema. The lack of create-on-write does raise the question of whether we also need to derive the Iceberg schemas.
Maybe RowType could also offer a def icebergSchema? (Similar to how magnolify-parquet has both def schema and def avroSchema...)
That would mean pulling Iceberg deps into the Beam module, FYI.
Could make it provided, I guess, but point taken.
Seems like Beam should have a utility function for converting between Beam/Iceberg schemas. They have similar stuff for BQ/Avro/Beam schema interop. Maybe we could contribute there.
Beam does have this, but it introduces a dep on the Iceberg part of the SDK, which in theory should be managed.
I could use it in the integration test directly but that wouldn't help users at all.
IcebergUtils.beamSchemaToIcebergSchema(rowType.schema)
OTOH ... the class has this comment so 🤷
// This is made public for users convenience, as many may have more experience working with
// Iceberg types.
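
To make that concrete, a minimal sketch, assuming magnolify-beam's RowType derivation (the Record case class and its fields are hypothetical); beamSchemaToIcebergSchema lives in the Iceberg part of the Beam SDK, which is exactly the dependency concern raised above:

  // Sketch only: derive the Beam schema with magnolify and convert it with Beam's IcebergUtils.
  import magnolify.beam.RowType
  import org.apache.beam.sdk.io.iceberg.IcebergUtils

  case class Record(a: Int, b: String) // hypothetical record matching the schema above

  val rowType = RowType[Record]
  // Beam schema (from magnolify) -> Iceberg schema (via the Beam utility)
  val icebergSchema: org.apache.iceberg.Schema =
    IcebergUtils.beamSchemaToIcebergSchema(rowType.schema)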
private lazy val _config: java.util.Map[String, Object] = {
  // recursively convert this yaml-compatible nested scala map to java map
  // we either do this or the user has to create nested java maps in scala code
  // both are bad
  def _convert(a: Object): Object = {
    a match {
      case m: Map[_, _] =>
        m.asInstanceOf[Map[_, Object]].map { case (k, v) => k -> _convert(v) }.asJava
      case i: Iterable[_] => i.map(o => _convert(o.asInstanceOf[Object])).asJava
      case _              => a
    }
  }
  config.map { case (k, v) => k -> _convert(v) }.asJava
}
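
For illustration, here is the kind of nested, YAML-compatible Scala config this conversion is meant to accept; the key names below echo Beam's Iceberg managed config but are examples only, not a definitive reference:

  // Illustrative config: nested Scala maps that _config would recursively
  // convert into nested java.util.Maps before handing them to the Managed transform.
  val config: Map[String, Object] = Map(
    "table"        -> "db.table1",
    "catalog_name" -> "my_catalog",
    "catalog_properties" -> Map(
      "type"      -> "hadoop",
      "warehouse" -> "gs://my-bucket/warehouse"
    )
  )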
I'm wondering if we should introduce either a config AST to ensure what is passed is YAML-compatible, or maybe use Lightbend Config.
The API can also take a YAML file location, e.g. classpath://foo.yaml, if we wanted to support that.
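
At the Beam level that option already exists on the Managed transform itself; a sketch, assuming Beam's Managed API (the file location is the hypothetical one from the comment above):

  // Sketch: Beam's Managed transform can read its configuration from a YAML file
  // location instead of an in-memory map; whether Scio should expose this is the open question.
  import org.apache.beam.sdk.managed.Managed

  val readFromYaml =
    Managed.read(Managed.ICEBERG).withConfigUrl("classpath://foo.yaml")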
Converted to draft until
Overall looks good to me! The amount of testing + examples is impressive.
import org.apache.beam.sdk.values.{PCollectionRowTuple, Row}
import scala.jdk.CollectionConverters._

final case class ManagedIO(ioName: String, config: Map[String, Object]) extends ScioIO[Row] {
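
For orientation, a rough sketch (not the PR's actual implementation) of what a read through such an IO boils down to at the Beam level, assuming Beam's Managed and PCollectionRowTuple APIs:

  // Sketch only: expanding a managed read against a raw Beam pipeline.
  import org.apache.beam.sdk.Pipeline
  import org.apache.beam.sdk.managed.Managed
  import org.apache.beam.sdk.values.{PCollection, PCollectionRowTuple, Row}

  def expandManagedRead(
    pipeline: Pipeline,
    ioName: String,                           // e.g. Managed.ICEBERG
    javaConfig: java.util.Map[String, Object] // the converted _config from above
  ): PCollection[Row] =
    PCollectionRowTuple
      .empty(pipeline)
      .apply(Managed.read(ioName).withConfig(javaConfig))
      .getSinglePCollection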
I guess once this is merged, we can add a ManagedTypedIO to the 0.15.x branch? Maybe let's file a ticket to track the Magnolify API work...
I don't think we need to support managed very extensively, so I decided to just skip any typed variant for Managed in favor of doing it downstream in our IO-specific APIs.
Adds support for Beam's managed transforms and for Iceberg, which is implemented as a managed transform.
Note this is on a snapshot of magnolify; it will need a release before merge.