This is yet another Ruby library that binds to DataFusion, the Apache Arrow in-memory query engine.
It is an alternative to datafusion-contrib/datafusion-ruby. Please refer to the FAQ below.
Gemfile
gem "arrow-datafusion"
App
require "datafusion"
ctx = Datafusion::SessionContext.new
# https://github.com/jychen7/arrow-datafusion-ruby/blob/main/spec/fixtures/test.csv
ctx.register_csv("csv", "test.csv")
results = ctx.sql("SELECT * FROM csv").collect
# results is an array of Datafusion::RecordBatch
results.size # 1
# to_h converts a Datafusion::RecordBatch to a Ruby Hash of column name => values
results[0].to_h # {"int": [1, 2, 3, 4], "str": ["a", "b", "c", "d"], "float": [1.1, 2.2, 3.3, 4.4]}
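Because to_h returns a column-oriented Hash, it can be handy to pivot it into rows. The sketch below is plain Ruby and does not call the gem. The helper name columns_to_rows is hypothetical and not part of arrow-datafusion, and the sample Hash is hand-written to mirror the output above, assuming String keys.

```ruby
# Hypothetical helper, NOT part of the arrow-datafusion gem.
# Converts a column-oriented Hash (column name => Array of values)
# into an Array of row Hashes.
def columns_to_rows(columns)
  names = columns.keys
  return [] if names.empty?

  # Assumption: every column in a record batch has the same length.
  row_count = columns[names.first].size
  Array.new(row_count) do |i|
    names.each_with_object({}) { |name, row| row[name] = columns[name][i] }
  end
end

# Hand-written sample mirroring the to_h output shown above
# (String keys are an assumption).
batch_hash = {
  "int" => [1, 2, 3, 4],
  "str" => ["a", "b", "c", "d"],
  "float" => [1.1, 2.2, 3.3, 4.4]
}

rows = columns_to_rows(batch_hash)
puts rows.size
```

With a real query, the same helper could be applied to each element of results, for example results.flat_map { |batch| columns_to_rows(batch.to_h) }, assuming to_h returns the shape shown above.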
SessionContext
- new
- register_csv
- sql
- register_json
- register_parquet
- register_udf
Dataframe
- new
- collect
- schema
- select_columns
- select
- filter
- aggregate
- sort
- limit
- show
- join
- explain
Please see the Contribution Guide.
As of 2022-07, there are a few popular Ruby bindings for Rust: Rutie, Magnus, and other alternatives. Magnus was picked because its API seems cleaner and it seems clearer about what is safe vs. unsafe. The author of Magnus has a "maybe bias" comparison in this reddit thread. The choice is subjective, and it should not take much effort if we decide to switch to a different Ruby binding for Rust in the future.
The module name Datafusion follows datafusion and datafusion-python. The gem name datafusion was already taken on rubygems.org in 2016, so our gem is called arrow-datafusion.
This is similar to the Ruby bindings of Arrow, where the gem is called red-arrow and the module is called arrow.
datafusion-contrib/datafusion-python was the first binding of Arrow DataFusion (Rust). It was implemented using pyo3 for Rust -> Python. Besides Python, the DataFusion community also wants Java and other language bindings. To share development resources, datafusion-contrib/datafusion-c was created and is used in datafusion-contrib/datafusion-ruby. This Rust -> C -> Ruby/Python/Java/etc. implementation is published as the gem "red-datafusion" and is coupled with "red-arrow".
Around the time "red-datafusion" was created, I wanted to use Arrow DataFusion in Ruby, mainly to query object stores such as S3/GCS, so I created Rust -> Ruby bindings using Magnus. I keep this Rust -> Ruby implementation as an alternative and publish it as the gem arrow-datafusion. To keep things simple, "arrow-datafusion" is not coupled with "red-arrow" at the moment.
PS: DataFusion Python was coupled with PyArrow. There is a proposal to separate them in the medium to long term. For details, please refer to Can datafusion-python be used without pyarrow?.