A local file source plugin for CloudQuery that loads data from a file (in JSON, YAML, CSV or XLSX format) into any database, data warehouse or data lake supported by CloudQuery, such as PostgreSQL, BigQuery, Athena and many more.
See the CloudQuery Quickstart for general information on how to configure the source and destination.
This source plugin can extract data from local files (given their paths), apply some basic transformations and then provide the data to CloudQuery for further processing and loading into the configured destinations.
You can find example configurations in the `_test/` folder, with specific settings for CSV, Microsoft Excel, JSON and YAML.
The basic configuration is as follows:
```yaml
---
kind: source
spec:
  name: test1
  path: dihedron/file
  version: v0.1.0
  tables: ["*"]
  destinations:
    - sqlite
  spec:
    file: ./test.csv
    format: csv
    separator: ","
    table:
      name: T1
      filter: _.color startsWith 'b'
      columns:
        - name: color
          type: string
          key: true
          unique: true
          notnull: true
        - name: value
          type: string
          unique: true
          notnull: true
        - name: optimized
          type: boolean
          notnull: true
        - name: count
          type: integer
          notnull: true
      relations:
        - name: T1_UPPER
          filter: _.color startsWith 'blu'
          columns:
            - name: upper_color
              type: string
              key: true
              unique: true
              notnull: true
              transform: "{{index .Row \"color\" | toString | upper}}"
            - name: upper_value
              type: string
              transform: "{{index .Row \"value\" | toString | upper}}"
        - name: T1_UPPER
          filter: _.color startsWith 'bla'
          columns:
            - name: upper_color
              type: string
              key: true
              unique: true
              notnull: true
              transform: "{{index .Row \"color\" | toString | upper}}"
            - name: upper_value
              type: string
              transform: "{{index .Row \"value\" | toString | upper}}"
```
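For reference, a hypothetical `test.csv` that this configuration could parse might look like this (the data is purely illustrative; note the mandatory header row):

```csv
color,value,optimized,count
blue,azure,true,1
black,coal,false,2
red,ruby,true,3
```

With the `filter` above, only the rows whose `color` starts with `b` would be emitted to CloudQuery.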
This source plugin does not export tables per se: you have to provide the information about the tables and columns that it should extract from the input file. Thus, the plugin can export any kind of table, along with dependent tables ("relations"), based on the metadata provided in the plugin-specific `spec` section.
The first part of the configuration file specifies that the plugin (`dihedron/file`) at version `v0.1.0` (the latest) should import all tables into an `sqlite` destination in CloudQuery.
The following `spec` provides the plugin-specific configuration:
- first of all, it specifies the path to the file that provides the information to be imported into CloudQuery (`./test.csv`), that the file format is `csv` (other supported values are `json`, `yaml` and `xlsx`) and that the records are separated by a `,` (only required for `csv`); CSV and XLSX files must provide a header row that is used to name the columns, whereas JSON and YAML files must be arrays of objects, i.e. start with `[` or with a list element (`- ...`); the column (CSV, XLSX) or object attribute (JSON, YAML) names are used in the following column specification section (a JSON sketch follows this list);
- secondly, the configuration specifies the main table to be imported; each table has a `name` and can provide an optional `filter`, which is an expression that is applied to each row and should return either `true` (in which case the row is sent to CloudQuery) or `false` (the row is dropped); the expression is based on this rule engine grammar; the current row is addressed via the `_` identifier and its fields are accessed as properties (`_.color`);
- then the table's columns are enumerated and described; each column has a `name`, a `type`, and can additionally specify whether it is part of the primary key (`key: true`), whether it has unique values (`unique: true`) and whether it is non-nullable (`notnull: true`); moreover, the `transform` property provides a way to transform (or set to a constant value) the extracted value, both for data cleansing and for conditional extraction, according to the Golang templates syntax; the templating engine has the whole of the Sprig functions available;
- last, a table can have dependent tables (`relations`), which are weak entities related to the main one; relations are useful when a single line in a CSV, XLSX, YAML or JSON file actually embeds multiple entities, e.g. a host (`hostname`, `serial`, `ram`, `cpus`...) and its (possibly multiple) dependent NICs (`mac_address`, `ip_address`, `port_type`...).
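As noted in the first item above, JSON and YAML inputs must be arrays of objects; a hypothetical JSON equivalent of the CSV sample shown earlier would be:

```json
[
  { "color": "blue",  "value": "azure", "optimized": true,  "count": 1 },
  { "color": "black", "value": "coal",  "optimized": false, "count": 2 },
  { "color": "red",   "value": "ruby",  "optimized": true,  "count": 3 }
]
```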
Relations let you declare the main table (e.g. a `hosts` table) and the dependent entities (e.g. a `host_nics` table for the host's NICs) separately, and then have the plugin extract the different entities -- host and NICs, even with 1:N cardinality -- automatically, as sketched below.
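A minimal sketch of what such a spec could look like, modeled on the example above (the file name and columns are hypothetical, and the column lists are abridged):

```yaml
spec:
  file: ./hosts.csv
  format: csv
  separator: ","
  table:
    name: hosts
    columns:
      - name: hostname
        type: string
        key: true
        notnull: true
      - name: serial
        type: string
    relations:
      - name: host_nics
        columns:
          - name: mac_address
            type: string
            key: true
          - name: ip_address
            type: string
```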
Refer to the provided tests to see how this mechanism works.
The source plugin is largely feature-complete.
It may be useful to add support for URLs in the `file` field (which would probably be deprecated in favour of a more generic `source` property), so that the plugin retrieves the file at the given URL prior to importing it; support should extend to at least HTTP(S), FTP, S3 and Git URLs in addition to local `file://` paths.
The implementation could leverage HashiCorp's go-getter library.
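A minimal sketch of how the retrieval step might work with go-getter, assuming a hypothetical remote URL (go-getter natively handles `file://`, `http(s)://`, `s3://` and `git::` sources, among others):

```go
package main

import (
	"log"

	getter "github.com/hashicorp/go-getter"
)

func main() {
	// Hypothetical example: download the remote source to a local
	// scratch path before handing it to the existing file parser.
	src := "https://example.com/data/test.csv" // assumed remote location
	dst := "/tmp/test.csv"                     // local scratch path
	if err := getter.GetFile(dst, src); err != nil {
		log.Fatalf("cannot fetch %s: %v", src, err)
	}
	log.Printf("fetched %s to %s", src, dst)
	// From here on, the plugin would parse the local copy exactly as if
	// it had been specified directly in the `file` field.
}
```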
- Run `make test` to run the tests
- Run `make lint` to run the linter
- Run `make gen-docs` to regenerate the documentation
- Run `git tag v1.0.0` to create a new tag for the release (replace `v1.0.0` with the new version number)
- Run `git push origin v1.0.0` to push the tag to GitHub
Once the tag is pushed, a new GitHub Actions workflow will be triggered to build the release binaries and create the new release on GitHub. To customize the release notes, see the GoReleaser changelog configuration docs.