A local file source plugin for CloudQuery that loads data from a file (in JSON, YAML, CSV or XLSX format) into any database, data warehouse or data lake supported by CloudQuery, such as PostgreSQL, BigQuery, Athena and many more.
See the CloudQuery Quickstart for general information on how to configure the source and destination.
This source plugin can extract data from local files (given their paths), apply some basic transformations and then provide the data to CloudQuery for further processing and loading into the configured destinations.
You can find example configurations in the `_test/` folder, with specific settings for CSV, Microsoft Excel, JSON and YAML.
The basic configuration is as follows:
```yaml
---
kind: source
spec:
  name: test1
  path: dihedron/file
  version: v0.1.0
  tables: ["*"]
  destinations:
    - sqlite
  spec:
    file: ./test.csv
    format: csv
    separator: ","
    table:
      name: T1
      filter: _.color startsWith 'b'
      columns:
        - name: color
          type: string
          key: true
          unique: true
          notnull: true
        - name: value
          type: string
          unique: true
          notnull: true
        - name: optimized
          type: boolean
          notnull: true
        - name: count
          type: integer
          notnull: true
      relations:
        - name: T1_UPPER
          filter: _.color startsWith 'blu'
          columns:
            - name: upper_color
              type: string
              key: true
              unique: true
              notnull: true
              transform: "{{index .Row \"color\" | toString | upper}}"
            - name: upper_value
              type: string
              transform: "{{index .Row \"value\" | toString | upper}}"
        - name: T1_UPPER
          filter: _.color startsWith 'bla'
          columns:
            - name: upper_color
              type: string
              key: true
              unique: true
              notnull: true
              transform: "{{index .Row \"color\" | toString | upper}}"
            - name: upper_value
              type: string
              transform: "{{index .Row \"value\" | toString | upper}}"
```
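For reference, a hypothetical `test.csv` that this configuration could parse might look like this (the data is purely illustrative; note the mandatory header row):

```csv
color,value,optimized,count
blue,azure,true,1
black,coal,false,2
red,ruby,true,3
```

With the `filter` above, only the rows whose `color` starts with `b` would be emitted to CloudQuery.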
This source plugin does not export tables per se: you have to provide the information about the tables and columns that it should extract from the input file. Thus, the plugin can export any kind of table, along with dependent tables ("relations"), based on the metadata provided in the plugin-specific `spec` section.
The first part of the configuration file specifies that the plugin (`dihedron/file`) at version `v0.1.0` (the latest) should import all tables into an `sqlite` destination in CloudQuery.
The following `spec` provides the plugin-specific configuration:
- first of all, it specifies the path to the file that provides the information to be imported into CloudQuery (`./test.csv`), that the file format is `csv` (other supported values are `json`, `yaml` and `xlsx`) and that the records are separated by a `,` (only required for `csv`); CSV and XLSX files must provide a header row that is used to name the columns, whereas JSON and YAML files must be arrays of objects, i.e. start with `[` or with a list element (`- ...`); the column (CSV, XLSX) or object attribute (JSON, YAML) names are used in the following column specification section (a JSON sketch follows this list);
- secondly, the configuration specifies the main table to be imported; each table has a `name` and can provide an optional `filter`, which is an expression that is applied to each row and should return either `true` (in which case the row is sent to CloudQuery) or `false` (the row is dropped); the expression is based on this rule engine grammar; the current row is addressed via the `_` identifier and its fields are accessed as properties (`_.color`);
- then the table's columns are enumerated and described; each column has a `name`, a `type`, and can additionally specify whether it is part of the primary key (`key: true`), whether it has unique values (`unique: true`) and whether it is non-nullable (`notnull: true`); moreover, the `transform` property provides a way to transform (or set to a constant value) the extracted value, both for data cleansing and for conditional extraction, according to the Golang templates syntax; the templating engine has the whole of the Sprig functions available;
- last, a table can have dependent tables (`relations`), which are weak entities related to the main one; relations are useful when a single line in a CSV, XLSX, YAML or JSON file actually embeds multiple entities, e.g. a host (`hostname`, `serial`, `ram`, `cpus`...) and its (possibly multiple) dependent NICs (`mac_address`, `ip_address`, `port_type`...).
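As noted in the first item above, JSON and YAML inputs must be arrays of objects; a hypothetical JSON equivalent of the CSV sample shown earlier would be:

```json
[
  { "color": "blue",  "value": "azure", "optimized": true,  "count": 1 },
  { "color": "black", "value": "coal",  "optimized": false, "count": 2 },
  { "color": "red",   "value": "ruby",  "optimized": true,  "count": 3 }
]
```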
Relations let you declare the main table (e.g. a `hosts` table) and the dependent entities (e.g. a `host_nics` table for the host's NICs) separately, and then have the plugin extract the different entities -- host and NICs, even with 1:N cardinality -- automatically, as sketched below.
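A minimal sketch of what such a spec could look like, modeled on the example above (the file name and columns are hypothetical, and the column lists are abridged):

```yaml
spec:
  file: ./hosts.csv
  format: csv
  separator: ","
  table:
    name: hosts
    columns:
      - name: hostname
        type: string
        key: true
        notnull: true
      - name: serial
        type: string
    relations:
      - name: host_nics
        columns:
          - name: mac_address
            type: string
            key: true
          - name: ip_address
            type: string
```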
Refer to the provided tests to see how this mechanism works.
The source plugin is largely feature-complete.
It may be useful to add support for URLs in the `file` field (which would probably be deprecated in favour of a more generic `source` property), so that the plugin retrieves the file at the given URL prior to importing it; support should extend to at least HTTP(S), FTP, S3 and Git URLs in addition to local `file://` paths.
The implementation could leverage HashiCorp's go-getter library.
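A minimal sketch of how the retrieval step might work with go-getter, assuming a hypothetical remote URL (go-getter natively handles `file://`, `http(s)://`, `s3://` and `git::` sources, among others):

```go
package main

import (
	"log"

	getter "github.com/hashicorp/go-getter"
)

func main() {
	// Hypothetical example: download the remote source to a local
	// scratch path before handing it to the existing file parser.
	src := "https://example.com/data/test.csv" // assumed remote location
	dst := "/tmp/test.csv"                     // local scratch path
	if err := getter.GetFile(dst, src); err != nil {
		log.Fatalf("cannot fetch %s: %v", src, err)
	}
	log.Printf("fetched %s to %s", src, dst)
	// From here on, the plugin would parse the local copy exactly as if
	// it had been specified directly in the `file` field.
}
```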
- Run `make test` to run the tests
- Run `make lint` to run the linter
- Run `make gen-docs` to regenerate the documentation
- Run `git tag v1.0.0` to create a new tag for the release (replace `v1.0.0` with the new version number)
- Run `git push origin v1.0.0` to push the tag to GitHub
Once the tag is pushed, a new GitHub Actions workflow will be triggered to build the release binaries and create the new release on GitHub. To customize the release notes, see the GoReleaser changelog configuration docs.