Skip to content

Latest commit

 

History

History

mapping-generator

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 

The problem

We write mapping files by hand for now. These files are RDF files that follow the RML.io spec and that we write as JSON-LD.

  • These mapping files get big because rml in json-ld is quite verbose.

  • These mapping files refer often to a prefix that changes each time you re-upload the file. Or a predicate that we need to keep consistent accross mapping files.

  • These mapping files might contain errors that you only notice because a property is missing or empty in the final result. This is hard and time-consuming to check.

Expected result

We have an automated script that we feed a less verbose mapping file with less duplication. The script then validates that mapping file (ideally also against the uploaded file) and generates the actual json ld file.

The approach

  • We have achieved good results by using jsonnet to generate kubernetes json files. Kubernetes json files suffer from a lot of the same problems (verbosity and variables that are used all over the place). So to start we’ll write a jsonnet file that allows you to write a mapping file in a more succinct manner. This will also make sure that the json-ld is valid (jsonnet functions impose a more strict api then json)

  • We can then add a js script that uses the graphql api to validate that all columns that are referred to actually exist

  • We can then add a feature that reports back all properties used in a sorted list to quickly lift out typo’s.

  • we can then also replace all known prefixes so custom predicates stand out more.

A note on the examples

In the examples below I will use a very simple uploaded file that contains two collections. Persons and places.

Places looks like this:

id name country

1

Amsterdam

The Netherlands

2

Paris

France

Persons looks like this:

id firstName lastname birthplace

1

Jean

Martin

2

2

Jan

Jansen

1

The number in birthplace refers to the id of the row in place. So Jean Martin was born in Paris.

mappings

You start by importing the file rml.libsonnet

local rml = import "./rml.libsonnet";

You then call the mappings function with an array of `mapping`s

rml.mappings([
  rml.mapping(...),
  rml.mapping(...)
])

Each mapping call gets

  • the name of the mapping (used for joins)

  • the name of the file that you are reading the data from. This file name changes each upload so I suggest acquiring it through std.extVar (see example below)

  • the number of the collection that it refers to

  • a source (explained later on) for generating the subject uri

  • a source for generating the class uri

  • a hashmap of predicate uri’s and the Fields to generate the properties.

  rml.mapping("name_of_the_mapping", "filename", 9,
    rml.templateSource("http://example.org/things/{ID}"),
    rml.constantSource("http://example.org/types/thing),
    {
      "http://schema.org/name": rml.dataField(rml.types.string, rml.columnSource("label")),
    }
  ),

This mapping will then create the config that generates an object whose uri is http://example.org/things/{ID}, whose type is http://example.org/types/thing and that has a schema:name with the value of the column "label"

If you want to generate the same predicate multiple times (e.g. a thing has multiple schema:name’s) then you can specify an array of fields:

  rml.mapping("name_of_the_mapping", "filename", 9,
    rml.templateSource("http://example.org/things/{ID}"),
    rml.constantSource("http://example.org/types/thing),
    {
      "http://schema.org/name": [
        rml.dataField(rml.types.string, rml.columnSource("label")),
        rml.dataField(rml.types.string, rml.columnSource("alternate_label")),
      ],
    }
  ),

sources

In the mapping you can define how to extract data from the uploaded file using sources. A source can be one of four types:

templateSource

allows you to specify a string that contains column names in in curly braces like templateSource("Dear {firstName} {lastname},")

columnSource

that contains the name of a column: columnSource("country")

constantSource

that contains the string to use in the rdf: constantSource("male")

jexlSource

contains jexl code that has a variable called v that contains the columns. E.g. jexlSource('v.firstName != null ? "Dear " + v.firstName : "Mr. " + v.lastname + ","')

fields

A source is needed for generating the subject and the class value, but all properties require not just a value but also a description of what the value is. Therefore each source is wrapped in a field. The following fields are available:

iriField

the source should result in some IRI e.g. rml.iriField(rml.constantSource("http://example.org/foo"))

dataField

the source results in some string according the the datatype that is provided as the first argument. e.g. rml.dataField(rml.types.edtf, rml.constantSource("2018-01-??"))

joinField

see joins below

joins

To link two entities together, you simple generate an iri that is the same as the subject IRI of the other table. So given our example above we could have the following mapping

local rml = import "./rml.libsonnet";

rml.mappings([
    rml.mapping("PersonsMapping", "{{filename}}", 1,
      rml.templateSource(rml.datasetUri + "collection/Persons/{id}"),
      rml.constantSource(rml.datasetUri + "collection/Persons"),
      {
        "http://schema.org/givenName": rml.columnSource("firstName"),
        "http://schema.org/familyName": rml.columnSource("lastname"),
        "http://schema.org/birthPlace": rml.iriField(rml.templateSource(rml.datasetUri + "collection/Places/{birthplace}")),
      },
    ),
    rml.mapping("PlacesMapping", "{{filename}}", 2,
      rml.templateSource(rml.datasetUri + "collection/Places/{id}"),
      rml.constantSource(rml.datasetUri + "collection/Places"),
      {
        "http://schema.org/name": rml.columnSource("name")
      },
    )
  ])

But this only works if the IRI that we generate for the place happens to contain only the identifiers that we also have at our disposal in the persons collection!

What if we want to generate IRIs for the places that contain the placename? (assuming this results in unique IRI’s per row) rml.templateSource(rml.datasetUri + "collection/Places/{country}/{name}");

For this usecase the mapping allows you to refer to the subject IRI’s as generated by a different mapping using the joinField. The complete mapping would then be:

local rml = import "./rml.libsonnet";

rml.mappings([
    rml.mapping("PersonsMapping", "{{filename}}", 1,
      rml.templateSource(rml.datasetUri + "collection/Persons/{id}"),
      rml.constantSource(rml.datasetUri + "collection/Persons"),
      {
        "http://schema.org/givenName": rml.columnSource("firstName"),
        "http://schema.org/familyName": rml.columnSource("lastname"),
        "http://schema.org/birthPlace": rml.joinField(
          "birthplace", //the column in this collection that contains the value to match on (must be 1 column)
          "PlacesMapping", //the name of the mapping whose subject IRI's we're going to use
          "id" //the name of the column in the collection that's behind 'PlacesMapping' whose value we should match against
        ),
      },
    ),
    rml.mapping("PlacesMapping", "{{filename}}", 2,
      rml.templateSource(rml.datasetUri + "collection/Places/{country}/{name}"),
      rml.constantSource(rml.datasetUri + "collection/Places"),
      {
        "http://schema.org/name": rml.columnSource("name")
      },
    )
  ])