Skip to content

Latest commit

 

History

History
174 lines (132 loc) · 7.2 KB

File metadata and controls

174 lines (132 loc) · 7.2 KB

API Pipeline Manifest - Overview

The rag-api-pipeline utilizes Airbyte's CDK low-code framework to create source connectors for REST APIs. Under the hood, it generates a declarative stream manifest in YAML format using two specification files:

  1. A well-defined OpenAPI specification for the target REST API: most API providers publish their OpenAPI-based schema definitions in their site. In case it is unavailable, various tools are available online to help you generate or convert an API spec schemas into the required format.
  2. A source RAG API pipeline manifest: you'll learn how to define it in the next section.

RAG API Pipeline Manifest - Schema Definition

A base manifest template is available in the repository. The API pipeline manifest MUST comply with the following schema:

api_name

An alphanumeric name for the API pipeline.

api_parameters

Contains a list of parameters required for building API requests. These parameters with their values MUST be defined in the spec section. Parameter values are accessible throughout the manifest by using the config object .

api_config

Defines the following API metadata for building requests:

  • request_method: HTTP request method to use (e.g., "get")
  • content_type: API-supported content type (e.g., "application/json")
  • response_entrypoint_field: Field that wraps data records (e.g., "data"). Can be set to an empty string if not required

chunking_params

Parameters used to adjust how the pipeline applies chunking to the normalized data. A more detailed information about each parameter is available in the unstructured library documentation. Default parameter values are:

  • mode: elements
  • chunking_strategy: by_title
  • include_orig_elements: true
  • max_characters: 1500
  • new_after_n_chars: 1024
  • overlap: 0
  • overlap_all: false
  • combine_text_under_n_chars: 0
  • multipage_sections: true

Airbyte Connector Definitions

An Airbyte declarative manifest requires the following schema definitions:

spec

A source specification comprising connector metadata and configuration options (Docs). All parameters defined in the api_parameters section MUST also be listed under required and properties.

selector

The record selector is in change of translating an HTTP response into a list of Airbyte records (Docs). The API pipeline manifest template includes two base selectors for single/multiple record responses:

selector:
    type: RecordSelector
    extractor:
      type: DpathExtractor
      field_path: ["data"] # data field wraps multiple record data responses
selector_single:
    type: RecordSelector
    extractor:
        type: DpathExtractor
        field_path: []

requester_base

The Requester defines how to prepare HTTP requests for the source API (Docs). It specifies the API base_url and authenticator schema. Airbyte supports common authentication methods: ApiKeyAuthenticator, BearerAuthenticator, BasicHttpAuthenticator, and OAuth. Detailed configuration instructions for each method are available in this link

paginator

Defines the pagination strategy for API endpoints returning multiple records (Docs). Airbyte supports Page increment, Offset increment, and Cursor based pagination strategies. The "#/definitions/NoPagination" is automatically set for endpoints returning a single record.

retriever_base

The Retriever object defines how to fetch records through synchronous API requests (Docs). It is in charge of orchestrating the requester, record selector, and paginator. The API pipeline manifest template includes a base retriever for each selector:

retriever_base:
    type: SimpleRetriever
    record_selector:
      $ref: "#/definitions/selector"
retriever_single_base:
    type: SimpleRetriever
    record_selector:
      $ref: "#/definitions/selector_single"

API Endpoints Definitions

The endpoints section defines a whitelist of API endpoints from the all the paths defined in the OpenAPI spec. The pipeline will only extract data from endpoints defined here. Each endpoint should follow the following schema:

  • Endpoint path: can be enclosed in double quotes to inject parameters defined in api_parameters (e.g., "/example/{foo}")
  • id: string identifier for the endpoint
  • primary_key (Optional): a field to be used as primary key on each record
  • responseSchema: the schema returned by the API endpoint after applying the selector. It MUST an unwrapped schema when compared to the endpoint response schema defined in the OpenAPI spec file.
  • textSchema: set the list of fields that should be parsed as text inputs during the data chunking stage. Fields included here MUST be in the endpoint's responseSchema.

Example:

"/example/{foo}":
    id: "example"
    primary_key: "id"
    responseSchema: "#/schemas/exampleSchema"
    textSchema:
      $ref: "#/textSchemas/Example"
/about:
    id: "about"
    primary_key: "id"
    responseSchema: "#/schemas/aboutSchema"
    textSchema:
      $ref: "#/textSchemas/About"

schemas

Endpoint response schemas should be listed here and referenced in the endpoint's responseSchema. Unwrapped schemas must match response schemas defined in the OpenAPI spec.

Example:

Endpoint response:

{
    data: [
        {id: "0", title: "foo", createdAt: "2024-11-01T21:08:05.231Z"}, 
        {id: "0", title: "bar", createdAt: "2024-11-06T21:08:05.231Z"},
    ]
}

Endpoint schema:

schemas:
    exampleSchema:
        type: object
        $schema: http://json-schema.org/draft-07/schema#
        properties:
        id:
            type: string
        title:
            type: string
        createdAt:
            type: string

textSchemas

For each endpoint, you can specify the list of fields that should be be parsed as text inputs during data chunking stage. These can be defined within this section and referenced in the endpoint's textSchema. Fields included here MUST be in the endpoint's responseSchema.

Example:

Endpoint textSchema:

textSchemas:
    Example:
        type: object
        properties:
        title:
            type: string