The rag-api-pipeline uses Airbyte's CDK low-code framework to create source connectors for REST APIs.
Under the hood, it generates a declarative stream manifest in YAML format using
two specification files:
- A well-defined OpenAPI specification for the target REST API: most API providers publish their OpenAPI-based schema definitions on their site. If one is unavailable, various online tools can help you generate an API spec or convert an existing one into the required format.
- A source RAG API pipeline manifest: you'll learn how to define it in the next section.
A base manifest template is available in the repository. The API pipeline manifest MUST comply with the following schema:
An alphanumeric name for the API pipeline.
Contains a list of parameters required for building API requests. These parameters and their values MUST be defined in the spec section. Parameter values are accessible throughout the manifest through the config object.
Defines the following API metadata for building requests:
- request_method: HTTP request method to use (e.g., "get")
- content_type: API-supported content type (e.g., "application/json")
- response_entrypoint_field: Field that wraps data records (e.g., "data"). Can be set to an empty string if not required
Parameters used to adjust how the pipeline applies chunking to the normalized data. More detailed information about each parameter is available in the unstructured library documentation. Default parameter values are:
- mode: elements
- chunking_strategy: by_title
- include_orig_elements: true
- max_characters: 1500
- new_after_n_chars: 1024
- overlap: 0
- overlap_all: false
- combine_text_under_n_chars: 0
- multipage_sections: true
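As a rough illustration of how these defaults might combine with user-supplied values, the sketch below keeps them in a plain dictionary and applies overrides on top. The names `DEFAULT_CHUNKING_PARAMS` and `resolve_chunking_params` are hypothetical and are not taken from the pipeline's code; rejecting unknown keys is also an assumption made for the example.

```python
# Hypothetical helper, not part of rag-api-pipeline: merges user overrides
# into the default chunking parameters listed above.
DEFAULT_CHUNKING_PARAMS = {
    "mode": "elements",
    "chunking_strategy": "by_title",
    "include_orig_elements": True,
    "max_characters": 1500,
    "new_after_n_chars": 1024,
    "overlap": 0,
    "overlap_all": False,
    "combine_text_under_n_chars": 0,
    "multipage_sections": True,
}


def resolve_chunking_params(overrides=None):
    """Return the defaults with any user-supplied overrides applied."""
    overrides = overrides or {}
    # Assumption for this sketch: unknown keys are treated as mistakes.
    unknown = set(overrides) - set(DEFAULT_CHUNKING_PARAMS)
    if unknown:
        raise ValueError(f"Unknown chunking parameters: {sorted(unknown)}")
    return {**DEFAULT_CHUNKING_PARAMS, **overrides}


params = resolve_chunking_params({"max_characters": 1000, "overlap": 50})
```

Only the overridden keys change; every other parameter keeps its default value.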
An Airbyte declarative manifest requires the following schema definitions:
A source specification comprising connector metadata and configuration options (Docs).
All parameters defined in the api_parameters section MUST also be listed under required and properties.
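The rule above can be checked mechanically. The following is a minimal sketch, assuming the manifest has already been loaded into Python dictionaries; the function name `missing_from_spec` and the sample parameter names (`api_key`, `foo`) are hypothetical.

```python
def missing_from_spec(api_parameters, connection_spec):
    """Return the parameter names absent from `required` or `properties`."""
    required = set(connection_spec.get("required", []))
    properties = set(connection_spec.get("properties", {}))
    return sorted(
        name
        for name in api_parameters
        if name not in required or name not in properties
    )


# Hypothetical manifest fragments, for illustration only.
api_parameters = {"api_key": "secret", "foo": "bar"}
connection_spec = {
    "required": ["api_key"],  # "foo" is missing here
    "properties": {
        "api_key": {"type": "string"},
        "foo": {"type": "string"},
    },
}

problems = missing_from_spec(api_parameters, connection_spec)
```

An empty result means every parameter is declared in both places.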
The record selector is in charge of translating an HTTP response into a list of Airbyte records (Docs). The API pipeline manifest template includes two base selectors, one for multiple-record responses and one for single-record responses:
selector:
  type: RecordSelector
  extractor:
    type: DpathExtractor
    field_path: ["data"] # data field wraps multiple record data responses
selector_single:
  type: RecordSelector
  extractor:
    type: DpathExtractor
    field_path: []
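To make the effect of `field_path` concrete, here is a simplified stand-in written in plain Python. It is not Airbyte's DpathExtractor implementation, which has more features; the function name `extract_records` and the sample payloads are assumptions made for illustration.

```python
def extract_records(response, field_path):
    """Walk `field_path` into `response` and return the result as a list."""
    node = response
    for key in field_path:
        node = node[key]
    # A single object becomes a one-element list so callers always get a list.
    return node if isinstance(node, list) else [node]


# Multiple records wrapped in a "data" field, as with the `selector` above.
multi = {"data": [{"id": "0", "title": "foo"}, {"id": "1", "title": "bar"}]}
# A single record at the response root, as with `selector_single`.
single = {"id": "0", "title": "foo"}

records_multi = extract_records(multi, ["data"])
records_single = extract_records(single, [])
```

With `["data"]` the wrapper is stripped and the inner list is returned; with an empty path the whole response is treated as one record.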
The Requester defines how to prepare HTTP requests for the source API (Docs).
It specifies the API base_url and authenticator schema. Airbyte supports common authentication methods: ApiKeyAuthenticator, BearerAuthenticator, BasicHttpAuthenticator, and OAuth.
Detailed configuration instructions for each method are available at this link.
Defines the pagination strategy for API endpoints returning multiple records (Docs).
Airbyte supports Page increment, Offset increment, and Cursor-based pagination strategies. The "#/definitions/NoPagination" definition is automatically set for endpoints returning a single record.
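The difference between the page-increment and offset-increment strategies can be pictured with the sketch below. It only generates the query parameters for successive requests; the parameter names `page`, `offset` and `limit` are hypothetical, since each API chooses its own, and cursor-based pagination is left out because it depends on values returned by the API.

```python
from itertools import islice


def page_increment_params(page_size, start_page=1):
    """Yield query parameters for successive pages: page 1, 2, 3, ..."""
    page = start_page
    while True:
        yield {"page": page, "limit": page_size}
        page += 1


def offset_increment_params(page_size):
    """Yield query parameters for successive offsets: 0, size, 2*size, ..."""
    offset = 0
    while True:
        yield {"offset": offset, "limit": page_size}
        offset += page_size


first_pages = list(islice(page_increment_params(50), 3))
first_offsets = list(islice(offset_increment_params(50), 3))
```

Both generators are unbounded on purpose: a real paginator also needs a stop condition, such as a response containing fewer records than the page size.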
The Retriever object defines how to fetch records through synchronous API requests (Docs).
It is in charge of orchestrating the requester, record selector, and paginator. The API pipeline manifest template includes a base retriever for each selector:
retriever_base:
  type: SimpleRetriever
  record_selector:
    $ref: "#/definitions/selector"
retriever_single_base:
  type: SimpleRetriever
  record_selector:
    $ref: "#/definitions/selector_single"
The endpoints section defines a whitelist of API endpoints from all the paths defined in the OpenAPI spec. The pipeline will only extract data from endpoints defined here.
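The whitelist behaviour can be sketched as a simple filter over the OpenAPI paths. This is an illustration under stated assumptions rather than the pipeline's code: `select_endpoints` is a hypothetical name, and raising an error for a whitelisted path that the OpenAPI spec does not define is a choice made for this example.

```python
def select_endpoints(openapi_paths, manifest_endpoints):
    """Keep only the OpenAPI paths that the manifest whitelists."""
    missing = [path for path in manifest_endpoints if path not in openapi_paths]
    if missing:
        # Assumed behaviour: a whitelisted path must exist in the OpenAPI spec.
        raise KeyError(f"Paths not found in the OpenAPI spec: {missing}")
    return {path: openapi_paths[path] for path in manifest_endpoints}


# Hypothetical inputs for illustration.
openapi_paths = {
    "/example/{foo}": {"get": {}},
    "/about": {"get": {}},
    "/internal/stats": {"get": {}},
}
manifest_endpoints = {
    "/example/{foo}": {"id": "example"},
    "/about": {"id": "about"},
}

selected = select_endpoints(openapi_paths, manifest_endpoints)
```

Here `/internal/stats` exists in the OpenAPI spec but is dropped because the manifest does not list it.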
Each endpoint should follow this schema:
- Endpoint path: can be enclosed in double quotes to inject parameters defined in api_parameters (e.g., "/example/{foo}")
- id: string identifier for the endpoint
- primary_key (Optional): a field to be used as primary key on each record
- responseSchema: the schema returned by the API endpoint after applying the selector. It MUST be an unwrapped schema when compared to the endpoint response schema defined in the OpenAPI spec file.
- textSchema: the list of fields that should be parsed as text inputs during the data chunking stage. Fields included here MUST be in the endpoint's responseSchema.
Example:
"/example/{foo}":
  id: "example"
  primary_key: "id"
  responseSchema: "#/schemas/exampleSchema"
  textSchema:
    $ref: "#/textSchemas/Example"
/about:
  id: "about"
  primary_key: "id"
  responseSchema: "#/schemas/aboutSchema"
  textSchema:
    $ref: "#/textSchemas/About"
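One way to picture the parameter injection in a path such as "/example/{foo}" is plain string formatting, as sketched below. How the pipeline performs the substitution internally is not stated here, so treat this as an assumption; `resolve_endpoint_path` and the value "bar" are hypothetical.

```python
def resolve_endpoint_path(path_template, api_parameters):
    """Fill `{name}` placeholders in a path with values from api_parameters."""
    # str.format ignores unused keyword arguments and raises KeyError
    # when a placeholder has no matching parameter.
    return path_template.format(**api_parameters)


api_parameters = {"foo": "bar"}  # hypothetical parameter value

example_path = resolve_endpoint_path("/example/{foo}", api_parameters)
about_path = resolve_endpoint_path("/about", api_parameters)
```

Paths without placeholders, such as /about, pass through unchanged.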
Endpoint response schemas should be listed here and referenced in the endpoint's responseSchema. Unwrapped schemas must match response schemas defined in the OpenAPI spec.
Example:
Endpoint response:
{
  data: [
    {id: "0", title: "foo", createdAt: "2024-11-01T21:08:05.231Z"},
    {id: "0", title: "bar", createdAt: "2024-11-06T21:08:05.231Z"},
  ]
}
Endpoint schema:
schemas:
  exampleSchema:
    type: object
    $schema: http://json-schema.org/draft-07/schema#
    properties:
      id:
        type: string
      title:
        type: string
      createdAt:
        type: string
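The sketch below illustrates why the schema describes the unwrapped record rather than the full response. It is a deliberately small check, not a JSON Schema validator: `record_matches_schema` is a hypothetical helper, and requiring every declared property to be present is an assumption made for this example.

```python
# Maps the JSON Schema type names used in this sketch to Python types.
JSON_TYPES = {"string": str, "object": dict, "array": list, "boolean": bool}


def record_matches_schema(record, schema):
    """Check that each declared property exists and has the declared type."""
    for name, spec in schema.get("properties", {}).items():
        if name not in record:
            return False  # assumption: every declared property is present
        expected = JSON_TYPES.get(spec.get("type"))
        if expected is not None and not isinstance(record[name], expected):
            return False
    return True


example_schema = {
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "title": {"type": "string"},
        "createdAt": {"type": "string"},
    },
}
response = {
    "data": [
        {"id": "0", "title": "foo", "createdAt": "2024-11-01T21:08:05.231Z"},
        {"id": "0", "title": "bar", "createdAt": "2024-11-06T21:08:05.231Z"},
    ]
}

records_ok = all(record_matches_schema(r, example_schema) for r in response["data"])
wrapped_ok = record_matches_schema(response, example_schema)
```

The records inside `data` satisfy the schema, while the wrapped response does not, because its only top-level field is `data`.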
For each endpoint, you can specify the list of fields that should be parsed as text inputs during the data chunking stage.
These can be defined within this section and referenced in the endpoint's textSchema. Fields included here MUST be in the endpoint's responseSchema.
Example:
Endpoint textSchema:
textSchemas:
  Example:
    type: object
    properties:
      title:
        type: string
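The requirement that textSchema fields also appear in the responseSchema amounts to a subset check. A minimal sketch follows, assuming both schemas are available as Python dictionaries; the helper name `text_fields_not_in_response` is hypothetical.

```python
def text_fields_not_in_response(text_schema, response_schema):
    """Return textSchema fields that the responseSchema does not declare."""
    text_fields = set(text_schema.get("properties", {}))
    response_fields = set(response_schema.get("properties", {}))
    return sorted(text_fields - response_fields)


response_schema = {
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "title": {"type": "string"},
        "createdAt": {"type": "string"},
    },
}
text_schema = {"type": "object", "properties": {"title": {"type": "string"}}}

invalid_fields = text_fields_not_in_response(text_schema, response_schema)
```

An empty list means the textSchema is consistent with the responseSchema; any name returned points to a field that would have nothing to chunk.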