diff --git a/README.md b/README.md index c2966ee3..16c9587e 100644 --- a/README.md +++ b/README.md @@ -1,17 +1,167 @@ # gateway -### Documentation - -- [overview](docs/overview.md) -- [configuration](docs/configuration.md) -- [errors](docs/errors.md) -- [logs](docs/logs.md) -- [metrics](docs/metrics.md) -- [releases](docs/releases.md) -- [query fees](docs/query-fees.md) -- [incident response](docs/incident-response.md) - -### Contributing +## overview + +A gateway, in the Graph Network, is a client capable of managing relationships between many +data consumers and many indexers. A gateway is not necessary for a consumer to access the indexers +participating in the Graph Network, though it does simplify consumer interactions. + +A gateway is expected to be a reliable system, to compensate for indexers being relatively +unreliable. Indexers may become unresponsive, malicious, or otherwise unsuitable for serving queries +at any time without warning. It is the responsibility of the gateway to maintain the highest +possible quality of service (QoS) to clients under these conditions. + +The gateway's primary responsibilities are to serve client requests and to facilitate indexer +payments. Other responsibilities, though important, are secondary and therefore their failure modes +must have minimal impact on the primary responsibilities. + +## indexer discovery + +For the gateway to route client queries to indexers, it must be able to associate subgraph and +subgraph deployment IDs with the indexers that have active allocations on the subgraph deployment +being queried. The tree of subgraphs, subgraph deployments (versions), and allocated indexers is +accessible via the network subgraph, which indexes the Graph Network contracts. + +The gateway periodically queries the network subgraph for this data using a set of trusted indexers. +The trusted indexers are not theoretically necessary, but they avoid an otherwise cumbersome +bootstrapping process for payments.
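The association described above, from subgraph to deployments to allocated indexers, can be pictured as two lookup tables. The following is only a minimal sketch: the `NetworkIndex` type, its field names, and the use of plain strings for IDs are assumptions made for illustration, and do not reflect the gateway's actual data structures.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical in-memory view of the data read from the network subgraph.
#[derive(Debug, Default)]
pub struct NetworkIndex {
    /// subgraph ID -> deployment IDs (versions), ordered oldest to newest
    pub subgraphs: BTreeMap<String, Vec<String>>,
    /// deployment ID -> addresses of indexers with an active allocation
    pub allocations: BTreeMap<String, BTreeSet<String>>,
}

impl NetworkIndex {
    /// Deployments (versions) known for a subgraph, oldest first.
    pub fn deployments_for_subgraph(&self, subgraph: &str) -> &[String] {
        self.subgraphs
            .get(subgraph)
            .map(|versions| versions.as_slice())
            .unwrap_or(&[])
    }

    /// Indexers that currently have an allocation on the given deployment.
    pub fn indexers_for_deployment(&self, deployment: &str) -> Vec<&str> {
        self.allocations
            .get(deployment)
            .map(|indexers| indexers.iter().map(String::as_str).collect())
            .unwrap_or_default()
    }
}
```

A periodic poll of the network subgraph would rebuild or update such a structure, and query routing would then only need read access to it.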
+ +When an indexer registers itself via the contract, it provides a URL to access its indexer-service. +After the subgraph data is collected and organized, the gateway requests more information from each +active indexer via the indexer-service. This includes software version information and, for each +allocation, the indexing status (progress on chain indexed by subgraph deployment) and cost models. + +The gateway may be configured to block public Proofs Of Indexing (POIs) that have been associated +with bad query responses. In this case, deployments with such POIs require an additional step in +discovery where indexers are required to submit their public POI. If the indexer's POI has been +blocked, the indexer will not be considered to serve queries on the associated deployment until it +returns a good POI in a subsequent request. + +## auth + +The gateway requires client requests (queries) to include an API key which associates the request +with some consumer to track usage for payment to the gateway operator. The API key may have +additional settings or restrictions that are checked before executing or rejecting the request. + +## queries + +Request paths can take 3 general shapes: + +- subgraph ID (from GNS contract) +- deployment ID (IPFS hash of manifest) +- deployment ID & indexer address + +Requests by subgraph ID must first be resolved to some deployment ID. The gateway selects the latest +deployment where some indexer reports an indexing status within 30 minutes of chain head. If no +deployment meets this requirement, the latest deployment is selected. + +Requests specifying an indexer address are only intended to facilitate cross-checking indexer +responses. Using this option for production data requests is not guaranteed to behave as expected. +The rest of this section will assume that an indexer address has not been provided in the request.
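The rule for resolving a subgraph ID to a deployment ID, described earlier in this section, can be sketched as follows. This is an illustration under stated assumptions rather than the gateway's implementation: the `Deployment` type, the `min_seconds_behind` field, and the choice to treat a lag of exactly 30 minutes as acceptable are all hypothetical.

```rust
/// Hypothetical summary of one deployment (version) of a subgraph.
#[derive(Debug, Clone, PartialEq)]
pub struct Deployment {
    pub id: String,
    /// Smallest lag behind chain head, in seconds, reported by any allocated
    /// indexer. `None` when no indexer has reported an indexing status.
    pub min_seconds_behind: Option<u64>,
}

/// 30 minutes, expressed in seconds.
pub const MAX_LAG_SECONDS: u64 = 30 * 60;

/// Pick the deployment to serve for a subgraph.
/// `versions` is assumed to be ordered oldest to newest.
pub fn resolve_deployment(versions: &[Deployment]) -> Option<&Deployment> {
    versions
        .iter()
        .rev() // newest first
        .find(|d| d.min_seconds_behind.map_or(false, |lag| lag <= MAX_LAG_SECONDS))
        // No version is close enough to chain head: fall back to the latest.
        .or_else(|| versions.last())
}
```

With versions `v1` (10 seconds behind), `v2` (60 seconds behind), and `v3` (2 hours behind), this sketch picks `v2`: it is the newest version that some indexer has indexed to within 30 minutes of chain head.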
+ +The client request may be rewritten by the gateway to get additional data required by the gateway +to track the progress of each indexer relative to the indexed chain. The request containing the +client request and potentially additional data is called the "indexer request". + +A subset of up to 3 indexers will be selected to execute the indexer request. These indexers are +selected based on some combination of the following criteria (implementation at https://github.com/edgeandnode/candidate-selection): + +- success rate +- expected latency +- seconds behind chain head +- slashable GRT +- fee requested from indexer cost model (relative to gateway budget) + +The first response from an indexer (that passes through some additional filters) is returned to the +client, after stripping out data not requested by the client. All indexer responses are used to feed +back performance information into the indexer selection algorithm. If all selected indexers fail to respond +to the request, then this process is repeated until all available indexers are exhausted. + +## data science + +The gateway exports data into 3 Kafka topics: + +- client requests (`gateway_client_query_results`) +- indexer requests (`gateway_indexer_attempts`) +- attestations (`gateway_attestations`) + +## indexer payments + +The gateway serves its budget per indexer request, in USD, at `/budget`. Indexers make their prices +available via Agora cost-models. These cost models are served, for each subgraph deployment, by +indexer-service at `/cost`. When selecting indexers, the gateway first executes their cost models +over the indexer request to obtain each indexer's fee. Indexer selection will favor indexers with +lower fees, all else being equal. The gateway has a control system that may pay indexers more than +they request via their cost models in an effort to hit an average of `budget` fees per client query. +Indexer fees are clamped to a maximum of the gateway's budget.
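As a rough illustration of the last point, the sketch below clamps each cost model fee to the budget and orders candidates from cheapest to most expensive. It is not the gateway's selection logic (see the candidate-selection repository for that), and it does not model the control system mentioned above. The `Candidate` type, the function names, and the assumption that fee and budget are integers in the same smallest currency unit are hypothetical.

```rust
/// Hypothetical candidate: an indexer and the fee produced by executing its
/// cost model over the indexer request.
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub indexer: String,
    pub cost_model_fee: u128,
}

/// Clamp a fee to a maximum of the gateway's budget.
/// Both values are assumed to use the same unit.
pub fn clamp_fee(cost_model_fee: u128, budget: u128) -> u128 {
    cost_model_fee.min(budget)
}

/// Clamped fees, cheapest first. This ordering only captures the
/// "all else being equal" case; the other selection criteria are ignored.
pub fn fees_cheapest_first(candidates: &[Candidate], budget: u128) -> Vec<(String, u128)> {
    let mut fees: Vec<(String, u128)> = candidates
        .iter()
        .map(|c| (c.indexer.clone(), clamp_fee(c.cost_model_fee, budget)))
        .collect();
    // Stable sort: candidates with equal fees keep their input order.
    fees.sort_by_key(|(_, fee)| *fee);
    fees
}
```

For a budget of 10, an indexer asking for 40 would be paid at most 10, while an indexer asking for 3 would be ordered ahead of it.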
+ +### TAP + +For an overview of TAP see https://github.com/semiotic-ai/timeline-aggregation-protocol. + +The gateway acts as a TAP sender, where each indexer request is sent with a TAP receipt. The gateway +operator is expected to run 2 additional services: + +- [tap-aggregator](https://github.com/semiotic-ai/timeline-aggregation-protocol/tree/main/tap_aggregator): + public endpoint where indexers can aggregate receipts into RAVs +- [tap-escrow-manager](https://github.com/edgeandnode/tap-escrow-manager): + maintains escrow balances for the TAP sender. This service requires data exported by the gateway + into the "indexer requests" topic to calculate the value of outstanding receipts to each indexer. + +The gateway operator is also expected to manage at least 2 wallets: + +- sender: requires ETH for transaction gas and GRT to allocate into TAP escrow balances for paying indexers +- authorized signer: used by the gateway and tap-aggregator to sign receipts and RAVs + +### Scalar + +The Timeline Aggregation Protocol (TAP) significantly reduces the requirement for indexers to trust +the gateway to collect the payments they are owed. More details [here](https://github.com/semiotic-ai/timeline-aggregation-protocol). +For this reason, the original Scalar payment system is being phased out. + +## operational notes + +### configuration + +Nearly all configuration is done via a single JSON configuration file, the path of which must be +given as the first argument to the graph-gateway executable. +e.g. `graph-gateway path/to/config.json`. The structure of the configuration file is defined in +[config.rs](src/config.rs) (`graph_gateway::config::Config`). + +Log filtering is set using the `RUST_LOG` environment variable. For example, if you would like to +set the default log level to `info`, but want to set the log level for the `graph_gateway` module to +`debug`, you would use `RUST_LOG="info,graph_gateway=debug"`. 
More details on environment variable +filtering: https://docs.rs/tracing-subscriber/latest/tracing_subscriber/filter/struct.EnvFilter.html. + +### errors + +Each error that can be returned to the client when making a request is defined in [errors.rs](src/errors.rs) (`gateway_framework::errors::Error`). + +### logs + +Log events are emitted for each client request and all of its associated indexer requests. These can +be found using the span label `client_request`. The indexer request log events also contain the +label `indexer_request`. + +log levels: + +- `error`: An unexpected state has been reached, which is likely to have a negative impact on the + gateway's ability to serve queries or make payments. +- `warn`: An unexpected state has been reached, though it is recoverable and unlikely to have a + negative impact on the gateway's ability to serve queries or make payments. +- `info`: Information that is commonly used to trace the execution of the major gateway subsystems + in production. +- `debug`: Similar to `info`, but is often irrelevant when investigating gateway execution in + production. +- `trace`: Information that is considered too verbose for production, but is often useful during + development. + +### metrics + +Prometheus metrics are served at `:${METRICS_PORT}/metrics`. +The available metrics are defined in [metrics.rs](src/metrics.rs). + +## Contributing The gateway is an open-source project and we welcome contributions. Please see our [contributing guide](docs/contributing.md) for more information. diff --git a/docs/configuration.md b/docs/configuration.md deleted file mode 100644 index 79ee952c..00000000 --- a/docs/configuration.md +++ /dev/null @@ -1,5 +0,0 @@ -# Configuration - -Nearly all configuration is done via a single JSON configuration file, the path of which must be given as the first argument to the graph-gateway executable. e.g. `graph-gateway path/to/config.json`.
The structure of the configuration file is defined in [config.rs](../src/config.rs) (`graph_gateway::config::Config`). - -Logs filtering is set using the `RUST_LOG` environment variable. For example, if you would like to set the default log level to `info`, but want to set the log level for the `graph_gateway` module to `debug`, you would use `RUST_LOG="info,graph_gateway=debug"`. More details on evironment variable filtering: https://docs.rs/tracing-subscriber/latest/tracing_subscriber/filter/struct.EnvFilter.html. diff --git a/docs/errors.md b/docs/errors.md deleted file mode 100644 index 8318e851..00000000 --- a/docs/errors.md +++ /dev/null @@ -1,7 +0,0 @@ -# Errors - -Error messages are given to users when the gateway is unable to return a "suitable" response, from an indexer, to the client. - -Each specific error that be returned to the client when making a subgraph query is defined in [errors.rs](../src/errors.rs) (`gateway_framework::errors::Error`). - -Under normal conditions, outside of user error, regressions in gateway performance may be identified using the details provided in the `BadIndexers` error. This message includes a list of indexer errors encountered, in descending order of how many of the potential indexers failed for that reason. diff --git a/docs/incident-response.md b/docs/incident-response.md deleted file mode 100644 index b24b2790..00000000 --- a/docs/incident-response.md +++ /dev/null @@ -1,64 +0,0 @@ -# Incident Response - -## Notes - -- This document assumes logs can be filtered using [Loki LogQL](https://grafana.com/docs/loki/latest/query/log_queries/). See [logs.md](./logs.md) for a list of the log fields available. 
- -- This doc uses the [Graph Network Arbitrum subgraph](https://thegraph.com/explorer/subgraphs/DZz4kDTdmzWLWsV373w2bSmoar3umKKH9y82SUKr5qmp?view=About&chain=arbitrum-one) for examples: - - `subgraph_id: DZz4kDTdmzWLWsV373w2bSmoar3umKKH9y82SUKr5qmp` - - `deployment_id: QmSWxvd8SaQK6qZKJ7xtfxCCGoRzGnoi2WNzmJYYJW9BXY` - -## Common Log Queries - -- `|= "client_request" |= "result" != "Ok"` -- `|= "indexer_request" |= "result" != "Ok"` -- `|= "client_request" |= "indexer_errors" != "{}"` - -## Scenarios - -### No Indexer Available to Serve Client Query - -- check indexer errors: - - ```ts - |~ "DZz4kDTdmzWLWsV373w2bSmoar3umKKH9y82SUKr5qmp|QmSWxvd8SaQK6qZKJ7xtfxCCGoRzGnoi2WNzmJYYJW9BXY" - |= "client_request" |= "indexer_errors" != "{}" - ``` - -- check for failed client queries: - - ```ts - |~ "DZz4kDTdmzWLWsV373w2bSmoar3umKKH9y82SUKr5qmp|QmSWxvd8SaQK6qZKJ7xtfxCCGoRzGnoi2WNzmJYYJW9BXY" - |= "client_request" |= "result" != "Ok" - ``` - -For indexers, note that automated allocation management might not allocate to a subgraph deployment if it doesn’t meet requirements like minimum signal. - -### Bad/Inconsistent Query Responses - -- Graphix is a useful tool to check if allocated indexers have divergent POIs, and might indicate which indexers are delivering the bad responses. - - If tools like Graphix are not available, you can query the relevant indexers manually to get their POIs: - - ```bash - curl ${indexer_url}/status \ - -H 'content-type: application/json' \ - -d '{"query": "{ publicProofsOfIndexing(requests: [{deployment: \"${deployment}\" blockNumber: ${block_number}}]) { deployment proofOfIndexing block { number } } }"}' - ``` - -- If a POI is identified that should be blocked, it should be added to the gateway config’s `poi_blocklist`. - -- Does the query rely on a graph-node feature that is unsupported? - - This is an open problem, see issue #526. - -- As a worst-case measure, the gateway config also includes an indexer blocklist `bad_indexers`. 
This should be used temporarily, and with caution. - -### Degraded performance on multiple subgraphs - -- `|= "ERROR"`: error logs may show negative impacts on the gateway's ability to serve queries or make payments. - - `|= "poll_subgraphs_err"`: failures to poll asubgraph (potentially the network subgraph) - -### Other - -- For other scenarios, it may be useful to identify a query where some issue occurred. Then filter for all logs containing the corresponding `request_id`. diff --git a/docs/logs.md b/docs/logs.md deleted file mode 100644 index bd78ece1..00000000 --- a/docs/logs.md +++ /dev/null @@ -1,13 +0,0 @@ -# Logs - -## Log Levels - -- `error`: An unexpected state has been reached, which is likely to have a negative impact on the gateway's ability to serve queries or make payments. -- `warn`: An unexpected state has been reached, though it is recoverable and unlikely to have a negative impact on the gateway's ability to serve queries or make payments. -- `info`: Information that is commonly used to trace the execution of the major gateway subsystems in production. -- `debug`: Similar to `info`, but is often irrelevant when investigating gateway execution in production. -- `trace`: Information that is considered too verbose for production, but is often useful during development. - -## client requests - -Log events are emitted for each client request and all of its associated indexer requests. These can be found using the span label `client_request`. The indexer request log events also contain the label `indexer_request`. diff --git a/docs/metrics.md b/docs/metrics.md deleted file mode 100644 index 2c1298ab..00000000 --- a/docs/metrics.md +++ /dev/null @@ -1,5 +0,0 @@ -# Metrics - -Prometheus metrics are served at `:${METRICS_PORT}/metrics` - -The available metrics are defined in [metrics.rs](../src/metrics.rs). 
diff --git a/docs/overview.md b/docs/overview.md deleted file mode 100644 index eaa004f6..00000000 --- a/docs/overview.md +++ /dev/null @@ -1,22 +0,0 @@ -# Overview - -At a high level the gateway does 2 things: - -1. Route client requests to indexers -2. Facilitate payments to indexers - -## Query Lifecycle - -1. The client GraphQL request arrives, including an auth token (API key or query key). The auth token is used to check associated allowlists, payment status, etc. -2. Indexers are selected from the set allocated to the subgraph deployment being queried. For queries by subgraph ID (GNS ID), indexers are selected across the allocations on all associated deployments (subgraph versions). -3. A subset of up to 3 indexers are selected based on a variety of selection factors including reliability, latency, subgraph version (if applicable), etc. -4. The request is made deterministic by replacing block numbers with hashes for the chain being indexed by the subgraph. -5. The request is forwarded to each selected indexer. -6. Each indexer’s response, latency, etc. is fed back into indexer selection. -7. The first valid indexer response is returned it to the client. If no indexers return a valid response goto step 3. - -## Design Principles - -- The gateway is designed to be a reliable system, to compensate for indexers being relatively unreliable. Indexers may become unresponsive, malicious, or otherwise unsuitable for serving queries at any time without warning. It is the responsibility of the gateway to maintain the highest possible quality of service(QoS) under these conditions. - -- The gateway's primary responsibilities are to serve client requests and to facilitate indexer payments. Other responsibilities, though important, are secondary and therefore their failure modes must have minimal impact on the primary responsibilities. 
diff --git a/docs/query-fees.md b/docs/query-fees.md deleted file mode 100644 index 08aa9aba..00000000 --- a/docs/query-fees.md +++ /dev/null @@ -1,13 +0,0 @@ -# Query Fees - -## Initial Notes - -- In most contexts a "query" or "client query", refers to a single GraphQL HTTP request from the client. However, [Agora](https://github.com/graphprotocol/agora) cost models define a query as a top level selection for the operation being executed ([spec](https://spec.graphql.org/October2021/#sec-Selection-Sets)). The rationale is that this roughly translates to the amount of SQL queries made by graph-node to execute the query, at the time Agora was designed. This may seem like a useful measure for a query's computational complexity, but is has been shown to be practically unrelated to "real" query cost. - -## Introduction - -The gateway serves its budget per client query, in USD, at `/budget`. Indexers make their prices available via Agora cost-models. These cost models are served, for each subgraph deployment, by indexer-service at `/cost`. When selecting indexers, the gateway first executes their cost models over the client query to obtain each indexer's fee. Indexer selection will favor indexers with lower fees, all else being equal. Indexer fees are clamped to a maximum of the gateway's budget. - -## Implementation Details - -The gateway has a control system that may pay indexers more than they request via their cost models in an effort to hit an average of `budget` fees per client query. diff --git a/docs/releases.md b/docs/releases.md deleted file mode 100644 index 55fdcbc5..00000000 --- a/docs/releases.md +++ /dev/null @@ -1,15 +0,0 @@ -# Releases - -1. Test the main branch using the [edgeandnode/local-network](https://github.com/edgeandnode/local-network) - - Make sure that the gateway returns a correct response to a valid query - - Run any other ad-hoc tests that are appropriate for the set of changes made since the last release -2. 
Open a PR for the new release on [edgeandnode/graph-gateway](https://github.com/edgeandnode/graph-gateway) - - Version the release based on [SemVer](https://semver.org/). The following trigger a major version bump: - - Breaking changes to the configuration file - - Set the new version in `Cargo.toml`, and run `cargo update` - - Include release notes for changes since the last release. See past releases for format. - - Rebase & Merge the PR - - Create a new release via GitHub - - Include the release notes from the PR - - Tag the commit with the version string, prefixed with a `v` -