Indexing design for w3s #29

Open: wants to merge 31 commits into main (changes from 10 commits shown)
`rfc/indexing-design.md` (234 additions)
# W3 Indexing Design
> Replaces [W3 IPNI Indexing RFC](https://github.com/w3s-project/RFC/blob/main/rfc/ipni-w3c.md)

## Authors

- [gammazero]
- [gozala]

## Goals

A user can use a multihash that identifies data stored on w3s to look up all the information necessary to retrieve and delegate access to that data.

Location commitments that describe where content is stored can change over time as w3s moves data to a more favorable location, re-aggregates data storage, or replicates stored data. Location commitments are updated without having to re-index all multihashes.

A query API will allow a user to specify indexed content to lookup and how to filter results. The API retrieves previously cached responses or composes new IPNI queries to get new responses to filter and cache.

## Overview

A user stores content in shards of opaque blobs that contain a series of blocks addressed by multihashes. Blobs are sets of blocks in CAR file format (although they could use a different format in the future) and are stored at a location designated for a particular user's data.

The agent uploading user content builds a UnixFS IPLD DAG and encodes it into one or more blobs. It also derives a [sharded-dag-index](https://github.com/w3s-project/specs/blob/main/w3-index.md#sharded-dag-index-example) that describes the blob in terms of (multihash, offset, length) tuples for all the blocks. This enables the user to make choices about which blocks to make discoverable and retrievable over the network. For example, they may want only the DAG root to be indexed.

The blob-index is stored in the user's space along with the CAR file(s) containing the blob data. This is done in the [w3 add-blob operation](https://github.com/w3s-project/specs/blob/main/w3-blob.md#add-blob). The w3s service creates [location commitments](https://github.com/w3s-project/specs/blob/main/w3-blob.md#location-commitment) for the blob shards and for the blob-index.
**Contributor:**

🤔 This got me thinking that perhaps we don't have to require separate blob/add for the index and allow encoding it along with the content shard itself.

**Contributor (author):**

One reason to keep them separate is because the index describes all shards, and shards may be stored in separate locations. If only the 3rd shard is needed then only the index and that shard need to be read.

It may be acceptable to encode the index as part of the first shard... but then how does the client know how much of the first shard to read? Would there be a fixed-length header?

**Member:**

Isn't it a recursive issue - you can't create an index for a CAR that includes the index...


After the add-blob process completes, and the user receives an [accept-blob receipt](https://github.com/w3s-project/specs/blob/main/w3-blob.md#accept-blob-receipt), the user may then choose to make their data publicly queryable by publishing it to W3UP's indexing system. The user optionally invokes an [index-add](https://github.com/w3s-project/specs/blob/main/w3-index.md#index-add) capability to publish the blob-index multihashes into W3UP's indexing system and eventually to [IPNI](https://github.com/ipni/specs/blob/main/IPNI.md) so that they can be used to look up location commitments and retrieve blob-index information and blob data. See [W3 Index](https://github.com/w3s-project/specs/blob/main/w3-index.md#w3-index) for more.

After publishing an index, any user looking for content can query for the blob-index of that content. The user reads the blob-index to determine which data shard needs to be retrieved, and then asks for the location of that shard.

Finally, the user uses the shard location, together with range information from the blob-index, to retrieve a specific piece of data from w3s.

## Index Query

A w3s client can query indexed information by specific attributes, and can filter retrieved query results.

The queryable attributes are as follows.
- Site: This lets the query ask for results where data is stored at specified site(s), i.e. query can specify "where site in [X]".
**Member:**

"lets the query ask" does not read well to me - a query is a question.

Suggested change
- Site: This lets the query ask for results where data is stored at specified site(s), i.e. query can specify "where site in [X]".
- Site: This queries for data stored at specified site(s), i.e. query can specify "where site in [X]".

Can you change for all?

**Member:**

Why do we want to do this? Typically we want to know where some data is, what other information can we get when we already know where to ask?

- Shard: This lets the query ask for results for a specific shard, i.e. query can specify "where shard in [S]".
**Member:**

Why do I need to do this? How is this different to getting information for a hash of a slice/block?

- Slice: This lets the query ask for a specific block of data, identified by multihash, with the result being the byte range within a shard containing the block of data.

Query results can be filtered to select only results that the user has access to, or for data supplied by specific data providers.
**Member:**

> Query results can be filtered to select only results that the user has access to

How does this work?

> or for data supplied by specific data providers.

How is this different from a query by site (as above)?


### Query Specification

```
match = {
  subject=[myID, myOrgID],
  location=spBaseURL,
}
results = Query(multihash="bafy...", match)
```
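
The query API shape is not yet specified; below is a minimal TypeScript sketch of what such a call might look like. All names here (`query`, `Match`, `QueryResult`, the DIDs and URL) are hypothetical, not part of any w3s spec.

```ts
// Hypothetical shapes for the query API; illustrative only.
interface Match {
  subject?: string[]   // DIDs the caller can act as, e.g. [myID, myOrgID]
  location?: string    // restrict results to a storage provider base URL
}

interface QueryResult {
  shard: string                                // shard (CAR) CID
  slice?: { offset: number; length: number }   // byte range within the shard
  location: string                             // URL from the location commitment
}

declare function query(multihash: string, match: Match): Promise<QueryResult[]>

const results = await query('bafy...', {
  subject: ['did:key:zMyID', 'did:key:zMyOrgID'],  // hypothetical DIDs
  location: 'https://sp.example.net/',             // hypothetical SP base URL
})
```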

### Query Operation Diagram

![w3up-query-diagram](w3up-query-diagram.png)

## Indexing System Components

The indexing components are intended for deployment on the w3up gateway. They may also be deployed to the w3up client, where they will not be shared with other clients.

### W3up Index

The w3up index layer is responsible for reading a query specification, and then (as sketched below):
1. Creating a set of IPNI queries needed to get the requested index content
2. Sending the queries to the w3up IPNI cache
3. Filtering returned results
4. Packaging results for the query client.
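
A minimal sketch of these four steps, reusing the `Match` and `QueryResult` shapes from the earlier sketch; `planIpniQueries`, `cache`, `matches`, and `toQueryResult` are hypothetical helpers, not part of any spec.

```ts
// Sketch of the four steps above; all helper names are hypothetical.
type QuerySpec = { multihash: string; match: Match }
type IpniQuery = unknown
type IpniResult = unknown

declare function planIpniQueries(spec: QuerySpec): IpniQuery[]        // step 1
declare const cache: { lookup(q: IpniQuery): Promise<IpniResult[]> }  // step 2
declare function matches(spec: QuerySpec, r: IpniResult): boolean     // step 3
declare function toQueryResult(r: IpniResult): QueryResult            // step 4

async function handleQuery(spec: QuerySpec): Promise<QueryResult[]> {
  const queries = planIpniQueries(spec)               // 1. build IPNI queries
  const responses = await Promise.all(
    queries.map((q) => cache.lookup(q))               // 2. send to the IPNI cache
  )
  return responses
    .flat()
    .filter((r) => matches(spec, r))                  // 3. filter returned results
    .map(toQueryResult)                               // 4. package for the client
}
```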

### W3up IPNI and Index Cache

The w3up IPNI cache holds IPNI query results and sharded-dag-index data. The cache is:

- Limited-size LRU. Discards the least recently used data when at storage capacity.
- Temporary. Evicts items that have been cached for longer than the cache expiration time, regardless of last access. This allows changes to query results to be seen.
- Negative (empty) result-aware. Caching empty results prevents more expensive IPNI queries for data that is not indexed. Negative cache entries are kept in a separate cache so that negative entries cannot evict positive entries due to LRU.
- Populated on write: When new or updated data is published to IPNI, the cache is populated. This replaces any previous cache entries, including negative ones.

Similar to negative cache entries, cached sharded-dag-indexes are kept in a separate LRU cache because their size will cause them to be evicted more frequently than IPNI results.

This is a read-through cache, meaning that if it does not hold the requested query results, it forwards the query on to IPNI and caches the results. This includes caching an empty response.
**Contributor:**

I think caching an empty response can be problematic as we may not have a record on first query which would get cached and prevent us from discovering records on subsequent queries. Perhaps short TTL on empty result could be a way to address this or some form of exponential backoff.

**Member:**

Hopefully addressed by read-on-write -- if we're capturing negative responses we need to write the cache when we publish IPNI.

Because otherwise @Gozala makes a very good point and we've seen this exact problem with IPNI and Cloudflare caching -- 404s getting cached for 10 minutes if you query too soon.

**Contributor (author):**

Having cache-on-write fixes the case where new data is stored subsequent to a query for that data. When the new data is stored, or location info is updated, that would remove any negative cache. Without cache-on-write then a short TTL is necessary. With cache-on-write, is there still a reason for a short TTL?

Negative cache entries should be kept in a separate cache so that a negative entry cannot evict a positive entry due to LRU, thereby removing the more valuable positive entries and allowing a misbehaving client to empty the cache.

Will clarify in doc.
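
A minimal sketch of this cache layout: separate LRU caches with TTLs for positive results, negative results, and sharded-dag-indexes, with read-through to IPNI on a miss. Sizes, TTLs, and the `queryIpni` helper are illustrative, and `IpniResult` reuses the placeholder type from the earlier sketch.

```ts
// Tiny LRU with per-entry TTL; Map insertion order tracks recency.
class TtlLru<V> {
  private entries = new Map<string, { value: V; storedAt: number }>()
  constructor(private maxEntries: number, private ttlMs: number) {}

  get(key: string): V | undefined {
    const e = this.entries.get(key)
    if (!e) return undefined
    if (Date.now() - e.storedAt > this.ttlMs) {
      this.entries.delete(key) // temporary: expire by age regardless of use
      return undefined
    }
    this.entries.delete(key) // re-insert to mark as most recently used
    this.entries.set(key, e)
    return e.value
  }

  set(key: string, value: V) {
    this.entries.delete(key) // re-inserting moves the key to most-recent
    if (this.entries.size >= this.maxEntries) {
      const lru = this.entries.keys().next().value // least recently used
      if (lru !== undefined) this.entries.delete(lru)
    }
    this.entries.set(key, { value, storedAt: Date.now() })
  }
}

declare function queryIpni(multihash: string): Promise<IpniResult[]>

// Separate caches so negative entries and large dag-indexes cannot evict
// positive IPNI results. Sizes and TTLs are illustrative.
const positive = new TtlLru<IpniResult[]>(100_000, 60 * 60_000)
const negative = new TtlLru<boolean>(100_000, 5 * 60_000)
const dagIndexes = new TtlLru<Uint8Array>(10_000, 60 * 60_000) // populated when a blob-index is fetched

// Read-through lookup that also caches empty responses.
async function lookup(multihash: string): Promise<IpniResult[]> {
  if (negative.get(multihash)) return []
  const hit = positive.get(multihash)
  if (hit) return hit
  const results = await queryIpni(multihash) // forward to IPNI on a miss
  if (results.length === 0) negative.set(multihash, true)
  else positive.set(multihash, results)
  return results
}
```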


There are two types of queries sent to the cache. Each takes a data multihash but returns a different result (shapes sketched below):
1. shard/slice query. Returns one or more: shard CID, slice CID, offset, length (in shard).
2. location query. Returns: location commitment of each shard.
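
Hypothetical result shapes for the two query types; the field names are illustrative, not from the spec.

```ts
// 1. shard/slice query result: where a block lives within a shard.
interface ShardSliceResult {
  shard: string     // shard CID
  slice: string     // slice (block) CID
  offset: number    // byte offset of the slice within the shard
  length: number    // byte length of the slice
}

// 2. location query result: where a shard itself can be retrieved.
interface LocationResult {
  shard: string        // shard CID
  commitment: unknown  // the shard's signed location commitment
  url: string          // retrieval URL taken from the commitment
}
```
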
**Contributor:**

I'm not sure why shard and slice queries are grouped into one type and location into the other; I suspect there is some reason which is probably worth outlining.

**Contributor:**

Separately, I think it would be nice if this layer was generic - agnostic to the kinds of query types. That is to say, it should not care what attribute (blob/shard, blob/slice or commitment/location) is queried by; it can simply derive the cache key and, on cache miss, translate the query into an IPNI query without worrying about the semantics.

Maybe that is the intention, but reading this I left with an impression that it's not generic over the attributes.

**Contributor (author):**

There is some need to know what kind of data is being retrieved, not for creating an IPNI query, but to know what to get out of the blob-index indicated in the results. With a shard/slice query, the resulting blob-index must be read to get the CID/offset/length of the shard or slice. With a location query, the resulting blob-index must be read to get the shard location commitments.

Can this be generic when asking for different things? This part probably requires more understanding of the types of answers that a client will ask for, so we can determine if there is a generic way to do this.

**Member:**

I'm not sure how smart this query/cache layer should be. I'd be more inclined to have it only allow querying, doing the joins and caching the result.


By using both of these query results, a specific portion of the blob data can be retrieved (as sketched below). The shard(s) are read from the locations in the location commitments, and data ranges within each shard are requested (when not reading an entire shard) by range requests using the offsets and lengths from the shard/slice data.
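
A sketch of that retrieval, using the hypothetical result shapes from above and an HTTP range request:

```ts
// Fetch one slice: the location result gives the shard URL, the
// shard/slice result gives the byte range within it.
async function fetchSlice(loc: LocationResult, s: ShardSliceResult): Promise<Uint8Array> {
  const res = await fetch(loc.url, {
    headers: { Range: `bytes=${s.offset}-${s.offset + s.length - 1}` },
  })
  if (!res.ok) throw new Error(`range request failed: ${res.status}`)
  return new Uint8Array(await res.arrayBuffer())
}
```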

If either shard or location results are not cached, then a query for the multihash is passed to the IPNI client.

#### Expiration Times
The shard/slice data does not change over the lifetime of the stored data, so this data can have a much longer expiration time. It could also have no expiration time and be explicitly removed from the cache when the data is no longer stored, but this would require a communication path from data removal to all cache locations. So, it is better to just have an expiration time, making the cache more independent.
**Contributor:**

I'm not sure we need to evict information about data even if we don't store that data. It could be that location of the data is not in our system (e.g. shared privately) while index of it is shared publicly. That is to say I don't think we need to worry about removing shard/slice info and instead let the user perform manual eviction when they want to do so. Specifically I imagine user would delete CAR holding an index which could trigger eviction

**Member:**

curious about a side issue here -- what do we do with two people who upload the same blob but publish different indexes? Seems like this would be relevant. Probably addressed if we add the "account paying" query param that the egress billing design calls for -- does make me wonder if we need to filter CID results not only on "are they W3UP" but also are "are they for the right person for this query?"


### IPNI
IPNI is the final, "cache-miss" layer of the W3up Indexing System. IPNI is a highly optimized system for caching indexing information for massive numbers of CIDs. It's available to all and optimized to run on bare metal hardware. IPNI is used to do the following:
1. Provide a reliable global index for all w3up content that is not only publicly available but usable by an IPFS retrieval client, without the other components of the w3up indexing system
2. Provide a low cost way to index content on w3up that is infrequently requested
#### Get w3up results

IPNI maps a multihash to location and metadata information. When queried with a multihash, IPNI responds with all known locations/metadata for that multihash. The w3up IPNI query client must filter these results to select only results that are under the control of w3up.

Example IPNI query and response:
```bash
curl https://cid.contact/cid/bafybeicawc3qwtlecld6lmtvsndimoz3446xyaprgsxvhd3aapwa2twnc4 -sS | jq
```
```json
{
  "MultihashResults": [
    {
      "Multihash": "EiBAsLcLTWQSx+WydZNGhjs75z18AfE0r1OPYAPsDU7NFw==",
      "ProviderResults": [
        {
          "ContextID": "YmFndXFlZXJheTJ2ZWJsZGNhY2JjM3Z0em94bXBvM2NiYmFsNzV3d3R0aHRyamhuaDdvN2o2c2J0d2xmcQ==",
          "Metadata": "gBI=",
          "Provider": {
            "ID": "QmQzqxhK82kAmKvARFZSkUVS6fo9sySaiogAnx5EnZ6ZmC",
            "Addrs": [
              "/dns4/elastic.dag.house/tcp/443/wss"
            ]
          }
        },
        {
          "ContextID": "YmFndXFlZXJheTJ2ZWJsZGNhY2JjM3Z0em94bXBvM2NiYmFsNzV3d3R0aHRyamhuaDdvN2o2c2J0d2xmcQ==",
          "Metadata": "oBIA",
          "Provider": {
            "ID": "QmUA9D3H7HeCYsirB3KmPSvZh3dNXMZas6Lwgr4fv1HTTp",
            "Addrs": [
              "/dns4/dag.w3s.link/tcp/443/https"
            ]
          }
        }
      ]
    }
  ]
}
```
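
A sketch of that filtering step, assuming the set of w3s publisher peer IDs is known out of band (the two IDs below are simply the ones from the example response above):

```ts
// Peer IDs of providers under w3up control; assumed known out of band.
const W3UP_PROVIDERS = new Set([
  'QmQzqxhK82kAmKvARFZSkUVS6fo9sySaiogAnx5EnZ6ZmC',
  'QmUA9D3H7HeCYsirB3KmPSvZh3dNXMZas6Lwgr4fv1HTTp',
])

interface ProviderResult {
  ContextID: string
  Metadata: string
  Provider: { ID: string; Addrs: string[] }
}

// Keep only results whose provider is a known w3up publisher.
function selectW3upResults(results: ProviderResult[]): ProviderResult[] {
  return results.filter((r) => W3UP_PROVIDERS.has(r.Provider.ID))
}
```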

#### Get sharded-dag-index

After selecting only w3up results, these must then be processed to generate responses for shard/slice and location query attributes. To do this, first the blob-index result is read from the IPNI result metadata. This gives the CID of the sharded-dag-index (aka blob-index), and the location portion of the result gives the base URL from which to retrieve the blob-index.

The blob-index is retrieved and read. The blob-index data is searched to find the multihash, and to get the shard and slice (if the multihash is a slice within a shard).
**Contributor:**

This goes in a different direction than what I was imagining. Specifically, I thought that the IPNI publisher would derive and publish advertisements from the sharded-dag-index so that a client querying IPNI would not have to fetch the index. The idea was that the multihash would be combined with the attribute/relation being queried, allowing the w3up IPNI cache to parallelize queries as opposed to having to block on index retrieval before it is able to pull out details.

Part of this was also motivated by desire to make things generic so new kinds of indexes could be introduced by end users or even us without affecting any of the layers of this system.

**Contributor:**

Perhaps a hybrid model could offer the best of both worlds? E.g. we could use the index identifier as a namespace and expect that all attributes under that namespace would be aggregated in the same index record. It would offer a little less flexibility than I was aiming for, but it could reduce the number of records published to IPNI. But then again, I'm not sure if optimizing for a reduced number of IPNI records at the expense of query flexibility is the right tradeoff.

**Member:**

I think this is not how IPNI is supposed to be used -- you can't throw the whole sharded dag index in the metadata, and the metadata is constant across all the CIDs in the index. But I think it presents a question about whether the "IPNI cache" is just an IPNI cache or if we should actually store cached blob-indexes as well. I think we should and I THINK that's what the design doc is saying.

**Member:**

I am not worried about the extra round trip on "cache miss" -- if we cache miss, the hot path is already over. (And to be clear, this is like a double cache miss -- we cache missed the whole retrieval being served from the Cloudflare cache, then when we set up the index query we cache missed the read of index data.) Again, this should be infrequent content, if we outgrow the cache at all (there's a scenario where the cache, at least for the IPNI query part as opposed to the blob-index part, could be never-evict).


#### Get locations of shards

After getting the shard and slice data, the location commitment data needs to be collected.

If the multihash is that of a whole shard, then the location commitment results are already in the IPNI results. If the multihash is a slice, then another IPNI query needs to be made to retrieve the location(s) of the shard, using the parent shard CID from the blob-index.

The location commitments are read from IPNI metadata and are returned to the cache.
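
A sketch of this lookup, reusing the result shapes and `selectW3upResults` from the earlier sketches; `ipniLookup` and `readLocationCommitment` are hypothetical helpers.

```ts
declare function ipniLookup(multihash: string): Promise<ProviderResult[]>
declare function readLocationCommitment(r: ProviderResult): LocationResult

async function locateShards(
  multihash: string,
  slice?: ShardSliceResult
): Promise<LocationResult[]> {
  // For a whole-shard multihash the location commitments are already in the
  // IPNI results; for a slice, query again using the parent shard CID taken
  // from the blob-index.
  const target = slice ? slice.shard : multihash
  const results = await ipniLookup(target)
  return selectW3upResults(results).map(readLocationCommitment)
}
```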

## IPNI Advertisement
Information is provided to IPNI in the form of [IPNI Advertisements](https://github.com/ipni/specs/blob/main/IPNI.md#advertisements) that are signed by w3s and published to IPNI by w3s. When a new advertisement is published, an [IPNI Announcement](https://github.com/ipni/specs/blob/main/IPNI.md#announcements) is sent to IPNI to trigger IPNI to fetch the new advertisement.

Separate advertisement chains are used for location commitments and for blob-index data.

The reason for this is that location information is subject to change, but blob index data will not change over the lifetime of the stored data.

### Advertisement Structure

An IPNI Advertisement has the IPLD Schema:

```ipldsch
type Advertisement struct {
  PreviousID optional Link
  Provider String
  Addresses [String]
  Signature Bytes
  Entries Link
  ContextID Bytes
  Metadata Bytes
  IsRm Bool
  ExtendedProvider optional ExtendedProvider
}

type EntryChunk struct {
  Entries [Bytes]
  Next optional Link
}
```

### IPNI Advertisement for sharded-dag-index

Map all multihashes to blob-index CID/location
ContextID = hash(userKey + blob-index CID)

- `ProviderID` libp2p peer ID of w3s blob-index publisher.
- `Addresses` the most recent value of this is returned with lookups for any CID in any advertisement. It will contain the multiaddr form of the blob-index publisher URL. This is where the blob-index is retrieved from when presented with the blob-index CID.
- `Signature` Signature over all advertisement data. Created using private key matching `ProviderID`.
- `ContextID` byte encoded userKey + blob-index CID.
- `Entries` must be the CID of an `EntryChunk` for a subset (or all) of the multihashes in the blob-index.
- `Metadata` encodes the CID of the blob-index plus version info, if any.
- `IsRm` is used when removing all advertisement(s) that have that context ID and deleting all associated multihash indexes from IPNI.

The set of multihashes within one blob must be encoded as 1 Advertisement. If a blob is partitioned into multiple sub-blobs this still generates one advertisement as there is still one blob index.

Each advertisement encodes the multihashes into 1 or more EntryChunk blocks. Each EntryChunk block is an array of multihashes with a link to the next block, if there is another block. The EntryChunk serves a similar purpose as HTTP chunking.

All multihashes must be sorted in ascending order before being split into chunks. This is done to have deterministic output when creating entry chunks - the same set of multihashes should generate the same set of entry chunks (see the sketch below).
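
A sketch of deterministic entry-chunk creation under those rules; the chunk size and helper names are illustrative, and the `Next` links (CIDs of the following chunk blocks) are elided.

```ts
// Lexicographic byte comparison for sorting multihashes ascending.
function compareBytes(a: Uint8Array, b: Uint8Array): number {
  const n = Math.min(a.length, b.length)
  for (let i = 0; i < n; i++) {
    if (a[i] !== b[i]) return a[i] - b[i]
  }
  return a.length - b.length
}

interface EntryChunk {
  Entries: Uint8Array[]
  Next?: unknown // link (CID) to the next chunk block, if any
}

// Sort, then split into fixed-size chunks: the same multihash set always
// yields the same chunks.
function toEntryChunks(multihashes: Uint8Array[], chunkSize = 16_384): EntryChunk[] {
  const sorted = [...multihashes].sort(compareBytes)
  const chunks: EntryChunk[] = []
  for (let i = 0; i < sorted.length; i += chunkSize) {
    chunks.push({ Entries: sorted.slice(i, i + chunkSize) })
  }
  return chunks
}
```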

### IPNI Advertisement for location commitment

Map shard-multihash to shard CID/location
ContextID = hash(userKey + shard CID)

- `ProviderID` libp2p peer ID of w3s location publisher.
- `Addresses` multiaddr of location commitment URL. Only used if location commitment does not fit into metadata.
- `ContextID` byte encoded userKey + shard CID
- `Entries` must be the CID of an `EntryChunk` that contains the shard CID.
- `Metadata` encodes the shard's location commitment.
- `IsRm` is used when removing the shard from storage

The shard CID is used to construct the context ID because individual shards may change location, so only the index information for that shard needs to change.

NOTE: Consider putting the non URL portion of the location commitment into the `Addresses` field and the rest in the metadata.

## Blob Deletion

Removal advertisements are placed on both the blob-index chain and the location chain to delete the blob-index and shard(s) for the user.

To get the shard CID(s), the blob-index must first be read. This may be read from cache, as it does not change. The shard CIDs are read from the blob-index (see the sketch below).
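
A sketch of that flow; `contextId`, `readBlobIndex`, and `publishAdvertisement` are hypothetical helpers, with `contextId` standing in for the `hash(userKey + CID)` construction described above.

```ts
declare function contextId(userKey: Uint8Array, cid: string): Uint8Array // hash(userKey + CID)
declare function readBlobIndex(blobIndexCid: string): Promise<{ shards: string[] }>
declare function publishAdvertisement(
  chain: 'blob-index' | 'location',
  ad: { ContextID: Uint8Array; IsRm: boolean }
): Promise<void>

async function deleteBlob(userKey: Uint8Array, blobIndexCid: string): Promise<void> {
  // Read shard CIDs from the (possibly cached) blob-index.
  const { shards } = await readBlobIndex(blobIndexCid)
  // Removal advertisement on the blob-index chain.
  await publishAdvertisement('blob-index', {
    ContextID: contextId(userKey, blobIndexCid),
    IsRm: true,
  })
  // Removal advertisement on the location chain for each shard.
  for (const shard of shards) {
    await publishAdvertisement('location', {
      ContextID: contextId(userKey, shard),
      IsRm: true,
    })
  }
}
```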

## Dependable Reads

Dependable reads are required for ensuring that updates have completed. This means that following a write of index data, that data can reliably be read. This makes it necessary to have a synchronous write/update operation, after which reads will return the data that was written.

This synchronous write path can be something supported by IPNI. However, that would require a non-standard indexer feature or extension of the protocol. This feature would not be something that would be generally desirable for a global indexer, and may only be appropriate for a site-local indexer.

Another way to support a synchronous write path is to write to the IPNI cache on write, and guarantee that the data remains in the cache until it is indexed in IPNI. An advantage of this write path is that it does not rely on any non-standard behavior of IPNI. It is also more reliable, since the critical path happens on w3s infrastructure. This ensures that we control the time needed for a read to succeed.

### Cache on write

When an advertisement is created to publish new index data, the IPNI cache is populated with the data retrievable by index queries. An attribute is set on this data that prevents it from being evicted from the cache until it has been verified to be indexed. This ensures that after cache eviction, lookup of the now uncached data will succeed.
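
A minimal sketch of cache-on-write with pinning; the cache interface and `announceToIpni` are hypothetical, and `IpniResult` is the placeholder type from the earlier sketches.

```ts
declare const cache: {
  set(key: string, results: IpniResult[]): void // replaces any entry, incl. negative
  pin(key: string): void   // exempt from LRU eviction and TTL expiry
  unpin(key: string): void // normal eviction rules apply again
}
declare function announceToIpni(advertisementCid: string): Promise<void>

// On publish: populate the cache first, pin until IPNI has the data.
async function publishAndCache(key: string, results: IpniResult[], adCid: string) {
  cache.set(key, results)
  cache.pin(key)
  await announceToIpni(adCid)
}

// Once the advertisement is verified as indexed by IPNI, unpin.
function onVerifiedIndexed(key: string) {
  cache.unpin(key)
}
```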

## Design questions

1. What are the trade-offs of using IPNI to look up location commitments, as opposed to bundling these with blob-index data?
2. Deploy on w3s gateway, client, both?
3. How can the IPNI cache be distributed? Should its contents be shared on IPFS?
Binary file added rfc/w3up-query-diagram.png