Indexing design for w3s #29
base: main
Conversation
Provided some feedback inline
rfc/indexing-design.md
Outdated
The user creates a blob-index ([sharded-dag-index](https://github.com/w3s-project/specs/blob/main/w3-index.md#sharded-dag-index-example)) that contains the multihash of each shard of blob data, and an array of tuples of (multihash, offset, length) for all the blocks in each shard. The user may choose which blocks to include in the blob-index if they do not want all blocks to be indexed. For example, they may want only the DAG root to be indexed.

The blob-index is stored in the user's space along with the CAR file(s) containing the blob data. This is done in the [w3 add-blob operation](https://github.com/w3s-project/specs/blob/main/w3-blob.md#add-blob). The w3s service creates [location commitments](https://github.com/w3s-project/specs/blob/main/w3-blob.md#location-commitment) for the blob shards and for the blob-index.
🤔 This got me thinking that perhaps we don't have to require a separate blob/add for the index and could instead allow encoding it along with the content shard itself.
One reason to keep them separate is because the index describes all shards, and shards may be stored in separate locations. If only the 3rd shard is needed then only the index and that shard need to be read.
It may be acceptable to encode the index as part of the first shard... but then how does the client know how much of the first shard to read? Would there be a fixed-length header?
Isn't it a recursive issue - you can't create an index for a CAR that includes the index...
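For illustration, a minimal TypeScript sketch of the blob-index shape described in the quoted hunk above: one multihash per shard plus (multihash, offset, length) tuples for the indexed blocks. Field names are assumptions for readability, not the dag-cbor schema defined in w3-index.md.

```ts
// Illustrative field names only -- not the dag-cbor schema from w3-index.md.
type Multihash = Uint8Array

interface SlicePosition {
  multihash: Multihash // hash of a block inside the shard
  offset: number       // byte offset of the block within the shard
  length: number       // byte length of the block
}

interface ShardIndex {
  shard: Multihash        // multihash of the shard (e.g. a CAR file)
  slices: SlicePosition[] // only the blocks the user chose to index (possibly just the DAG root)
}

interface BlobIndex {
  content: Multihash   // multihash identifying the whole blob
  shards: ShardIndex[] // one entry per shard of the blob data
}
```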
rfc/indexing-design.md
Outdated
The queryable attributes are as follows.
- Site: the location(s) where all blob data is available
- Shard: the location(s) where a specific shard of data is available
I'm not sure how this and the above differ
Site: The idea is that the query can specify "where site in [X]" so that the query response will only contain results that have something stored at a site in [X].
Shard: The idea is that a query can specify to retrieve results for requested shards, i.e. "where shard in [S]", so only results with locations storing the requested shard are returned.
I will try to clarify.
rfc/indexing-design.md
Outdated
The queryable attributes are as follows.
- Site: the location(s) where all blob data is available
- Shard: the location(s) where a specific shard of data is available
- Slice: the location(s) within a shard where a block of data is available
I think use of location, while technically correct, is confusing. It's probably best to use one term for location in the sense of a network address, and a different term when referencing a byte segment within a shard.
Suggested change:
- Slice: the byte range within a shard containing a block of data
yea location will naturally be overloaded, but since we use location commitment as part of this design let's try to avoid the word elsewhere.
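To make the site/shard/slice distinction concrete, here is a hypothetical sketch of how these attributes might appear as query filters and results; all names are illustrative, not part of the spec.

```ts
// Hypothetical shapes only -- names are not from the spec.
interface IndexQuery {
  multihash: Uint8Array // the content being looked up
  sites?: string[]      // "where site in [X]": only results stored at one of these sites
  shards?: Uint8Array[] // "where shard in [S]": only results for these shards
}

interface SliceAnswer {
  shard: Uint8Array // the shard containing the block
  offset: number    // byte range within that shard...
  length: number    // ...holding the requested block
}
```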
rfc/indexing-design.md
Outdated
1. Creating a set of IPNI queries needed to get the requested index content
2. Sending the queries to the w3up IPNI cache
3. Filtering returned results
4. Presenting results to the query client.
Not sure if this improves it, but I think something about aggregating results is worth capturing
Suggested change:
4. Packaging results for the query client.
+1
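A rough sketch of how the four steps above might compose; every function and type here is a placeholder, not an existing API.

```ts
// Placeholder names throughout -- none of these functions exist yet.
type IndexQuery = { multihash: Uint8Array; sites?: string[]; shards?: Uint8Array[] }
type IpniResult = { provider: string; metadata: Uint8Array }
type QueryResponse = unknown

declare const cache: { resolve(q: unknown): Promise<IpniResult[]> }
declare function buildIpniQueries(query: IndexQuery): unknown[]
declare function isW3upResult(r: IpniResult): boolean
declare function packageResults(q: IndexQuery, rs: IpniResult[]): QueryResponse

async function handleQuery(query: IndexQuery): Promise<QueryResponse> {
  // 1. Create the set of IPNI queries needed for the requested index content.
  const ipniQueries = buildIpniQueries(query)
  // 2. Send the queries to the w3up IPNI cache (read-through to IPNI on a miss).
  const results = await Promise.all(ipniQueries.map((q) => cache.resolve(q)))
  // 3. Filter returned results down to w3up-published records.
  const w3upResults = results.flat().filter(isW3upResult)
  // 4. Package the filtered results for the query client.
  return packageResults(query, w3upResults)
}
```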
There are two types of queries sent to the cache. Each takes a data multihash, but returns a different result:
1. shard/slice query. Returns one or more: shard CID, slice CID, offset, length (in shard).
2. location query. Returns: location commitment of each shard.
I'm not sure why shard and slice queries are grouped into one type and the location into the other; I suspect there is some reason, which is probably worth outlining.
Separately, I think it would be nice if this layer was generic - agnostic to kinds of query types. That is to say, it should not care whether the attribute being queried is blob/shard, blob/slice, or commitment/location; it can simply derive the cache key and, on a cache miss, translate the query into an IPNI query without worrying about the semantics.
Maybe that is the intention, but reading this I was left with the impression that it's not generic over the attributes.
There is some need to know what kind of data is being retrieved, not for creating an IPNI query, but to know what to get out of the blob-index indicated in the results. With a shard/slice query, the resulting blob-index must be read to get the CID/offset/length of the shard or slice. With a location query, the resulting blob-index must be read to get the shard location commitments.
Can this be generic when asking for different things? This part probably requires more understanding of the types of answers that a client will ask for, so we can determine if there is a generic way to do this.
I'm not sure how smart this query/cache layer should be. I'd be more inclined to have it only allow querying, doing the joins and caching the result.
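For reference, a hypothetical sketch of the two query types and their result shapes, as described in the quoted hunk; field names are assumptions.

```ts
// Field names are assumptions, not spec terms.
type ShardSliceResult = {
  shard: string  // CID of the shard containing the multihash
  slice: string  // CID of the slice (block) within that shard
  offset: number // byte offset of the slice within the shard
  length: number // byte length of the slice
}

type LocationResult = {
  shard: string               // CID of a shard
  locationCommitment: unknown // signed location commitment for that shard
}

// Both query types take the same input: the multihash of the data being looked up.
type CacheQuery =
  | { kind: 'shard/slice'; multihash: Uint8Array }
  | { kind: 'location'; multihash: Uint8Array }
```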
rfc/indexing-design.md
Outdated
- Temporary. Evicts items that have been cached for longer than the cache expiration time, regardless of last access. This allows changes to query results to be seen.
- Negative (empty) result-aware. Caching empty results prevents more expensive IPNI queries for data that is not indexed.

This is a read-through cache, meaning if it does not hold the requested query results, then it forwards the query on to IPNI, and caches the results. This includes caching an empty response.
I think caching an empty response can be problematic: we may not have a record on the first query, which would get cached and prevent us from discovering records on subsequent queries. Perhaps a short TTL on empty results, or some form of exponential backoff, could address this.
Hopefully addressed by cache-on-write -- if we're capturing negative responses we need to write the cache when we publish to IPNI.
Because otherwise @Gozala makes a very good point and we've seen this exact problem with IPNI and Cloudflare caching -- 404s getting cached for 10 minutes if you query too soon.
Having cache-on-write fixes the case where new data is stored subsequent to a query for that data. When the new data is stored, or location info is updated, that would remove any negative cache entry. Without cache-on-write, a short TTL is necessary. With cache-on-write, is there still a reason for a short TTL?
Negative cache entries should be kept in a separate cache so that a negative entry cannot evict a positive entry due to LRU, thereby removing the more valuable positive entries and allowing a misbehaving client to empty the cache.
Will clarify in doc.
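A minimal sketch of the caching behavior discussed in this thread: read-through lookup, TTL-based expiry, a separate negative cache so empty results cannot evict positive entries, and cache-on-write clearing stale negative entries. Structure, names, and TTL handling are assumptions, not a decided implementation.

```ts
// Minimal sketch; structure and TTLs are assumptions, not decided values.
type Entry<T> = { value: T; storedAt: number }

class ReadThroughCache<T> {
  private positive = new Map<string, Entry<T[]>>()
  private negative = new Map<string, Entry<null>>() // separate so misses can't evict hits

  constructor(
    private lookup: (key: string) => Promise<T[]>, // forwarded to IPNI on a miss
    private positiveTtlMs: number,
    private negativeTtlMs: number // much shorter, or cleared by cache-on-write
  ) {}

  async get(key: string): Promise<T[]> {
    const now = Date.now()
    const hit = this.positive.get(key)
    if (hit && now - hit.storedAt < this.positiveTtlMs) return hit.value

    const miss = this.negative.get(key)
    if (miss && now - miss.storedAt < this.negativeTtlMs) return []

    const results = await this.lookup(key) // read-through to IPNI
    if (results.length > 0) this.positive.set(key, { value: results, storedAt: now })
    else this.negative.set(key, { value: null, storedAt: now })
    return results
  }

  // Cache-on-write: called when we publish to IPNI so a stale negative entry
  // does not hide newly indexed data.
  put(key: string, results: T[]): void {
    this.negative.delete(key)
    this.positive.set(key, { value: results, storedAt: Date.now() })
  }
}
```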
rfc/indexing-design.md
Outdated
If either shard or location results are not cached, then a query for the multihash is passed to the IPNI client.

#### Expiration Times

The shard/slice data does not change over the lifetime of the stored data, so this data can have a much longer expiration time. It could also have no expiration time and be explicitly removed from cache when the data is no longer stored, but this means that there needs to be a communication path from data removal to all cache locations. So, it is better to just have an expiration time, making the cache more independent.
I'm not sure we need to evict information about data even if we don't store that data. It could be that the location of the data is not in our system (e.g. shared privately) while its index is shared publicly. That is to say, I don't think we need to worry about removing shard/slice info; instead we can let the user perform manual eviction when they want to. Specifically, I imagine the user would delete the CAR holding an index, which could trigger eviction.
curious about a side issue here -- what do we do with two people who upload the same blob but publish different indexes? Seems like this would be relevant. Probably addressed if we add the "account paying" query param that the egress billing design calls for -- does make me wonder if we need to filter CID results not only on "are they W3UP" but also "are they for the right person for this query?"
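As a concrete illustration of the differing expiration times discussed above, one possible TTL configuration; the values are placeholders, not numbers the RFC commits to.

```ts
// Placeholder values only; the RFC does not fix concrete numbers.
const cacheTtls = {
  shardSliceMs: 7 * 24 * 60 * 60 * 1000, // index data is immutable, so a long TTL is fine
  locationMs: 60 * 60 * 1000,            // location commitments can change or expire sooner
  negativeMs: 60 * 1000,                 // keep "not found" answers short-lived (or clear on write)
}
```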
rfc/indexing-design.md
Outdated
After selecting only w3up results, these must then be processed to generate responses for shard/slice and location query attributes. To do this, first the blob-index result is read from the IPNI result metadata. This gets the CID of the sharded-dag-index (aka blob-index) and the location portion of the result gives the base URL to retrieve the blob-index from.

The blob-index is retrieved and read. The blob-index data is searched to find the multihash, and to get the shard and slice (if the multihash is a slice within a shard).
This goes in a different direction than what I was imagining. Specifically, I thought that the IPNI publisher would derive and publish advertisements from the sharded-dag-index so that a client querying IPNI would not have to fetch the index. The idea was that the multihash would be combined with the attribute / relation being queried, allowing the W3Up IPNI cache to parallelize queries as opposed to having to block on index retrieval before it is able to pull out details.
Part of this was also motivated by a desire to make things generic, so new kinds of indexes could be introduced by end users, or even us, without affecting any of the layers of this system.
Perhaps a hybrid model could offer the best of both worlds? E.g. we could use the index identifier as a namespace and expect that all attributes under that namespace would be aggregated in the same index record. It would offer a little less flexibility than I was aiming for, but it could reduce the number of records published to IPNI. But then again I'm not sure if optimizing for a reduced number of IPNI records at the expense of query flexibility is the right tradeoff.
I think this is not how IPNI is supposed to be used -- you can't throw the whole sharded-dag-index in the metadata, and the metadata is constant across all the CIDs in the index. But I think it presents a question about whether the "IPNI cache" is just an IPNI cache or if we should actually store cached blob-indexes as well. I think we should, and I THINK that's what the design doc is saying.
I am not worried about the extra round trip on "cache miss" -- if we cache miss, the hot path is already over. (And to be clear, this is like a double cache miss -- we cache missed the whole retrieval being served from Cloudflare cache, then when we set up the index query we cache missed the read of index data.) Again, this should be infrequent content, if we outgrow the cache at all (there's a scenario where the cache, at least for the IPNI query part as opposed to the blob-index part, could be never-evict).
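A hypothetical sketch of the result-processing step described in the quoted hunk: read the sharded-dag-index CID from the IPNI result metadata, fetch the index (ideally from cache), and search it for the multihash. All names are assumptions, and the fetch may be served from the blob-index cache discussed elsewhere in this thread rather than an external round trip.

```ts
// Hypothetical sketch of turning an IPNI result into a shard/slice answer.
type IpniResult = { metadata: Uint8Array; baseUrl: string }
type BlobIndex = /* parsed sharded-dag-index */ {
  shards: { shard: string; slices: { multihash: string; offset: number; length: number }[] }[]
}

declare function decodeIndexCid(metadata: Uint8Array): string
declare function fetchBlobIndex(baseUrl: string, indexCid: string): Promise<BlobIndex>

async function resolveShardSlice(result: IpniResult, multihash: string) {
  const indexCid = decodeIndexCid(result.metadata)             // CID of the sharded-dag-index
  const index = await fetchBlobIndex(result.baseUrl, indexCid) // ideally a cache hit
  for (const { shard, slices } of index.shards) {
    const slice = slices.find((s) => s.multihash === multihash)
    if (slice) return { shard, ...slice } // shard CID + offset/length of the slice
  }
  return null // multihash not present in this index
}
```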
Overall, excellent. This is a really good design and I think we're close.
I have comments about queries I'll put in the query PR.
However, the big thing that needs clarification is the caching behavior for blob-index/sharded-dag-indexes -- to me, I think we need to cache these for super high value content, so that the fast path for a completely cached index query never has a read to an external storage provider (remember that eventually our storage nodes could have higher latency from the gateway). I'm open to debate here but it seems implied in various places.
I think this is basically ready to start implementing, but let's keep going a little more.
rfc/indexing-design.md
Outdated
### W3up IPNI Cache

The w3up IPNI Cache is a cache that holds IPNI query results. The cache is:
So this is my most substantive comment:
Is this an IPNI cache? Or is it a IPNI + sharded-dag-index cache?
As I understand it, IPNI results will ONLY contain a location to retrieve a Sharded Dag Index.
You'll need to read that sharded-dag-index / blob-index separately. (Sidebar: I'm finding the distinction very difficult to understand -- can we have a terms section?)
So are we caching that? I think the answer is yes, cause otherwise every query contains an extra round trip. I imagine these sharded-dag-indexes are evicted more often than the IPNI queries, since they're going to be larger. But I think to support a shard/slice query without blocking on an additional round trip, we should be caching these.
And if we're caching indexes, I think we should be explicit and not call this an IPNI cache.
Yes, this is an IPNI + sharded-dag-index cache. I need to make that clear in the document.
OK, now I read @Gozala's PR and I get his confusion and how it relates to my general question about caching the blob-indexes, which would not be needed if we implement what he suggests. It took me a couple of reads to understand what I believe he's suggesting, which may actually be a good idea. We'll see! I set up a quick sync to get us on the same page.
@alanshaw can you take a look at this when you have a chance? @gammazero after Alan's review I'd like to promote this to the system architecture repo. We can continue to evolve as we go. I think it would be good to capture some of the various outcomes of today's discussions as well. (written summary below)
rfc/indexing-design.md
Outdated
After the add-blob process completes, and the user receives an [accept-blob receipt](https://github.com/w3s-project/specs/blob/main/w3-blob.md#accept-blob-receipt), the user may then choose to make their data publicly queryable by publishing it to W3UP's indexing system. The user optionally invokes an [index-add](https://github.com/w3s-project/specs/blob/main/w3-index.md#index-add) capability to publish the sharded-dag-index multihashes into W3UP's indexing system and eventually to [IPNI](https://github.com/ipni/specs/blob/main/IPNI.md) so that they can be used to look up location commitments and retrieve sharded-dag-index information and blob data. See [W3 Index](https://github.com/w3s-project/specs/blob/main/w3-index.md#w3-index) for more.

After publishing an index, any user looking for content can query for the sharded-dag-index of that content. The user can read the sharded-dag-index to determine what data shard needs to be retrieved, and then asks for the location of that shard.
Suggested change:
After publishing an index, any user looking for content can query for the sharded-dag-index of that content. The user can read the sharded-dag-index to determine which blob(s) the data needs to be read from, and then asks for the location of those blob(s).
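A rough sketch of the publish-then-query flow in the quoted hunk. The index/add invocation shape is only an approximation of what w3-index.md describes, and the helper functions are hypothetical.

```ts
// Rough sketch; invocation shape approximates w3-index.md, helpers are hypothetical.
declare function invoke(capability: {
  can: string
  with: string // the user's space DID
  nb: Record<string, unknown>
}): Promise<unknown>

declare function queryIndex(multihash: Uint8Array): Promise<unknown>

async function publishAndQuery(spaceDid: string, indexCid: string, contentMultihash: Uint8Array) {
  // Optionally publish the sharded-dag-index so the content becomes publicly queryable.
  await invoke({ can: 'index/add', with: spaceDid, nb: { index: indexCid } })

  // Later, any user can query for the sharded-dag-index of the content, read it
  // to find the shard(s) holding the data, and then ask for their locations.
  return queryIndex(contentMultihash)
}
```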
📽️ Preview
Functional design for w3 indexing.
Replaces the previous W3 IPNI Indexing RFC.