Indexing design for w3s #29
🤔 This got me thinking that perhaps we don't have to require a separate blob/add for the index, and could instead allow encoding it along with the content shard itself.
One reason to keep them separate is because the index describes all shards, and shards may be stored in separate locations. If only the 3rd shard is needed then only the index and that shard need to be read.
It may be acceptable to encode the index as part of the first shard... but then how does the client know how much of the first shard to read? Would there be a fixed-length header?
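One way to answer the fixed-length header question: a short, fixed-size length prefix at the front of the first shard tells the client exactly how many more bytes to read for the embedded index. This is only a sketch of a hypothetical layout (the 8-byte big-endian prefix and function names are assumptions, not anything in the current spec):

```typescript
// Hypothetical layout (not part of the spec): the first shard begins with an
// 8-byte big-endian length prefix, then the encoded index, then the CAR bytes.

/** Prepend an encoded index to a shard, with a fixed-length header. */
function prependIndex(index: Uint8Array, shard: Uint8Array): Uint8Array {
  const out = new Uint8Array(8 + index.length + shard.length)
  new DataView(out.buffer).setBigUint64(0, BigInt(index.length)) // big-endian
  out.set(index, 8)
  out.set(shard, 8 + index.length)
  return out
}

/** A client reads the fixed 8-byte prefix first, then knows exactly how many
 *  more bytes to fetch (e.g. via an HTTP Range request) for the index. */
function readIndex(head: Uint8Array): { index: Uint8Array; shardOffset: number } {
  const view = new DataView(head.buffer, head.byteOffset)
  const len = Number(view.getBigUint64(0))
  return { index: head.subarray(8, 8 + len), shardOffset: 8 + len }
}
```

This also sidesteps the recursion concern below: the index is not itself part of the CAR it describes, only concatenated in front of it.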
Isn't it a recursive issue - you can't create an index for a CAR that includes the index...
"lets the query ask" does not read well to me - a query is a question.
Can you change this throughout?
Why do we want to do this? Typically we want to know where some data is; what other information can we get when we already know where to ask?
Why do I need to do this? How is this different from getting information for a hash of a slice/block?
How does this work?
How is this different from a query by site (as above)?
I think caching an empty response can be problematic: we may not have a record on the first query, and the cached empty result would then prevent us from discovering records on subsequent queries. A short TTL on empty results, or some form of exponential backoff, could be a way to address this.
Hopefully addressed by cache-on-write -- if we're capturing negative responses, we need to write the cache when we publish to IPNI.
Because otherwise @Gozala makes a very good point, and we've seen this exact problem with IPNI and Cloudflare caching -- 404s getting cached for 10 minutes if you query too soon.
Having cache-on-write fixes the case where new data is stored subsequent to a query for that data. When the new data is stored, or location info is updated, that would remove any negative cache entry. Without cache-on-write, a short TTL is necessary. With cache-on-write, is there still a reason for a short TTL?
Negative cache entries should be kept in a separate cache so that a negative entry cannot evict a positive entry due to LRU, thereby removing the more valuable positive entries and allowing a misbehaving client to empty the cache.
Will clarify in doc.
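A sketch of the two-cache idea discussed above, with a short negative TTL and cache-on-write clearing stale negative entries. The class name and TTL values are invented for illustration:

```typescript
// Positive results live in one map, negative ("no record") results in a
// separate one, so a flood of misses cannot evict valuable positive entries.

type Entry<T> = { value: T; expires: number }

class QueryCache<T> {
  private positive = new Map<string, Entry<T>>()
  private negative = new Map<string, Entry<null>>()

  constructor(
    private positiveTtlMs = 24 * 60 * 60 * 1000, // long-lived
    private negativeTtlMs = 30 * 1000            // short TTL for empty results
  ) {}

  /** Returns the value, null for a cached "no record", or undefined on a true miss. */
  get(key: string, now = Date.now()): T | null | undefined {
    const pos = this.positive.get(key)
    if (pos && pos.expires > now) return pos.value
    const neg = this.negative.get(key)
    if (neg && neg.expires > now) return null // cached negative: skip IPNI
    return undefined // true miss: go to IPNI
  }

  set(key: string, value: T | null, now = Date.now()): void {
    if (value === null) {
      this.negative.set(key, { value: null, expires: now + this.negativeTtlMs })
    } else {
      // cache-on-write: storing a real record clears any stale negative entry
      this.negative.delete(key)
      this.positive.set(key, { value, expires: now + this.positiveTtlMs })
    }
  }
}
```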
I'm not sure why shard and slice queries are grouped into one type and the location into the other; I suspect there is some reason, which is probably worth outlining.
Separately, I think it would be nice if this layer was generic -- agnostic to the kinds of query types. That is to say, it should not care whether the attribute being queried by is `blob/shard`, `blob/slice`, or `commitment/location`; it can simply derive the cache key and, on a cache miss, translate the query into an IPNI query without worrying about the semantics. Maybe that is the intention, but reading this I'm left with the impression that it's not generic over the attributes.
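The attribute-agnostic derivation described here could be as small as this sketch (the hashing scheme and function name are assumptions, not anything specified):

```typescript
import { createHash } from 'node:crypto'

// The cache layer derives a key purely from (multihash, attribute) without
// knowing what the attribute means. Any stable derivation works; hashing
// keeps keys fixed-length and avoids delimiter ambiguity between inputs.
function cacheKey(multihash: Uint8Array, attribute: string): string {
  return createHash('sha256')
    .update(attribute)
    .update(Uint8Array.of(0)) // separator byte so 'a'+'bc' ≠ 'ab'+'c'
    .update(multihash)
    .digest('hex')
}
```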
There is some need to know what kind of data is being retrieved, not for creating an IPNI query, but to know what to get out of the blob-index indicated in the results. With a shard/slice query, the resulting blob-index must be read to get the CID/offset/length of the shard or slice. With a location query, the resulting blob-index must be read to get the shard location commitments.
Can this be generic when asking for different things? This part probably requires more understanding of the types of answers that a client will ask for, so we can determine if there is a generic way to do this.
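To make the non-generic part concrete: the query kind determines what is pulled out of the retrieved blob-index, even if the lookup itself is generic. The types below are invented stand-ins, not the actual sharded-dag-index schema:

```typescript
// Illustrative only: field and type names are assumptions.
type Position = { shard: string; offset: number; length: number }

type BlobIndex = {
  slices: Map<string, Position>   // slice multihash (hex) → position in a shard
  locations: Map<string, string>  // shard CID → location commitment
}

type Query =
  | { kind: 'blob/slice'; slice: string }
  | { kind: 'commitment/location'; shard: string }

// The same blob-index is read either way, but result extraction must
// understand the query kind -- this is the part that isn't attribute-agnostic.
function resolve(index: BlobIndex, q: Query): Position | string | undefined {
  switch (q.kind) {
    case 'blob/slice':
      return index.slices.get(q.slice)
    case 'commitment/location':
      return index.locations.get(q.shard)
  }
}
```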
I'm not sure how smart this query/cache layer should be. I'd be more inclined to have it only allow querying, doing the joins and caching the result.
I'm not sure we need to evict information about data even if we don't store that data. It could be that the location of the data is not in our system (e.g. shared privately) while its index is shared publicly. That is to say, I don't think we need to worry about removing shard/slice info; instead, let the user perform manual eviction when they want to. Specifically, I imagine the user would delete the CAR holding an index, which could trigger eviction.
Curious about a side issue here -- what do we do when two people upload the same blob but publish different indexes? Seems like this would be relevant. It's probably addressed if we add the "account paying" query param that the egress billing design calls for -- but it does make me wonder if we need to filter CID results not only on "are they W3UP" but also on "are they for the right person for this query?"
This goes in a different direction than what I was imagining. Specifically, I thought that the IPNI publisher would derive and publish advertisements from the sharded-dag-index, so that a client querying IPNI would not have to fetch the index. The idea was that the multihash would be combined with the attribute/relation being queried, allowing the W3Up IPNI cache to parallelize queries, as opposed to having to block on index retrieval before it is able to pull out details.
Part of this was also motivated by a desire to make things generic, so new kinds of indexes could be introduced by end users, or even us, without affecting any of the layers of this system.
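A sketch of this direction, with invented field and function names: each slice in the index becomes its own entry keyed by a digest of the attribute plus the multihash, so the answer lives directly in the advertisement and no index fetch is needed before responding:

```typescript
import { createHash } from 'node:crypto'

// Illustrative only: SliceEntry fields and the key derivation are assumptions.
type SliceEntry = { multihash: string; shard: string; offset: number; length: number }

/** Derive one advertisement entry per slice, keyed by attribute + multihash,
 *  so a lookup returns the position directly instead of pointing at an index. */
function advertisementEntries(slices: SliceEntry[]): Map<string, SliceEntry> {
  const entries = new Map<string, SliceEntry>()
  for (const s of slices) {
    const key = createHash('sha256')
      .update('blob/slice')
      .update(s.multihash)
      .digest('hex')
    entries.set(key, s)
  }
  return entries
}
```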
Perhaps a hybrid model could offer the best of both worlds? E.g. we could use the index identifier as a namespace and expect that all attributes under that namespace would be aggregated in the same index record. It would offer a little less flexibility than I was aiming for, but it could reduce the number of records published to IPNI. Then again, I'm not sure if optimizing for a reduced number of IPNI records at the expense of query flexibility is the right tradeoff.
I think this is not how IPNI is supposed to be used -- you can't throw the whole sharded-dag-index in the metadata, and the metadata is constant across all the CIDs in the index. But I think it raises a question about whether the "IPNI cache" is just an IPNI cache or whether we should actually store cached blob-indexes as well. I think we should, and I THINK that's what the design doc is saying.
I am not worried about the extra round trip on a cache miss -- if we cache miss, the hot path is already over. (And to be clear, this is like a double cache miss: we cache missed the whole retrieval being served from the Cloudflare cache, then when we set up the index query we cache missed the read of index data.) Again, this should be infrequent content, if we outgrow the cache at all (there's a scenario where the cache, at least for the IPNI query part as opposed to the blob-index part, could be never-evict).