
Indexing design for w3s #29

Open · wants to merge 31 commits into main

Conversation

@gammazero (Contributor) commented May 29, 2024

📽️ Preview

Functional design for w3 indexing.

Replaces the previous W3 IPNI Indexing RFC.

@gammazero requested review from Gozala and vasco-santos on May 29, 2024
@Gozala (Contributor) left a comment:

Provided some feedback inline


The user creates a blob-index ([sharded-dag-index](https://github.com/w3s-project/specs/blob/main/w3-index.md#sharded-dag-index-example)) that contains the multihash of each shard of blob data, and an array of tuples of (multihash, offset, length) for all the blocks in each shard. The user may choose which blocks to include in the blob-index if they do not want all blocks to be indexed. For example, they may want only the DAG root to be indexed.

The blob-index is stored in the user's space along with the CAR file(s) containing the blob data. This is done in the [w3 add-blob operation](https://github.com/w3s-project/specs/blob/main/w3-blob.md#add-blob). The w3s service creates [location commitments](https://github.com/w3s-project/specs/blob/main/w3-blob.md#location-commitment) for the blob shards and for the blob-index.
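For reference, a rough sketch (in TypeScript, with assumed field names, not the actual IPLD encoding from w3-index.md) of the blob-index shape described above:

```ts
// Illustrative only: a rough shape mirroring the prose above,
// not the actual IPLD encoding defined in w3-index.md.
type Multihash = Uint8Array

interface SliceEntry {
  slice: Multihash   // multihash of a block within the shard
  offset: number     // byte offset of the block inside the shard
  length: number     // byte length of the block
}

interface BlobIndex {
  content: Multihash          // multihash of the blob (e.g. the DAG root)
  shards: Array<{
    shard: Multihash          // multihash of the CAR shard
    slices: SliceEntry[]      // only the blocks the user chose to index
  }>
}
```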
Contributor:

🤔 This got me thinking that perhaps we don't have to require a separate blob/add for the index, and could allow encoding it along with the content shard itself.

Contributor Author:

One reason to keep them separate is that the index describes all shards, and shards may be stored in separate locations. If only the 3rd shard is needed, then only the index and that shard need to be read.

It may be acceptable to encode the index as part of the first shard... but then how does the client know how much of the first shard to read? Would there be a fixed-length header?

Member:

Isn't it a recursive issue - you can't create an index for a CAR that includes the index...


The queryable attributes are as follows.
- Site: the location(s) where all blob data is available
- Shard: the location(s) where a specific shard of data is available
Contributor:

I'm not sure how this and the above differ

@gammazero (Contributor Author) commented Jun 4, 2024:

Site: The idea is that the query can specify "where site in [X]" so that the query response will only contain results that have something stored at a site in [X].

Shard: The idea is that a query can specify to retrieve results for requested shards, i.e. "where shard in [S]", so only results with locations storing the requested shard are returned.

I will try to clarify.
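To illustrate the filters described above, a hypothetical query shape might look like this (field names are assumptions, not part of the design):

```ts
// Hypothetical query shape for illustration only; field names are assumptions.
type Multihash = Uint8Array

interface IndexQuery {
  multihash: Multihash   // the content being looked up
  sites?: string[]       // "where site in [X]": only results stored at one of these sites
  shards?: Multihash[]   // "where shard in [S]": only results for these specific shards
}
```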

The queryable attributes are as follows.
- Site: the location(s) where all blob data is available
- Shard: the location(s) where a specific shard of data is available
- Slice: the location(s) within a shard where a block of data is available
Contributor:

I think the use of location, while technically correct, is confusing. It's probably best to keep "location" for the network address and use a different term when referencing a segment within a shard.

Suggested change
- Slice: the location(s) within a shard where a block of data is available
- Slice: the byte range within a shard containing block of data

Member:

yea location will naturally be overloaded, but since we use location commitment as part of this design let's try to avoid the word elsewhere.

1. Creating a set of IPNI queries needed to get the requested index content
2. Sending the queries to the w3up IPNI cache
3. Filtering returned results
4. Presenting results to the query client.
Contributor:

Not sure if this improves it, but I think something about aggregating results is worth capturing

Suggested change
4. Presenting results to the query client.
4. Packaging results for the query client.

Member:

+1


There are two types of queries sent to the cache. Each takes a data multihash, but returns a different result:
1. shard/slice query. Returns one or more: shard CID, slice CID, offset, length (in shard).
2. location query. Returns: location commitment of each shard.
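As a sketch only, the two query and result shapes described above could be typed roughly as follows (names are assumptions, not the actual API):

```ts
// Sketch only: assumed type names, not the actual API.
type Multihash = Uint8Array
type CID = string
type LocationCommitment = unknown   // signed claim of where the shard bytes live

type CacheQuery =
  | { kind: 'shard/slice'; multihash: Multihash }
  | { kind: 'location'; multihash: Multihash }

interface ShardSliceResult {
  shard: CID      // CID of the shard containing the queried multihash
  slice: CID      // CID of the slice (block) within that shard
  offset: number  // byte offset of the slice within the shard
  length: number  // byte length of the slice
}

interface LocationResult {
  commitments: LocationCommitment[]   // one location commitment per shard
}
```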
Contributor:

I'm not sure why shard and slice queries are grouped into one type and location into the other; I suspect there is some reason, which is probably worth outlining.

Contributor:

Separately, I think it would be nice if this layer was generic, i.e. agnostic to the kinds of query types. That is to say, it should not care which attribute (blob/shard, blob/slice, or commitment/location) is queried; it can simply derive the cache key and, on a cache miss, translate the query into an IPNI query without worrying about the semantics.

Maybe that is the intention, but reading this I was left with the impression that it's not generic over the attributes.

Contributor Author:

There is some need to know what kind of data is being retrieved, not for creating an IPNI query, but to know what to get out of the blob-index indicated in the results. With a shard/slice query, the resulting blob-index must be read to get the CID/offset/length of the shard or slice. With a location query, the resulting blob-index must be read to get the shard location commitments.

Can this be generic when asking for different things? This part probably requires more understanding of the types of answers that a client will ask for, so we can determine if there is a generic way to do this.

Member:

I'm not sure how smart this query/cache layer should be. I'd be more inclined to have it only allow querying, doing the joins and caching the result.

- Temporary. Evicts items that have been cached for longer than the cache expiration time, regardless of last access. This allows changes to query results to be seen.
- Negative (empty) result-aware. Caching empty results prevents more expensive IPNI queries for data that is not indexed.

This is a read-through cache, meaning that if it does not hold the requested query results, it forwards the query on to IPNI and caches the results. This includes caching an empty response.
Contributor:

I think caching an empty response can be problematic: we may not have a record on the first query, which would get cached and prevent us from discovering records on subsequent queries. Perhaps a short TTL on empty results, or some form of exponential backoff, could address this.

Member:

Hopefully addressed by read-on-write -- if we're capturing negative responses we need to write the cache when we publish IPNI.

Because otherwise @Gozala makes a very good point and we've seen this exact problem with IPNI and Cloudflare caching -- 404s getting cached for 10 minutes if you query too soon.

Contributor Author:

Having cache-on-write fixes the case where new data is stored subsequent to a query for that data. When the new data is stored, or location info is updated, that would remove any negative cache entry. Without cache-on-write, a short TTL is necessary. With cache-on-write, is there still a reason for a short TTL?

Negative cache entries should be kept in a separate cache so that a negative entry cannot evict a positive entry due to LRU, thereby removing the more valuable positive entries and allowing a misbehaving client to empty the cache.

Will clarify in doc.
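A minimal sketch of the behavior discussed in this thread, assuming a generic Store interface and illustrative TTL values: positive and negative results live in separate stores, negative entries get a short TTL, and publishing (cache-on-write) clears any stale negative entry.

```ts
// Sketch only: assumed Store interface and illustrative TTL values.
interface Store<V> {
  get(key: string): Promise<V | undefined>
  set(key: string, value: V, ttlSeconds: number): Promise<void>
  delete(key: string): Promise<void>
}

type Result = unknown
const POSITIVE_TTL = 24 * 60 * 60   // shard/slice data rarely changes, so a long TTL
const NEGATIVE_TTL = 60             // short TTL so a premature "not found" ages out quickly

// Read-through query: check the positive cache, then the negative cache,
// then fall back to IPNI and cache whichever kind of result comes back.
async function query(
  key: string,
  positive: Store<Result[]>,
  negative: Store<true>,
  queryIPNI: (key: string) => Promise<Result[]>
): Promise<Result[]> {
  const hit = await positive.get(key)
  if (hit) return hit
  if (await negative.get(key)) return []   // cached empty result
  const results = await queryIPNI(key)
  if (results.length === 0) {
    await negative.set(key, true, NEGATIVE_TTL)
  } else {
    await positive.set(key, results, POSITIVE_TTL)
  }
  return results
}

// Cache-on-write: when new data is published, drop any negative entry for its key.
async function onPublish(
  key: string,
  results: Result[],
  positive: Store<Result[]>,
  negative: Store<true>
): Promise<void> {
  await negative.delete(key)
  await positive.set(key, results, POSITIVE_TTL)
}
```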

If either shard or location results are not cached, then a query for the multihash is passed to the IPNI client.

#### Expiration Times
The shard/slice data does not change over the lifetime of the stored data, so this data can have a much longer expiration time. It could also have no expiration time and be explicitly removed from cache when the data is no longer stored, but that would require a communication path from data removal to all cache locations. So, it is better to just have an expiration time, making the cache more independent.
Contributor:

I'm not sure we need to evict information about data even if we don't store that data. It could be that the location of the data is not in our system (e.g. shared privately) while the index of it is shared publicly. That is to say, I don't think we need to worry about removing shard/slice info; instead, let the user perform manual eviction when they want to do so. Specifically, I imagine the user would delete the CAR holding an index, which could trigger eviction.

Member:

Curious about a side issue here -- what do we do with two people who upload the same blob but publish different indexes? Seems like this would be relevant. Probably addressed if we add the "account paying" query param that the egress billing design calls for -- it does make me wonder if we need to filter CID results not only on "are they W3UP" but also on "are they for the right person for this query?"


After selecting only w3up results, these must then be processed to generate responses for shard/slice and location query attributes. To do this, first the blob-index result is read from the IPNI result metadata. This gets the CID of the sharded-dag-index (aka blob-index) and the location portion of the result gives the base URL to retrieve the blob-index from.

The blob-index is retrieved and read. The blob-index data is searched to find the multihash, and to get the shard and slice (if the multihash is a slice within a shard).
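A rough sketch of the lookup flow described above; the helper names (parseIPNIMetadata, fetchBlobIndex, findSlice) are hypothetical stand-ins, not an existing API.

```ts
// Sketch of the lookup flow described above; helper names are hypothetical.
type Multihash = Uint8Array
type BlobIndex = unknown

interface IPNIResult { metadata: Uint8Array; provider: { addr: string } }
interface SliceInfo { shard: Multihash; offset: number; length: number }

declare function parseIPNIMetadata(r: IPNIResult): { blobIndexCID: string; baseURL: string }
declare function fetchBlobIndex(baseURL: string, cid: string): Promise<BlobIndex>
declare function findSlice(index: BlobIndex, target: Multihash): SliceInfo | undefined

async function resolveShardSlice(result: IPNIResult, target: Multihash): Promise<SliceInfo | undefined> {
  // The IPNI metadata names the CID of the sharded-dag-index (blob-index),
  // and the location part of the result gives the base URL to fetch it from.
  const { blobIndexCID, baseURL } = parseIPNIMetadata(result)
  // Retrieve and decode the blob-index, then search it for the target multihash
  // to recover the shard and, if the multihash is a slice, its offset/length.
  const blobIndex = await fetchBlobIndex(baseURL, blobIndexCID)
  return findSlice(blobIndex, target)
}
```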
Contributor:

This goes in a different direction than what I was imagining. Specifically, I thought that the IPNI publisher would derive and publish advertisements from the sharded-dag-index so that a client querying IPNI would not have to fetch the index. The idea was that the multihash would be combined with the attribute/relation being queried, allowing the W3Up IPNI cache to parallelize queries as opposed to having to block on index retrieval before it is able to pull out details.

Part of this was also motivated by a desire to make things generic so new kinds of indexes could be introduced by end users, or even us, without affecting any of the layers of this system.

Contributor:

Perhaps a hybrid model could offer the best of both worlds? E.g. we could use the index identifier as a namespace and expect that all attributes under that namespace would be aggregated in the same index record. It would offer a little less flexibility than I was aiming for, but it could reduce the number of records published to IPNI. But then again, I'm not sure if optimizing for a reduced number of IPNI records at the expense of query flexibility is the right tradeoff.

Member:

I think this is not how IPNI is supposed to be used -- you can't throw the whole sharded dag index in the metadata, and the metadata is constant across all the CIDs in the index. But I think it presents a question about whether the "IPNI cache" is just an IPNI cache or if we should actually store cached blob-indexes as well. I think we should, and I THINK that's what the design doc is saying.

Member:

I am not worried about the extra round trip on "cache miss" -- if we cache miss, the hot path is already over. (And to be clear, this is like a double cache miss -- we cache-missed the whole retrieval being served from Cloudflare cache, then when we set up the index query we cache-missed the read of index data.) Again, this should be infrequent content, if we outgrow the cache at all (there's a scenario where the cache, at least for the IPNI query part as opposed to the blob-index part, could be never-evict).

@hannahhoward (Member) left a comment:

Overall, excellent. This is a really good design and I think we're close.

I have comments about queries I'll put in the query PR.

However, the big thing that needs clarification is the caching behavior for blob-index/sharded-dag-indexes -- to me, I think we need to cache these for super high value content, so that the fast path for a completely cached index query never has a read to an external storage provider (remember that eventually our storage nodes could have higher latency from the gateway). I'm open to debate here but it seems implied in various places.

I think this is basically ready to start implementing, but let's keep going a little more.




### W3up IPNI Cache

The w3up IPNI Cache is a cache that holds IPNI query results. The cache is:
Member:

So this is my most substantive comment:

Is this an IPNI cache? Or is it an IPNI + sharded-dag-index cache?

As I understand it, IPNI results will ONLY contain a location to retrieve a Sharded Dag Index.

You'll need to read that sharded-dag-index / blob-index separately (sidebar: I'm finding the distinction very difficult to understand -- can we have a terms section?).

So are we caching that? I think the answer is yes, because otherwise every query contains an extra round trip. I imagine these sharded-dag-indexes are evicted more often than the IPNI queries, since they're going to be larger. But I think to support a shard/slice query without blocking on an additional round trip, we should be caching these.

And if we're caching indexes, I think we should be explicit and not call this an IPNI cache.

Contributor Author:

Yes, this is an IPNI + sharded-dag-index cache. I need to make that clear in the document.

gammazero and others added 4 commits June 3, 2024 21:16
@hannahhoward (Member):

OK, now I read @Gozala's PR and I get his confusion and how it relates to my general question about caching the blob-indexes, which would not be needed if we implement what he suggests. It took me a couple of reads to understand what I believe he's suggesting, which may actually be a good idea. We'll see!

I set up a quick sync to get us on the same page.

@hannahhoward (Member) commented Jun 5, 2024:

@alanshaw can you take a look at this when you have a chance?

@gammazero after Alan's review I'd like to promote this to the system architecture repo. We can continue to evolve as we go.

I think it would be good to capture some of the various outcomes of today's discussions as well. (Written summary below.)


  1. The cache will translate both IPNI requests and claims bundle (i.e. sharded dag index) into a format optimized for fast querying of lots of data and supporting arbitrary relationships between multihashes. Likely some kind of EAV store (see the sketch after this list).
    1. There’ll be a ticket to design this data model, and maybe another to start prototyping.
  2. The claims bundle / sharded dag index effectively captures 2 types of relationships:
    1. blob/shard (how is the blob broken up into multiple shards)
    2. blob/slice (what are the offsets of individual blocks or groups of blocks in the shard)
    3. in the future, it’s possible there are other types of relationships we will want to store. We should be able to add additional relationships and potentially publish them to IPNI without redesigning the system
  3. For IPNI:
    1. You can query by a shard multihash to get the location commitments for that shard. Each location commitment for each shard will have a contextID as they update independently.
    2. You can query by any multihash in the DAG to get the address of the claims bundle for the DAG. These will all be updated under a single context ID. Importantly, each relationship will have a separate metadata protocol. There are a number of ways to do this on a single context ID:
      1. Actually, rereading spec and https://github.com/ipni/go-libipni/blob/main/metadata/metadata.go, a single “Metadata” as specified by IPNI is MULTIPLE metadata protocols — so Andrew I believe this can be accomplished by just including multiple metadata protocols in the metadata for the main advertisement for the context ID. Metadata can also be updated in a future advertisement, so you could add future relations this way.
      2. That said we could also use extended providers here and maybe it would be useful to do so. Mostly if different bundles start ending up at different publishers
      3. Anyway, there’s going to be a ticket for laser focus on the metadata and IPNI ad structures so we can figure it out then.
    3. Worth noting we still need to keep our existing advertisement chain going, because this is what ipfs.io and other IPFS clients know how to use for now.
  4. For the Index query:
    1. I’m ok with a query interface instead of a REST API for the long term. I get the rationale. I honestly think the first version should be a REST API for the one query join we actually want. We can certainly build the query interface right away, but I would not build a generalized query parser and implementation. Instead, I would just pattern-match the queries against the specific joins we want to do for the moment and error on everything other than the thing we need. Again, for the moment.
    2. There’ll be a prototype ticket for this service.
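As an illustration of item 1 (the EAV-style store), the relationships from item 2 might land as rows like these; attribute names and value encodings here are assumptions, not a settled schema:

```ts
// Illustrative EAV rows; attribute names and value encodings are assumptions.
interface EAVRow {
  entity: string      // multihash of the subject (a block, blob, or shard)
  attribute: string   // relationship name, e.g. 'blob/shard' or 'blob/slice'
  value: string       // encoded object: a shard multihash, a byte range, a location, ...
}

const exampleRows: EAVRow[] = [
  // a block belongs to a shard
  { entity: 'mh(block-A)', attribute: 'blob/shard', value: 'mh(shard-1)' },
  // the same block sits at offset 512, length 1024 within that shard
  { entity: 'mh(block-A)', attribute: 'blob/slice', value: 'mh(shard-1)@512+1024' },
  // the shard has a location commitment
  { entity: 'mh(shard-1)', attribute: 'commitment/location', value: 'https://storage.example/shard-1' },
]
```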





gammazero and others added 4 commits June 11, 2024 10:04
@gammazero requested review from alanshaw and Gozala on June 12, 2024

After the add-blob process completes, and the user receives an [accept-blob receipt](https://github.com/w3s-project/specs/blob/main/w3-blob.md#accept-blob-receipt), the user may then choose to make their data publicly queryable by publishing it to W3UP's indexing system. The user optionally invokes an [index-add](https://github.com/w3s-project/specs/blob/main/w3-index.md#index-add) capability to publish the sharded-dag-index multihashes into W3UP's indexing system and eventually to [IPNI](https://github.com/ipni/specs/blob/main/IPNI.md) so that they can be used to look up location commitments and retrieve sharded-dag-index information and blob data. See [W3 Index](https://github.com/w3s-project/specs/blob/main/w3-index.md#w3-index) for more.

After publishing an index, any user looking for content can query for the sharded-dag-index of that content. The user can read the sharded-dag-index to determine what data shard needs to be retrieved, and then asks for the location of that shard.
Member:

Suggested change
After publishing an index, any user looking for content can query for the sharded-dag-index of that content. The user can read the sharded-dag-index to determine what data shard needs to be retrieved, and then asks for the location of that shard.
After publishing an index, any user looking for content can query for the sharded-dag-index of that content. The user can read the sharded-dag-index to determine which blob(s) data needs to be read from, and then asks for the location of those blob(s).

gammazero and others added 4 commits June 17, 2024 14:48