-
Yeah, the more nodes the more security you gain overall, and adding heavy ECC on top also provides very strong guarantees. The flipside is, well... you need more nodes per dataset, which might not always be possible for less popular datasets. I don't think there is anything in Dagger's design that conflicts with heavily distributing the dataset. I see the contract as a way of relating a set of chunks to how long these chunks need to be kept by the network. The fact that the chunks can be logically related is secondary and isn't a hard requirement at all.

A general side note as to why we need proofs at all: we don't need them to elevate the overall security of the dataset (though they do indirectly contribute to it); proofs are there to allow traceability, i.e. rewarding and punishing network participants, which lets us reliably implement/deploy incentives. Redundancy is what gives you durability, and proofs and durability are related only to the extent that proofs help us detect that redundancy of some dataset (or specific chunk) has decreased and punish the node that allowed the dataset/chunk to go amiss. But, as you noted, enough redundancy already gives us enough probabilistic security that the data won't ever be lost.
I think this can be mitigated, maybe we can even aggregate proofs from different datasets. But yes, definitely a big tradeoff.
It depends on what you mean by plain replicas; we might have many duplicate chunks in the network, both under the same or different contracts, as well as by way of ephemeral/opportunistic caching. At any rate, if we use systematic ECC it means that we're expanding the dataset with additional chunks, but it doesn't fundamentally change the structure of the dataset, otherwise it wouldn't be systematic. This is why systematic codes are generally preferred: decoding (recovery) is usually quite costly, and this is also the reason why plain copies in addition to ECC have an advantage over pure ECC. They allow recovering the dataset without having to perform any sort of decoding, only resorting to decoding when enough plaintext pieces have been lost. Keep in mind that certain ECC does allow for some level of local recovery, but as it happens these two aspects, recovery and redundancy, are orthogonal.
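To make the "systematic" point concrete, here's a minimal Python sketch (not Dagger's actual codec; a single XOR parity chunk stands in for real Reed-Solomon parity): the original chunks are stored verbatim and parity is only appended, so reads never touch the decoder, and decoding is only invoked once a plain chunk goes missing.

```python
# Toy sketch of the "systematic" property: original chunks are kept unchanged
# and parity is appended, so reads never require decoding. A single XOR parity
# chunk stands in for Reed-Solomon parity and tolerates the loss of one chunk.

CHUNK_SIZE = 4  # tiny chunks for illustration

def split(data: bytes, size: int) -> list[bytes]:
    return [data[i:i + size].ljust(size, b"\0") for i in range(0, len(data), size)]

def xor_parity(chunks: list[bytes]) -> bytes:
    parity = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            parity[i] ^= b
    return bytes(parity)

data_chunks = split(b"hello systematic coding!", CHUNK_SIZE)
encoded = data_chunks + [xor_parity(data_chunks)]   # data kept verbatim, parity appended

# Fast path: the dataset is readable directly from the plain chunks, no decoding.
assert b"".join(encoded[:len(data_chunks)]).rstrip(b"\0") == b"hello systematic coding!"

# Slow path: if one plain chunk is lost, reconstruct it from the survivors + parity.
lost = 2
survivors = [c for i, c in enumerate(encoded) if i != lost]
assert xor_parity(survivors) == data_chunks[lost]
```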
Yep, see above
Also a good point
Yeah, this is a concern as well, and this is why we might prefer smaller blocks, say 64KB, which is what Reed-Solomon over GF(2^16) allows for.
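A quick back-of-envelope sketch of why the field size matters (assumed numbers, not a spec): if every block contributes one 16-bit symbol to each codeword, the total number of blocks per codeword is capped at 2^16 - 1, which is what makes 64KB blocks workable even for gigabyte-scale datasets.

```python
# Back-of-envelope check (illustrative, not Dagger code).
FIELD_BITS = 16
MAX_SYMBOLS = 2**FIELD_BITS - 1          # 65535: max blocks (data + parity) per codeword
BLOCK_SIZE = 64 * 1024                   # 64 KB blocks

dataset_size = 1 * 1024**3               # hypothetical 1 GB dataset
data_blocks = dataset_size // BLOCK_SIZE # 16384 data blocks
parity_blocks = data_blocks              # assume 2x expansion (rate 1/2)

assert data_blocks + parity_blocks <= MAX_SYMBOLS   # 32768 <= 65535, fits
print(f"{data_blocks} data + {parity_blocks} parity blocks, limit {MAX_SYMBOLS}")

# Over GF(2^8) the cap would be 255 blocks total, which for the same 1 GB
# dataset would force blocks of several megabytes, hence the larger field.
```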
Generally, I'd say that this would complicate the overall design/implementation, but might be worth looking into.
Yeah, I guess we need to understand what "a lot of nodes" means. It's obvious that the more nodes in the network and the more the dataset is distributed across these nodes, the more secure the dataset (given sufficient ECC). So is 20 nodes a lot or a little?
Yep, also a good point
Yeah, this is an interesting idea in general; having nodes aggregate proofs locally for all the datasets they have might not be a bad idea. The problem is not necessarily keys per dataset, but rather the fact that proofs for each dataset might have to be produced at different times. Still, this is worth considering. Another way of looking at this would be to apply ZK proofs over all the CPORs generated locally, but at that point we might just use ZK proofs directly and be done with it. Certainly something we can look into; the only reason we haven't is time.
The problem is that, no matter how you look at it, verifiers need to be staked to be able to verify, otherwise you have the nothing-at-stake problem. Now, if storing nodes that are also validators lose stake equally for missed verifications as well as for missed proofs, it could work, but I still see a possibility of collusion, since I assume that nodes on the same contract know who the other nodes on that contract are. So you inherently lose the pseudo-anonymity of the verifiers, which I think is very important to be able to guarantee that the data is still accessible and not being withheld. In general, I don't see any advantage in having nodes do cross-verification as opposed to having dedicated verifiers. Great comments overall!!
-
The more I think about it, the more I like the idea of heavily distributing a dataset.
e.g.: I want to store a 1 GB file, I'll ECC it up to 2 GB, and distribute 100 MB to each of 20 nodes (numbers spelled out in the sketch below the lists).
Cons:
Pros:
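Spelling out the numbers in the example above (assumed parameters: a rate-1/2 MDS erasure code, one share per node, decimal GB):

```python
# Working through the 1 GB -> 2 GB -> 20 nodes example (assumed parameters).
DATA_SIZE = 1 * 10**9          # 1 GB original dataset
EXPANSION = 2                  # rate-1/2 code: parity equals data
NODES = 20

encoded_size = DATA_SIZE * EXPANSION
per_node = encoded_size // NODES                 # 100 MB stored on each node
nodes_needed = NODES // EXPANSION                # any 10 nodes suffice (MDS assumption)
tolerated_failures = NODES - nodes_needed        # up to 10 nodes can vanish

print(f"per node: {per_node / 10**6:.0f} MB, "
      f"recover from any {nodes_needed} of {NODES} nodes, "
      f"tolerating {tolerated_failures} node failures")
```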
Regarding the number of proofs, my understanding is that we could combine multiple dataset proofs as long as the datasets have the same private key & block size.
So, instead of having "1 proof / storage node / contract", we could have "1 proof / storage node / client". If we build the clients to have affinity for storage nodes which already store some of their data, that could reduce the number of proofs significantly. (100 trillion datasets in S3, but I'm pretty sure there are far fewer than 100 trillion AWS customers. Can't find any numbers, but it should be in the millions at most.)
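To illustrate why the "same private key & block size" condition buys combinability, here's a toy in the spirit of a private-key homomorphic-tag PoR (heavily simplified, not Dagger's actual CPOR construction; all names and parameters are made up): blocks from different datasets tagged under one key can be challenged together and checked with a single verification equation.

```python
# Toy illustration: blocks get a linearly homomorphic tag
#   tag_i = alpha * m_i + prf(key, id_i)   (mod p)
# so a random linear combination of blocks and tags from *any* datasets
# tagged under the same key verifies with one equation.
import hashlib, secrets

P = 2**127 - 1                      # toy prime modulus

def prf(key: bytes, index: bytes) -> int:
    return int.from_bytes(hashlib.sha256(key + index).digest(), "big") % P

def tag_blocks(key: bytes, alpha: int, dataset_id: str, blocks: list[int]) -> list[int]:
    return [(alpha * m + prf(key, f"{dataset_id}:{i}".encode())) % P
            for i, m in enumerate(blocks)]

def respond(challenge, blocks_by_ds, tags_by_ds):
    # Storage node: one response covering blocks drawn from several datasets.
    mu = sigma = 0
    for (ds, i), nu in challenge:
        mu = (mu + nu * blocks_by_ds[ds][i]) % P
        sigma = (sigma + nu * tags_by_ds[ds][i]) % P
    return mu, sigma

def verify(key, alpha, challenge, mu, sigma):
    # Verifier: a single check, regardless of how many datasets were challenged.
    expected = (alpha * mu + sum(nu * prf(key, f"{ds}:{i}".encode())
                                 for (ds, i), nu in challenge)) % P
    return sigma == expected

key, alpha = secrets.token_bytes(32), secrets.randbelow(P)
blocks = {"dsA": [11, 22, 33], "dsB": [44, 55]}
tags = {ds: tag_blocks(key, alpha, ds, bs) for ds, bs in blocks.items()}

challenge = [(("dsA", 0), secrets.randbelow(P)), (("dsA", 2), secrets.randbelow(P)),
             (("dsB", 1), secrets.randbelow(P))]
mu, sigma = respond(challenge, blocks, tags)
assert verify(key, alpha, challenge, mu, sigma)
```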
Other idea: if you have enough people in a single contract, they can start to check each other (if they are not under the same governance, ofc). We could keep public verifiability, but make each participant of a contract an "aggregator". That would lead to better scalability, since each contract is "self-sufficient".
Could also open other interesting possibilities, but I've run out of characters for this message, so LMK what you think 🙂
by @Menduist