Blockchains, at heart, perform two main functions: they make sure transactions are processed correctly, and they ensure that the data produced by that processing is stored securely and remains available for future transaction processing.
This post is about the different ways blockchains can store their data, the guarantees they offer, and the trade-offs involved.
The first section covers data as "part of consensus", where data gets no special treatment; this is the approach of many simple blockchains such as Bitcoin, Ethereum (pre-4844), and Solana. The next few sections cover more complex data storage schemes, such as data availability sampling and sharding, that are used to decrease storage costs and increase TPS.
When Bitcoin was released, data availability was not a topic of discussion. Data was expected to be stored by every node taking part in consensus, since a node that omitted any blockchain data risked failing to create the next block and losing its mining or staking rewards.
If a Bitcoin node discards a transaction with an unspent balance, it won't be able to process future transactions that spend that balance. That puts the node at risk of accepting an invalid block containing a spend it can't verify, and if it builds on that block it will lose its mining rewards when it's that node's turn to create a new block. Similarly, if an Ethereum node omits data from a smart contract, it won't be able to verify blocks that change that contract and will risk losing its staking rewards (and part of its staked ETH as well).
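To make this concrete, here is a minimal sketch of the problem using a simplified, hypothetical UTXO model (not Bitcoin's actual data structures): a node that drops an output from its set can no longer tell a valid spend from an invalid one.

```python
# Minimal sketch of why discarding data breaks validation.
# Hypothetical UTXO set: maps (tx_id, output_index) -> amount.
utxo_set = {
    ("tx_abc", 0): 5_000,
    ("tx_def", 1): 2_500,
}

def validate_spend(tx_id: str, index: int, amount: int) -> bool:
    """A spend is valid only if the referenced output is known and large enough."""
    key = (tx_id, index)
    if key not in utxo_set:
        # The node can't distinguish "never existed" from "I discarded it",
        # so it must reject -- or risk building on an invalid block.
        return False
    return utxo_set[key] >= amount

# The node "saves space" by discarding an output it stored earlier...
del utxo_set[("tx_abc", 0)]

# ...and now a perfectly valid spend of that output fails validation.
print(validate_spend("tx_abc", 0, 1_000))  # False
```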
Storage deals are contracts made with specific storage providers to store blockchain data. In these marketplaces, providers must submit proofs every few blocks that they really hold the data, and those proofs build up their reputation. As a secondary measure, providers have to put up collateral that is slashed if they fail to provide a proof within the required time. Users can then pick a provider to store their data, or several providers for better redundancy and higher assurance that at least one of them stays online.
The best-known implementation of this approach is probably Filecoin storage deals. In the Filecoin network, data proofs, called "proofs of spacetime", are required every 30 minutes, and deals have a duration of 180 to 540 days. Note that although Filecoin storage deals can be used to build layer 2s, the Filecoin network itself has a conventional BFT consensus.
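Here is a toy sketch of the deal lifecycle described above; the names, intervals, and amounts are illustrative, not Filecoin's actual parameters:

```python
from dataclasses import dataclass

@dataclass
class StorageDeal:
    provider: str
    collateral: int          # tokens locked by the provider
    proof_interval: int      # blocks allowed between proofs
    last_proof_block: int = 0
    slashed: bool = False
    reputation: int = 0

    def submit_proof(self, block: int, proof_ok: bool) -> None:
        """Called when the provider submits a storage proof."""
        if proof_ok:
            self.last_proof_block = block
            self.reputation += 1  # successful proofs build reputation

    def check_deadline(self, block: int) -> None:
        """Slash the collateral if the proof window was missed."""
        if block - self.last_proof_block > self.proof_interval and not self.slashed:
            self.slashed = True
            self.collateral = 0  # collateral forfeited

# A user hedges by making deals with two providers for redundancy.
deals = [StorageDeal("provider_a", collateral=100, proof_interval=10),
         StorageDeal("provider_b", collateral=100, proof_interval=10)]

deals[0].submit_proof(block=9, proof_ok=True)  # provider_a proves in time
for deal in deals:
    deal.check_deadline(block=12)

print([(d.provider, d.slashed, d.reputation) for d in deals])
# [('provider_a', False, 1), ('provider_b', True, 0)]
```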
Less data replication makes these blockchains cheaper to store transaction data on and more scalable, since fewer nodes need to stay synced. On the other hand, less replication means less security: there's a greater chance something goes wrong and all of the chosen providers go offline. A few projects on the Filecoin network are experimenting with these designs, but they are still in early development.
Data availability sampling (DAS) is a method of safeguarding data in a blockchain where nodes are required to provide proofs for random pieces of the data, called samples, every few blocks.
This is extremely useful since the bottleneck in most blockchains today is loading smart contract data from disk (as we discussed in the "Part of consensus" section). Since DAS blockchains don't execute the smart contracts themselves, they don't have that bottleneck.
Data sampling blockchains can focus solely on storing data and act as a hub for other blockchains that perform smart contract execution. If there were just one layer 2 blockchain using the data sampling layer 1, it would have the same scalability problem, but there can be many layer 2s per layer 1, and hence overall transaction capacity increases by a lot.
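A toy sketch of the sampling idea follows; it omits the erasure coding and cryptographic proofs that real DAS designs rely on, and the chunk counts are illustrative:

```python
import random

CHUNKS_PER_BLOCK = 256

def publish_block(data: bytes) -> list[bytes]:
    """Split block data into fixed-size chunks that nodes can sample."""
    size = max(1, len(data) // CHUNKS_PER_BLOCK)
    return [data[i:i + size] for i in range(0, len(data), size)]

def sample_availability(chunks: list, samples: int = 20) -> bool:
    """A light node asks for a few random chunks instead of the whole block.
    If any sampled chunk is missing, the block is treated as unavailable."""
    for _ in range(samples):
        i = random.randrange(len(chunks))
        if chunks[i] is None:  # the network failed to serve this chunk
            return False
    return True

chunks = publish_block(b"block data " * 1000)
print(sample_availability(chunks))  # True: everything is served

# A withholding attacker hides half the chunks...
for i in range(0, len(chunks), 2):
    chunks[i] = None

# ...and is caught with probability ~1 - 0.5**20 using just 20 samples.
print(sample_availability(chunks))  # almost certainly False
```

The appeal is that each light node does a constant amount of work per block, yet withholding even a fraction of the data is detected with overwhelming probability.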
Slashable data sampling blockchains are systems that use slashing along with data sampling to ensure nodes act honestly. Slashing requires nodes to put up collateral before joining the network, which is taken away from them if they don't follow the rules. Following the rules usually means submitting storage proofs in time.
A dedicated data sampling layer is part of the Ethereum roadmap, starting with EIP-4844, and it is the main function of Celestia and Polygon's Avail.
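A sketch of how slashing and sampling might combine, with illustrative stake sizes and slash fractions rather than any real protocol's parameters:

```python
import random

class Validator:
    def __init__(self, name: str, stake: int, stored_chunks: set):
        self.name = name
        self.stake = stake
        self.stored_chunks = stored_chunks  # chunk indices this node holds

    def respond_to_challenge(self, chunk_index: int) -> bool:
        """The node can only prove chunks it actually stored."""
        return chunk_index in self.stored_chunks

def run_challenge_round(validators, total_chunks=1024, slash_fraction=0.5):
    """Each round, every validator is challenged on one random chunk.
    Failing a challenge burns part of its collateral."""
    for v in validators:
        challenge = random.randrange(total_chunks)
        if not v.respond_to_challenge(challenge):
            v.stake -= int(v.stake * slash_fraction)

honest = Validator("honest", stake=32, stored_chunks=set(range(1024)))
lazy = Validator("lazy", stake=32, stored_chunks=set(range(512)))  # keeps half

for _ in range(10):
    run_challenge_round([honest, lazy])

# Honest keeps its full stake; lazy fails a challenge with
# probability 1 - 0.5**10 over ten rounds and gets slashed.
print(honest.stake, lazy.stake)
```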
Optimistic data sampling is a similar approach to the data sampling described above. The main difference from slashable data sampling is that providing proofs that the data is stored is incentivized with mining rewards, but there are no negative incentives. In other words, there is no enforcement or penalty if a node fails to provide these proofs.
An example of a data sampling blockchain without slashing is Arweave. Arweave requires the selected block producer to provide a DAS proof for a predetermined older block (called the recall block). If the block producer provides the proof, they receive the block rewards; if they don't, they are not eligible for the rewards, but nothing further happens to them, and the block is created by the next block producer who can prove the recall block's data.
Note that optimistic data sampling systems have no collateral and hence can't use proof of stake for consensus. That means they have to employ a different mechanism to pick block producers. Arweave, for example, uses a proof-of-work design similar to Bitcoin's alongside its data sampling system.
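A sketch of the recall-block flow; the selection rule below is a stand-in for illustration, not Arweave's actual formula:

```python
import hashlib

chain = [b"block_0", b"block_1", b"block_2", b"block_3"]  # stored block data

def recall_index(height: int, prev_hash: bytes) -> int:
    """Pick a deterministic older block the producer must prove.
    (Stand-in rule: real protocols derive this from chain randomness.)"""
    return int.from_bytes(hashlib.sha256(prev_hash).digest(), "big") % height

def try_produce_block(height: int, prev_hash: bytes, stored: dict):
    """Return a block with rewards only if the producer can serve the
    recall block. No penalty on failure: the turn simply passes on."""
    idx = recall_index(height, prev_hash)
    if idx not in stored:
        return None  # not eligible for rewards, but nothing is slashed
    proof = hashlib.sha256(stored[idx]).hexdigest()
    return {"height": height, "recall_proof": proof, "reward": 10}

full_node = {i: block for i, block in enumerate(chain)}
pruned_node: dict = {}  # discarded all old block data

print(try_produce_block(4, b"prev", full_node))    # wins the block reward
print(try_produce_block(4, b"prev", pruned_node))  # None: next producer's turn
```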
Data sharding works by splitting the blockchain data across "committees" of nodes that each handle only their assigned portion of the data.
The Internet Computer uses data sharding to split the network into committees of 13 nodes. Each committee is assigned specific smart contracts (called canisters) and is responsible for handling their transactions. When a committee becomes saturated by smart contract processing volume, the system reassigns some of its smart contracts to other committees to balance the load.
Note that this is different from the execution sharding that systems like the NEAR protocol use. Execution sharding splits transaction processing across committees while keeping a unified state. Data sharding splits both transaction processing and data storage across committees. It is a more extreme form of sharding and makes a bigger trade-off between performance and data resilience guarantees.
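A toy sketch of load-based reassignment across committees; the saturation threshold and load units are illustrative, not the Internet Computer's actual scheduler:

```python
SATURATION_LIMIT = 100  # max load units a committee handles per round

class Committee:
    def __init__(self, name: str):
        self.name = name
        self.contracts: dict = {}  # contract id -> current load

    @property
    def load(self) -> int:
        return sum(self.contracts.values())

def rebalance(committees: list) -> None:
    """Move contracts off saturated committees onto the least-loaded one."""
    for c in committees:
        while c.load > SATURATION_LIMIT and len(c.contracts) > 1:
            target = min(committees, key=lambda x: x.load)
            if target is c:
                break  # nowhere less loaded to move work to
            # Move the single heaviest contract off the saturated committee.
            contract, load = max(c.contracts.items(), key=lambda kv: kv[1])
            del c.contracts[contract]
            target.contracts[contract] = load

a, b = Committee("a"), Committee("b")
a.contracts = {"canister_1": 90, "canister_2": 40}  # a is saturated (130)
rebalance([a, b])
print(a.load, b.load)  # 40 90: canister_1 moved to the idle committee
```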
State expiry is a way to store non-permanent data on a blockchain. Storing data temporarily means that the blockchain's size grows at a slower pace, and hence nodes will be able to keep storing new data over time.
Proto-danksharding, or EIP-4844, will employ a form of state expiry where nodes store data for a fixed period of time. More specifically, nodes will be required to provide proofs for the data during the first month after it is published; after that, the data is no longer secured by Ethereum consensus but is left to secondary services, like Etherscan or Infura, that maintain Ethereum archive nodes.
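A sketch of the retention rule: the one-month window comes from the description above, while the block timing and helper names are illustrative:

```python
BLOCKS_PER_MONTH = 30 * 24 * 60 * 60 // 12  # ~216,000 blocks at 12s each

blobs: dict = {}  # block number -> blob data held by this node

def store_blob(block_number: int, data: bytes) -> None:
    blobs[block_number] = data

def must_prove(block_number: int, current_block: int) -> bool:
    """Consensus only requires proofs inside the retention window."""
    return current_block - block_number <= BLOCKS_PER_MONTH

def prune_expired(current_block: int) -> None:
    """After the window, nodes may drop the data; archive services keep it."""
    for number in list(blobs):
        if not must_prove(number, current_block):
            del blobs[number]

store_blob(1_000, b"rollup batch")
print(must_prove(1_000, current_block=1_100))              # True: still secured
prune_expired(current_block=1_000 + BLOCKS_PER_MONTH + 1)
print(1_000 in blobs)                                      # False: expired
```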
Up until recently, the data layer was responsible solely for storing the layer 1 data, but there are now more sophisticated designs that aim to alleviate the scalability problems blockchains face today.
Data as part of consensus is what "simple" blockchains use, but it can't scale. Data sharding splits data into committees to divide up the work, and data availability sampling separates data storage from execution, with storage proofs showing that the data is really stored on the nodes. Storage deals are marketplaces of storage providers that can be selected for more customized data guarantees, and state expiry stores blockchain data only temporarily to suit more ephemeral applications.
In a future post I'll talk about execution and the varieties that are often seen in the wild.