feat: Adds a streaming snapshot parser #259
Conversation
- Add streaming_snapshot.rs with callback-based parser
- Add parser.rs with manifest validation
- Add hash.rs and pool_params.rs for snapshot types
- Move Account/AccountDRep types into streaming_snapshot.rs (internal only)
- Add AccountState to stake_addresses.rs for external API
- Add sha2 dependency for integrity checking
- Include test fixtures for validation
…enerates a snapshot manifest file
Pull Request Overview
Adds a callback-based streaming snapshot parser for Cardano Conway-era snapshots, plus manifest tooling, fixtures, and documentation.
- Introduces a streaming parser that navigates NewEpochState and invokes callbacks for UTXOs, pools, accounts, DReps, and proposals
- Adds manifest parsing and integrity validation utilities and example/Makefile for running the parser against large fixtures
- Documents Amaru snapshot structure and the parser’s architecture and usage
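The callback-driven flow described above can be sketched roughly as follows. All names here (`SnapshotCallbacks`, `UtxoEntry`, `UtxoCounter`, `drive`) are illustrative stand-ins, not the PR's actual identifiers:

```rust
// Illustrative sketch of a callback-based streaming parser API.
// Type and trait names are hypothetical, not the PR's actual identifiers.

/// A single UTXO entry surfaced by the parser.
pub struct UtxoEntry {
    pub tx_hash: Vec<u8>,
    pub output_index: u64,
    pub value: u64,
}

/// Callbacks invoked as each section of the snapshot is parsed.
pub trait SnapshotCallbacks {
    fn on_utxo(&mut self, utxo: UtxoEntry);
    fn on_complete(&mut self);
}

/// Counts UTXOs and sums their value without storing them.
pub struct UtxoCounter {
    pub count: u64,
    pub total_value: u64,
}

impl SnapshotCallbacks for UtxoCounter {
    fn on_utxo(&mut self, utxo: UtxoEntry) {
        self.count += 1;
        self.total_value += utxo.value;
    }
    fn on_complete(&mut self) {}
}

/// Drives the callbacks over an in-memory list standing in for a parsed stream.
pub fn drive(callbacks: &mut impl SnapshotCallbacks, entries: Vec<UtxoEntry>) {
    for e in entries {
        callbacks.on_utxo(e);
    }
    callbacks.on_complete();
}

fn main() {
    let mut counter = UtxoCounter { count: 0, total_value: 0 };
    let entries = vec![
        UtxoEntry { tx_hash: vec![0u8; 32], output_index: 0, value: 100 },
        UtxoEntry { tx_hash: vec![1u8; 32], output_index: 1, value: 250 },
    ];
    drive(&mut counter, entries);
    assert_eq!(counter.count, 2);
    assert_eq!(counter.total_value, 350);
}
```

The point of the design is that consumers never hold the whole UTXO set; they observe entries one at a time and aggregate whatever they need.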
Reviewed Changes
Copilot reviewed 19 out of 20 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| common/src/snapshot/streaming_snapshot.rs | Core streaming parser, data types, and callback traits; includes UTXO streaming, pools/accounts/DReps extraction |
| common/src/snapshot/pool_params.rs | Pool parameter CBOR encode/decode types used by the streaming parser |
| common/src/snapshot/parser.rs | Manifest parsing and integrity validation utilities |
| common/src/snapshot/mod.rs | Module wiring and re-exports |
| common/src/hash.rs | Generic hash utilities and minicbor (de)serialization helpers used by parser types |
| common/examples/test_streaming_parser.rs | Example harness to run the streaming parser and print summaries |
| scripts/generate_manifest.py | Standalone manifest generation for snapshots (filename/header-based), streaming SHA256 |
| docs/streaming-snapshot-parser.md | Parser architecture, callbacks, features, and usage |
| docs/amaru-snapshot-structure.md | Detailed description of Amaru snapshot (NewEpochState) structure |
| tests/fixtures/* | LFS pointer files and sample manifests for testing |
| tests/fixtures/.gitattributes | LFS settings for large fixture |
| common/Cargo.toml | Adds sha2 dependency used by integrity validation |
| common/src/lib.rs | Exposes new snapshot and hash modules |
| Makefile | Convenience targets to run parser example and generate manifests |
```rust
let tx_hash_bytes = match decoder.bytes() {
    Ok(b) => b,
    Err(_e) => {
        errors += 1;
        decoder.skip().ok(); // skip remaining TxIn fields and value
        continue;
    }
};

let output_index = match decoder.u64() {
    Ok(idx) => idx,
    Err(_e) => {
        errors += 1;
        decoder.skip().ok(); // skip value
        continue;
    }
};
```
Copilot AI, Oct 17, 2025
On key-parse errors, only one skip() is issued, which leaves the map value (TxOut) unconsumed and desynchronizes the decoder. When tx_hash decoding fails, you need to skip the remaining element(s) of the TxIn array and then also skip the corresponding TxOut value; similarly, when output_index fails, you must skip the TxOut value as well. Example fix: call skip() for the missing array element and then an additional skip() for the value.
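The resynchronization logic can be sketched with a toy token decoder in place of minicbor (the `Token`/`MockDecoder` types are invented for illustration; real CBOR skipping is element-aware, but the skip-counting principle is the same):

```rust
// Each map entry is a (TxIn key = [tx_hash, output_index], TxOut value) pair,
// flattened into a token stream. On a key-parse failure, BOTH the remaining
// key element(s) and the value must be skipped to stay aligned.
// These types are illustrative stand-ins, not the minicbor API.

enum Token {
    Bytes(Vec<u8>), // tx_hash
    U64(u64),       // output_index
    Value(u64),     // TxOut (stand-in)
    Bad,            // a malformed element
}

struct MockDecoder {
    tokens: Vec<Token>,
    pos: usize,
}

impl MockDecoder {
    fn bytes(&mut self) -> Result<Vec<u8>, ()> {
        match self.tokens.get(self.pos) {
            Some(Token::Bytes(b)) => { self.pos += 1; Ok(b.clone()) }
            _ => Err(()),
        }
    }
    fn u64(&mut self) -> Result<u64, ()> {
        match self.tokens.get(self.pos) {
            Some(Token::U64(n)) => { self.pos += 1; Ok(*n) }
            _ => Err(()),
        }
    }
    // Skips exactly one element, like a single Decoder::skip call.
    fn skip(&mut self) { self.pos += 1; }
    fn done(&self) -> bool { self.pos >= self.tokens.len() }
}

/// Parses entries; on a bad tx_hash, skips the bad element, the remaining
/// TxIn element AND the TxOut value so the next iteration starts on a key.
fn parse(dec: &mut MockDecoder) -> (u32, u32) {
    let (mut parsed, mut errors) = (0u32, 0u32);
    while !dec.done() {
        if dec.bytes().is_err() {
            errors += 1;
            dec.skip(); // the bad tx_hash element
            dec.skip(); // remaining TxIn element (output_index)
            dec.skip(); // the TxOut value
            continue;
        }
        if dec.u64().is_err() {
            errors += 1;
            dec.skip(); // the bad output_index element
            dec.skip(); // the TxOut value
            continue;
        }
        dec.skip(); // consume the TxOut value
        parsed += 1;
    }
    (parsed, errors)
}

fn main() {
    use Token::*;
    let tokens = vec![
        Bytes(vec![0xaa]), U64(0), Value(10), // good entry
        Bad, U64(1), Value(20),               // bad tx_hash
        Bytes(vec![0xbb]), Bad, Value(30),    // bad output_index
        Bytes(vec![0xcc]), U64(3), Value(40), // good entry
    ];
    let mut dec = MockDecoder { tokens, pos: 0 };
    assert_eq!(parse(&mut dec), (2, 2));
}
```

With only one `skip()` per failure, the loop would restart on a value token and misparse every subsequent entry, which is exactly the desync the comment describes.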
```rust
let mut buffer = Vec::new();
file.read_to_end(&mut buffer).context("Failed to read snapshot file")?;

let mut decoder = Decoder::new(&buffer);
```
Copilot AI, Oct 17, 2025
The parser reads the entire snapshot into memory, which contradicts the streaming goal and will blow memory for multi‑GB snapshots. Prefer a zero‑copy or low‑copy approach such as memory-mapping (memmap2) or a reader-based decoder to keep memory bounded, and stream UTXOs directly from the file.
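The bounded-memory carry-over pattern this suggests can be sketched with a plain reader and a pending buffer. This uses a toy 1-byte length-prefixed framing instead of CBOR, and the function name and chunk size are illustrative, not the PR's implementation:

```rust
use std::io::Read;

/// Reads from `src` in fixed-size chunks, invoking `on_record` for each
/// complete length-prefixed record (1-byte length + payload, a stand-in
/// for a CBOR map entry). Memory stays bounded by the chunk size plus the
/// largest single record, rather than the whole file.
fn stream_records<R: Read, F: FnMut(&[u8])>(
    mut src: R,
    mut on_record: F,
) -> std::io::Result<u64> {
    const CHUNK: usize = 16 * 1024; // 16 KB here; a real parser might use 16 MB
    let mut pending: Vec<u8> = Vec::new();
    let mut buf = vec![0u8; CHUNK];
    let mut count = 0u64;
    loop {
        let n = src.read(&mut buf)?;
        if n == 0 {
            break;
        }
        pending.extend_from_slice(&buf[..n]);
        // Drain all complete records from the pending buffer.
        let mut pos = 0;
        while pos < pending.len() {
            let len = pending[pos] as usize;
            if pos + 1 + len > pending.len() {
                break; // partial record: wait for more input
            }
            on_record(&pending[pos + 1..pos + 1 + len]);
            count += 1;
            pos += 1 + len;
        }
        pending.drain(..pos); // keep only the partial tail
    }
    Ok(count)
}

fn main() {
    // Three records: "aa", "bbb", "c", each preceded by its length byte.
    let mut data = Vec::new();
    for payload in [&b"aa"[..], b"bbb", b"c"] {
        data.push(payload.len() as u8);
        data.extend_from_slice(payload);
    }
    let mut seen = Vec::new();
    let n = stream_records(std::io::Cursor::new(data), |rec| seen.push(rec.to_vec()))
        .unwrap();
    assert_eq!(n, 3);
    assert_eq!(seen[1], b"bbb");
}
```

Real incremental CBOR decoding is harder because element boundaries are not a simple length prefix, which is why the discussion below turns to ciborium; the carry-over of a partial tail between reads is the part that transfers.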
Good point. We do read the whole damn thing into memory first. I had a previous version which read the UTXOs in 16 MB at a time. I'll leave that for another PR so we don't keep pushing this out.
Agreed - I thought the benefit of streaming/callback was we could read it progressively, at least at the overall section level and with chunks of UTXOs / addresses.
Yep. I had planned to do this and thought I'd move things along faster by pushing it out to a later version. But I can go ahead and address it now.
Wow. This is way way harder than I thought. My previous "16 MB streamer" wasn't really doing that at all. I'm trying an approach now with ciborium but may just punt this for later since we just need to get boot from snapshot working.
The UTXO element is a CBOR map with 11.2M entries:
Average: 194 bytes per UTXO
Minimum: 92 bytes
Maximum: 22,656 bytes (~22KB)
Most common: 100-199 bytes (9.2M entries)
So it should be good for ciborium streaming. Will give it a try.
Done. 4.7 seconds to parse the whole file, using a max of 256 MB (metadata) plus a 64 MB buffer for streaming UTXO parsing.
```rust
"  Account #{}: {} (utxo: {}, rewards: {}, pool: {:?}, drep: {:?})",
i + 1,
&account.stake_address[..32],
account.address_state.utxo_value,
account.address_state.rewards,
account.address_state.delegated_spo.as_ref().map(|s| &s[..16]),
account.address_state.delegated_drep
```
Copilot AI, Oct 17, 2025
Slicing delegated_spo bytes with &s[..16] can panic when s.len() < 16. Consider formatting as hex and truncating on the string: let h = hex::encode(s); let shown = &h[..h.len().min(16)];
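The safe-truncation idea can be sketched with a stdlib-only hex helper (the function name is illustrative):

```rust
/// Returns at most `n` characters of a hex rendering of `bytes`,
/// never panicking on short input. (Helper name is illustrative.)
fn hex_prefix(bytes: &[u8], n: usize) -> String {
    let h: String = bytes.iter().map(|b| format!("{:02x}", b)).collect();
    // Hex output is pure ASCII, so character-based truncation is safe.
    h.chars().take(n).collect()
}

fn main() {
    assert_eq!(hex_prefix(&[0xab, 0xcd, 0xef], 4), "abcd");
    // Short input no longer panics; it just yields everything available.
    assert_eq!(hex_prefix(&[0x0f], 16), "0f");
}
```

Unlike `&s[..16]`, this cannot panic when the input is shorter than the requested prefix.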
```rust
eprintln!(
    "  Proposal #{}: {} (deposit: {}, action: {}, by: {})",
    i + 1,
    proposal.gov_action_id,
    proposal.deposit,
    proposal.gov_action,
    &proposal.reward_account[..32]
```
Copilot AI, Oct 17, 2025
Fixed-length slicing of reward_account (&[..32]) risks panic if the string is shorter than 32. Use a safe truncation (e.g., let s = &proposal.reward_account; let shown = &s[..s.len().min(32)];).
Suggested change:

```diff
-    eprintln!(
-        "  Proposal #{}: {} (deposit: {}, action: {}, by: {})",
-        i + 1,
-        proposal.gov_action_id,
-        proposal.deposit,
-        proposal.gov_action,
-        &proposal.reward_account[..32]
+    let s = &proposal.reward_account;
+    let shown = &s[..s.len().min(32)];
+    eprintln!(
+        "  Proposal #{}: {} (deposit: {}, action: {}, by: {})",
+        i + 1,
+        proposal.gov_action_id,
+        proposal.deposit,
+        proposal.gov_action,
+        shown
```
- Defer on_metadata callback until after parsing deposits from UTxOState[1]
- Parse deposits field correctly (previously hardcoded to 0)
- Fix stream_utxos to consume break marker for indefinite-length maps
- Return UTXO count from stream_utxos for accurate metadata
- Update callback order: metadata now called after all data, before on_complete
- Update documentation to reflect new callback invocation order

Metadata now reports accurate pot balances:
- Deposits: 4,612,238,000,000 lovelace (~4.6M ADA)
- UTXO count: 11,199,911

This ensures correct accounting of the deposits pot, which is critical for bootstrap state distribution.
```rust
}

#[inline]
pub fn entry(&mut self, stake_key: KeyHash) -> Entry<KeyHash, StakeAddressState> {
```
These are fixes for warnings so we can eventually run cargo check --warnings and pass.
Pull Request Overview
Copilot reviewed 19 out of 20 changed files in this pull request and generated 10 comments.
```rust
reward_account: hex::encode(&params.reward_account.0),
pool_owners: params.owners.iter().map(|h| h.to_string()).collect(),
```
Copilot AI, Oct 17, 2025
PoolInfo.reward_account is documented as Bech32 but is emitted as hex. Convert to Bech32 here or revise the API docs/field name to reflect hex output to avoid consumer confusion.
```rust
impl<const BYTES: usize> From<&[u8]> for Hash<BYTES> {
    fn from(value: &[u8]) -> Self {
        let mut hash = [0; BYTES];
        hash.copy_from_slice(value);
        Self::new(hash)
    }
}
```
Copilot AI, Oct 17, 2025
copy_from_slice will panic if value.len() != BYTES. Prefer TryFrom<&[u8]> returning a Result to avoid panics, or at minimum assert the length and map to a descriptive error.
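A sketch of the suggested `TryFrom` approach, with `Hash` reduced to a bare newtype for illustration (the real type carries more impls and a richer error):

```rust
use std::convert::TryFrom;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Hash<const BYTES: usize>([u8; BYTES]);

impl<const BYTES: usize> Hash<BYTES> {
    pub fn new(hash: [u8; BYTES]) -> Self {
        Self(hash)
    }
}

/// Fallible conversion: rejects mismatched lengths instead of panicking.
impl<const BYTES: usize> TryFrom<&[u8]> for Hash<BYTES> {
    type Error = String;
    fn try_from(value: &[u8]) -> Result<Self, Self::Error> {
        let arr: [u8; BYTES] = value
            .try_into()
            .map_err(|_| format!("expected {} bytes, got {}", BYTES, value.len()))?;
        Ok(Self::new(arr))
    }
}

fn main() {
    // Correct length converts cleanly.
    assert!(Hash::<4>::try_from(&[1u8, 2, 3, 4][..]).is_ok());
    // Wrong length becomes an Err, not a panic.
    assert!(Hash::<4>::try_from(&[1u8, 2, 3][..]).is_err());
}
```

The stdlib's `TryFrom<&[u8]> for [u8; N]` does the length check, so no manual assert is needed.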
```makefile
BH_FLAG=$${BLOCK_HASH:+--block-hash $$BLOCK_HASH}; \
BHGT_FLAG=$${BLOCK_HEIGHT:+--block-height $$BLOCK_HEIGHT}; \
if [ -f scripts/generate_manifest.py ]; then \
	$(PYTHON) scripts/generate_manifest.py $$ERA_FLAG $$BH_FLAG $$BHGT_FLAG $< > $@; \
```
Copilot AI, Oct 17, 2025
The script expects the positional snapshot argument before flags (./generate_manifest.py [flags]); current order passes flags first. Fix ordering:
```diff
-	$(PYTHON) scripts/generate_manifest.py $$ERA_FLAG $$BH_FLAG $$BHGT_FLAG $< > $@; \
+	$(PYTHON) scripts/generate_manifest.py $< $$ERA_FLAG $$BH_FLAG $$BHGT_FLAG > $@; \
```
```
@@ -0,0 +1 @@
134092758.670ca68c3de580f8469677754a725e86ca72a7be381d3108569f0704a5fca327.cbor filter=lfs diff=lfs merge=lfs -text
```
Copilot AI, Oct 17, 2025
Only the large CBOR is tracked by LFS; snapshot-small.cbor is an LFS pointer but is not covered here, leading to inconsistent local behavior (e.g., SHA256 checks on pointer text). Track all CBOR fixtures with a pattern like *.cbor or add an explicit line for snapshot-small.cbor.
Suggested change:

```diff
-134092758.670ca68c3de580f8469677754a725e86ca72a7be381d3108569f0704a5fca327.cbor filter=lfs diff=lfs merge=lfs -text
+*.cbor filter=lfs diff=lfs merge=lfs -text
```
```
@@ -0,0 +1,30 @@
// SPDX-License-Identifier: Apache-2.0
```
This may well be the right thing to do but let's discuss - is it implicit in the repo licence?
It's what Amaru does and I think we're on the same footing there. Also, since we want to be able to use some Amaru code, it makes sense to use the same license.
```rust
/// Errors that can occur during snapshot parsing
#[derive(Debug)]
pub enum SnapshotError {
```
Nice - I've been wondering for a while whether we should standardise on error enums instead of convenient but slapdash anyhow!()
I think thiserror is the way to go since it's hidden within our libraries but generates all the nice things for Error automatically if you annotate your errors slightly.
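For context, here is a hand-written version of the boilerplate that a thiserror derive would generate; the variant names are hypothetical, not the PR's actual SnapshotError variants:

```rust
use std::fmt;

#[derive(Debug)]
pub enum SnapshotError {
    Io(std::io::Error),
    Cbor(String),
    InvalidManifest(String),
}

// thiserror's #[error("...")] attributes would generate this Display impl.
impl fmt::Display for SnapshotError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SnapshotError::Io(e) => write!(f, "I/O error: {}", e),
            SnapshotError::Cbor(msg) => write!(f, "CBOR decode error: {}", msg),
            SnapshotError::InvalidManifest(msg) => write!(f, "invalid manifest: {}", msg),
        }
    }
}

impl std::error::Error for SnapshotError {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            SnapshotError::Io(e) => Some(e),
            _ => None,
        }
    }
}

// `?` conversion from io::Error, as thiserror's #[from] would provide.
impl From<std::io::Error> for SnapshotError {
    fn from(e: std::io::Error) -> Self {
        SnapshotError::Io(e)
    }
}

fn main() {
    let e = SnapshotError::Cbor("unexpected type".to_string());
    assert_eq!(e.to_string(), "CBOR decode error: unexpected type");
    let io_err: SnapshotError = std::io::Error::new(std::io::ErrorKind::NotFound, "gone").into();
    assert!(matches!(io_err, SnapshotError::Io(_)));
}
```

With thiserror, all of the above collapses to the enum plus a few `#[error(...)]`/`#[from]` annotations, while callers still get `Display`, `Error::source`, and `?` conversions.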
```rust
}

// Serialize implementation requires pallas_addresses which is not currently a dependency
// TODO: Add pallas_addresses or implement Bech32 encoding differently
```
I think we should have everything you need in address.rs - if not, it could be added there. If it's Pallas specific, it could go in codec/
```rust
}

/// Maybe type (optional with explicit encoding)
#[derive(Debug, Clone, PartialEq, Eq)]
```
Do we need this instead of Option because we want to add a custom decoder?
```rust
let epoch = decoder.u64().context("Failed to parse epoch number")?;

// Skip blocks_previous_epoch [1] and blocks_current_epoch [2]
decoder.skip().context("Failed to skip blocks_previous_epoch")?;
```
I think we need these block counts to bootstrap the rewards calcs in AccountsState
Alright! Thanks for noting that.
What kind of information do we need from the blocks?
I've added block parsing, but all we really get is the number of blocks.
```rust
// Extract rewards from rewards_and_deposit (first element of tuple)
let rewards = match &account.rewards_and_deposit {
    StrictMaybe::Just((reward, _deposit)) => *reward,
```
Interesting, we don't track per-account deposit at the moment - this makes me wonder whether there's some case that requires us to do so.
```rust
stake_address,
address_state: StakeAddressState {
    registered: false, // Accounts are registered by SPOState
    utxo_value: 0, // Not available in DState, would need to aggregate from UTxOs
```
AccountsState module will do this for you if you just feed the UTXOState raw UTXOs
- Implement optimized streaming with 64MB parse buffer and 16MB read chunks
- Add batch processing for multiple UTXOs per buffer cycle

Performance results:
- Memory: 256MB metadata + 64MB peak (vs 2.1GB original)
- Speed: 2.4M UTXOs/sec in release build (254K in debug)
- Total time: 4.7s for 11.2M UTXOs (release build)
…d the order of its elements to match the Haskell Node CBOR format
@sandtreader I'm inclined to merge this now so I can start working on snapshot booting. I think I've addressed all the main concerns except perhaps the question about utxo_value. Perhaps I'll just remove that element when I work on sending UTXOs on the message bus.
Implement a complete streaming parser for Cardano Conway-era snapshots that parses the NewEpochState CBOR structure without loading everything into memory.
Key features:

docs

Components added:
TESTING

Run `make snap-test-streaming` at the top level to see it in action. That invokes a new integration test under `examples` which reads the giant Epoch 505 snapshot file. There is a README.md file there too for instructions.