
Conversation

@buddhisthead buddhisthead commented Oct 17, 2025

Implement a complete streaming parser for Cardano Conway-era snapshots that parses the NewEpochState CBOR structure without loading everything into memory.

Key features:

  • Callback-based streaming architecture for UTXOs, pools, accounts, and DReps (see the sketch after this list)
  • Parses 11.2M UTXOs at ~2M UTXOs/second
  • Extracts 3,095 stake pools with full metadata (pledge, cost, margin, relays)
  • Parses 1.41M stake accounts with rewards and delegations (SPO & DRep)
  • Extracts 278 DReps with deposits and anchor metadata
  • Uses minicbor Decode trait for type-safe CBOR parsing
  • Adds documentation about the snapshot file structure; see the files under docs/
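
A minimal sketch of what the callback surface looks like from a consumer's point of view; the type and method names here are illustrative, not necessarily the exact identifiers in this PR:

```rust
/// Illustrative UTXO record; the real parser exposes richer types.
pub struct UtxoEntry {
    pub tx_hash: Vec<u8>,
    pub output_index: u64,
    pub lovelace: u64,
}

/// Callbacks fired as the parser streams through NewEpochState.
/// Default no-op bodies let implementors override only what they need.
pub trait SnapshotCallbacks {
    fn on_utxo(&mut self, _utxo: &UtxoEntry) {}
    fn on_complete(&mut self) {}
}

/// Example consumer: count UTXOs without retaining them in memory.
struct UtxoCounter {
    utxos: u64,
}

impl SnapshotCallbacks for UtxoCounter {
    fn on_utxo(&mut self, _utxo: &UtxoEntry) {
        self.utxos += 1;
    }
}
```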

Components added:

  • streaming_snapshot.rs: Main streaming parser with callback traits
  • pool_params.rs: Stake pool parameter types with CBOR decoding
  • hash.rs: Generic hash types for pool IDs, VRF keys, addresses (adapted from Amaru)
  • AccountState/StakeAddressState: Shared types for account state. AccountState can be sent on the message bus.
  • test_streaming_parser.rs: Example showing callback usage
  • Makefile: snap-test-streaming target for testing

TESTING

Run make snap-test-streaming at the top level to see it in action. This invokes a new integration test under examples that reads the full Epoch 505 snapshot file; the README.md there has instructions.

- Add streaming_snapshot.rs with callback-based parser
- Add parser.rs with manifest validation
- Add hash.rs and pool_params.rs for snapshot types
- Move Account/AccountDRep types into streaming_snapshot.rs (internal only)
- Add AccountState to stake_addresses.rs for external API
- Add sha2 dependency for integrity checking
- Include test fixtures for validation
Copilot AI review requested due to automatic review settings October 17, 2025 19:13
Contributor

Copilot AI left a comment


Pull Request Overview

Adds a callback-based streaming snapshot parser for Cardano Conway-era snapshots, plus manifest tooling, fixtures, and documentation.

  • Introduces a streaming parser that navigates NewEpochState and invokes callbacks for UTXOs, pools, accounts, DReps, and proposals
  • Adds manifest parsing and integrity validation utilities and example/Makefile for running the parser against large fixtures
  • Documents Amaru snapshot structure and the parser’s architecture and usage

Reviewed Changes

Copilot reviewed 19 out of 20 changed files in this pull request and generated 9 comments.

File summary:

common/src/snapshot/streaming_snapshot.rs: Core streaming parser, data types, and callback traits; includes UTXO streaming and pools/accounts/DReps extraction
common/src/snapshot/pool_params.rs: Pool parameter CBOR encode/decode types used by the streaming parser
common/src/snapshot/parser.rs: Manifest parsing and integrity validation utilities
common/src/snapshot/mod.rs: Module wiring and re-exports
common/src/hash.rs: Generic hash utilities and minicbor (de)serialization helpers used by parser types
common/examples/test_streaming_parser.rs: Example harness to run the streaming parser and print summaries
scripts/generate_manifest.py: Standalone manifest generation for snapshots (filename/header-based), streaming SHA256
docs/streaming-snapshot-parser.md: Parser architecture, callbacks, features, and usage
docs/amaru-snapshot-structure.md: Detailed description of the Amaru snapshot (NewEpochState) structure
tests/fixtures/*: LFS pointer files and sample manifests for testing
tests/fixtures/.gitattributes: LFS settings for the large fixture
common/Cargo.toml: Adds the sha2 dependency used by integrity validation
common/src/lib.rs: Exposes the new snapshot and hash modules
Makefile: Convenience targets to run the parser example and generate manifests


Comment on lines 1434 to 1450
let tx_hash_bytes = match decoder.bytes() {
Ok(b) => b,
Err(_e) => {
errors += 1;
decoder.skip().ok(); // skip remaining TxIn fields and value
continue;
}
};

let output_index = match decoder.u64() {
Ok(idx) => idx,
Err(_e) => {
errors += 1;
decoder.skip().ok(); // skip value
continue;
}
};

Copilot AI Oct 17, 2025


On key-parse errors, only one skip() is issued, which leaves the map value (TxOut) unconsumed and desynchronizes the decoder. When tx_hash decoding fails, you need to skip the remaining element(s) of the TxIn array and then also skip the corresponding TxOut value; similarly, when output_index fails, you must skip the TxOut value as well. Example fix: call skip() for the missing array element and then an additional skip() for the value.
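
A minimal sketch of the suggested resynchronization, assuming a failed read leaves the decoder positioned at the unread element so each skip() consumes exactly one element:

```rust
let tx_hash_bytes = match decoder.bytes() {
    Ok(b) => b,
    Err(_e) => {
        errors += 1;
        decoder.skip().ok(); // the element that failed to parse as bytes
        decoder.skip().ok(); // the output_index element of the TxIn array
        decoder.skip().ok(); // the corresponding TxOut map value
        continue;
    }
};

let output_index = match decoder.u64() {
    Ok(idx) => idx,
    Err(_e) => {
        errors += 1;
        decoder.skip().ok(); // the element that failed to parse as u64
        decoder.skip().ok(); // the corresponding TxOut map value
        continue;
    }
};
```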

Comment on lines 842 to 845
let mut buffer = Vec::new();
file.read_to_end(&mut buffer).context("Failed to read snapshot file")?;

let mut decoder = Decoder::new(&buffer);

Copilot AI Oct 17, 2025


The parser reads the entire snapshot into memory, which contradicts the streaming goal and will blow memory for multi‑GB snapshots. Prefer a zero‑copy or low‑copy approach such as memory-mapping (memmap2) or a reader-based decoder to keep memory bounded, and stream UTXOs directly from the file.
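
A minimal sketch of the memory-mapped alternative, assuming the memmap2 crate; the OS then pages the file in and out instead of the heap holding the full snapshot:

```rust
use memmap2::Mmap;
use std::fs::File;

let file = File::open(&path)?; // `path` as in the surrounding parser code
// Safety: the mapping is read-only and the file must not be truncated
// while mapped.
let mmap = unsafe { Mmap::map(&file)? };
let mut decoder = minicbor::Decoder::new(&mmap[..]);
```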

Collaborator Author


Good point. We do read the whole damn thing into memory first. I had a previous version which read the UTXOs in 16 MB at a time. I'll leave that for another PR so we don't keep pushing this out.

Collaborator


Agreed - I thought the benefit of streaming/callback was that we could read it progressively, at least at the overall section level and in chunks of UTXOs / addresses.

Collaborator Author


Yep. I had planned to do this and thought I'd move things along faster by pushing it out to a later version. But I can go ahead and address it now.

Collaborator Author


Wow. This is way way harder than I thought. My previous "16 MB streamer" wasn't really doing that at all. I'm trying an approach now with ciborium but may just punt this for later since we just need to get boot from snapshot working.

Collaborator Author


The UTXO element is a CBOR map with 11.2M entries:

Average: 194 bytes per UTXO
Minimum: 92 bytes
Maximum: 22,656 bytes (~22KB)
Most common: 100-199 bytes (9.2M entries)

So it should be good for ciborium streaming. Will give it a try.

Collaborator Author


Done. 4.7 seconds to parse the whole file, using a maximum of 256 MB (metadata) plus a 64 MB buffer for streaming UTXO parsing.

Comment on lines +104 to +110
" Account #{}: {} (utxo: {}, rewards: {}, pool: {:?}, drep: {:?})",
i + 1,
&account.stake_address[..32],
account.address_state.utxo_value,
account.address_state.rewards,
account.address_state.delegated_spo.as_ref().map(|s| &s[..16]),
account.address_state.delegated_drep

Copilot AI Oct 17, 2025


Slicing delegated_spo bytes with &s[..16] can panic when s.len() < 16. Consider formatting as hex and truncating on the string: let h = hex::encode(s); let shown = &h[..h.len().min(16)];

Comment on lines +160 to +166
eprintln!(
" Proposal #{}: {} (deposit: {}, action: {}, by: {})",
i + 1,
proposal.gov_action_id,
proposal.deposit,
proposal.gov_action,
&proposal.reward_account[..32]

Copilot AI Oct 17, 2025


Fixed-length slicing of reward_account (&[..32]) risks panic if the string is shorter than 32. Use a safe truncation (e.g., let s = &proposal.reward_account; let shown = &s[..s.len().min(32)];).

Suggested change (diff):
- eprintln!(
-     " Proposal #{}: {} (deposit: {}, action: {}, by: {})",
-     i + 1,
-     proposal.gov_action_id,
-     proposal.deposit,
-     proposal.gov_action,
-     &proposal.reward_account[..32]
+ let s = &proposal.reward_account;
+ let shown = &s[..s.len().min(32)];
+ eprintln!(
+     " Proposal #{}: {} (deposit: {}, action: {}, by: {})",
+     i + 1,
+     proposal.gov_action_id,
+     proposal.deposit,
+     proposal.gov_action,
+     shown

@buddhisthead buddhisthead requested a review from lowhung October 17, 2025 19:20
- Defer on_metadata callback until after parsing deposits from UTxOState[1]
- Parse deposits field correctly (previously hardcoded to 0)
- Fix stream_utxos to consume break marker for indefinite-length maps
- Return UTXO count from stream_utxos for accurate metadata
- Update callback order: metadata now called after all data, before on_complete
- Update documentation to reflect new callback invocation order

Metadata now reports accurate pot balances:
- Deposits: 4,612,238,000,000 lovelace (~4.6M ADA)
- UTXO count: 11,199,911

This ensures correct accounting of the deposits pot, which is critical
for bootstrap state distribution.
@buddhisthead buddhisthead requested a review from Copilot October 17, 2025 20:11
}

#[inline]
pub fn entry(&mut self, stake_key: KeyHash) -> Entry<KeyHash, StakeAddressState> {
Collaborator Author


These are fixes for warnings so we can eventually run cargo check with warnings denied and have it pass.

Contributor

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 19 out of 20 changed files in this pull request and generated 10 comments.



Comment on lines +1276 to +1277
reward_account: hex::encode(&params.reward_account.0),
pool_owners: params.owners.iter().map(|h| h.to_string()).collect(),

Copilot AI Oct 17, 2025


PoolInfo.reward_account is documented as Bech32 but is emitted as hex. Convert to Bech32 here or revise the API docs/field name to reflect hex output to avoid consumer confusion.
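
A hedged sketch of the Bech32 conversion, assuming the bech32 crate's 0.9.x API; the "stake" human-readable part is illustrative, since the real prefix depends on network and credential type:

```rust
use bech32::{ToBase32, Variant};

let bytes: &[u8] = &params.reward_account.0;
let reward_account = bech32::encode("stake", bytes.to_base32(), Variant::Bech32)?;
```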

Comment on lines +49 to +55
impl<const BYTES: usize> From<&[u8]> for Hash<BYTES> {
fn from(value: &[u8]) -> Self {
let mut hash = [0; BYTES];
hash.copy_from_slice(value);
Self::new(hash)
}
}

Copilot AI Oct 17, 2025


copy_from_slice will panic if value.len() != BYTES. Prefer TryFrom<&[u8]> returning a Result to avoid panics, or at minimum assert the length and map to a descriptive error.
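
A sketch of the panic-free alternative; TryFrom surfaces a length mismatch as an error instead of panicking inside copy_from_slice:

```rust
impl<const BYTES: usize> TryFrom<&[u8]> for Hash<BYTES> {
    type Error = std::array::TryFromSliceError;

    fn try_from(value: &[u8]) -> Result<Self, Self::Error> {
        // try_into checks value.len() == BYTES and errors instead of panicking.
        let hash: [u8; BYTES] = value.try_into()?;
        Ok(Self::new(hash))
    }
}
```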

BH_FLAG=$${BLOCK_HASH:+--block-hash $$BLOCK_HASH}; \
BHGT_FLAG=$${BLOCK_HEIGHT:+--block-height $$BLOCK_HEIGHT}; \
if [ -f scripts/generate_manifest.py ]; then \
$(PYTHON) scripts/generate_manifest.py $$ERA_FLAG $$BH_FLAG $$BHGT_FLAG $< > $@; \

Copilot AI Oct 17, 2025


The script expects the positional snapshot argument before flags (./generate_manifest.py <snapshot> [flags]); the current order passes flags first. Fix ordering: $(PYTHON) scripts/generate_manifest.py $< $$ERA_FLAG $$BH_FLAG $$BHGT_FLAG > $@

Suggested change (diff):
- $(PYTHON) scripts/generate_manifest.py $$ERA_FLAG $$BH_FLAG $$BHGT_FLAG $< > $@; \
+ $(PYTHON) scripts/generate_manifest.py $< $$ERA_FLAG $$BH_FLAG $$BHGT_FLAG > $@; \

@@ -0,0 +1 @@
134092758.670ca68c3de580f8469677754a725e86ca72a7be381d3108569f0704a5fca327.cbor filter=lfs diff=lfs merge=lfs -text

Copilot AI Oct 17, 2025


Only the large CBOR is tracked by LFS; snapshot-small.cbor is an LFS pointer but is not covered here, leading to inconsistent local behavior (e.g., SHA256 checks on pointer text). Track all CBOR fixtures with a pattern like *.cbor or add an explicit line for snapshot-small.cbor.

Suggested change (diff):
- 134092758.670ca68c3de580f8469677754a725e86ca72a7be381d3108569f0704a5fca327.cbor filter=lfs diff=lfs merge=lfs -text
+ *.cbor filter=lfs diff=lfs merge=lfs -text

@@ -0,0 +1,30 @@
// SPDX-License-Identifier: Apache-2.0
Collaborator


This may well be the right thing to do but let's discuss - is it implicit in the repo licence?

Collaborator Author


It's what Amaru does and I think we're on the same footing there. Also, since we want to be able to use some Amaru code, it makes sense to use the same license.


/// Errors that can occur during snapshot parsing
#[derive(Debug)]
pub enum SnapshotError {
Collaborator


Nice - I've been wondering for a while whether we should standardise on error enums instead of convenient but slapdash anyhow!()

Collaborator Author


I think thiserror is the way to go since it's hidden within our libraries but generates all the nice things for Error automatically if you annotate your errors slightly.
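
A hedged sketch of the thiserror approach; the variant names are illustrative rather than the actual SnapshotError shape in this PR:

```rust
use thiserror::Error;

#[derive(Debug, Error)]
pub enum SnapshotError {
    #[error("I/O error reading snapshot: {0}")]
    Io(#[from] std::io::Error),
    #[error("CBOR decode error: {0}")]
    Decode(String),
    #[error("manifest integrity check failed: expected {expected}, got {actual}")]
    IntegrityMismatch { expected: String, actual: String },
}
```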

}

// Serialize implementation requires pallas_addresses which is not currently a dependency
// TODO: Add pallas_addresses or implement Bech32 encoding differently
Collaborator


I think we should have everything you need in address.rs - if not, it could be added there. If it's Pallas specific, it could go in codec/

}

/// Maybe type (optional with explicit encoding)
#[derive(Debug, Clone, PartialEq, Eq)]
Collaborator


Do we need this instead of Option because we want to add a custom decoder?
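
For context, the usual motivation, sketched under the assumption that the ledger encodes (Strict)Maybe as a zero- or one-element CBOR array rather than null, which Option's default minicbor mapping doesn't match:

```rust
use minicbor::{decode::Error, Decode, Decoder};

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum StrictMaybe<T> {
    Nothing,
    Just(T),
}

impl<'b, C, T: Decode<'b, C>> Decode<'b, C> for StrictMaybe<T> {
    fn decode(d: &mut Decoder<'b>, ctx: &mut C) -> Result<Self, Error> {
        // Assumed encoding: [] for Nothing, [value] for Just.
        match d.array()? {
            Some(0) => Ok(StrictMaybe::Nothing),
            Some(1) => Ok(StrictMaybe::Just(T::decode(d, ctx)?)),
            _ => Err(Error::message("invalid StrictMaybe encoding")),
        }
    }
}
```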

let epoch = decoder.u64().context("Failed to parse epoch number")?;

// Skip blocks_previous_epoch [1] and blocks_current_epoch [2]
decoder.skip().context("Failed to skip blocks_previous_epoch")?;
Collaborator


I think we need these block counts to bootstrap the rewards calcs in AccountsState

Collaborator Author


Alright! Thanks for noting that.

Collaborator Author


What kind of information do we need from the blocks?

Collaborator Author


I've added block parsing, but all we really get are the number of blocks.
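
A hedged sketch of what decoding those block counts could look like, assuming each blocks-made field is a definite-length CBOR map of pool key-hash to block count (names are illustrative):

```rust
use std::collections::HashMap;

let mut blocks_by_pool: HashMap<Vec<u8>, u64> = HashMap::new();
if let Some(n) = decoder.map()? {
    for _ in 0..n {
        let pool_id = decoder.bytes()?.to_vec(); // pool operator key hash
        let count = decoder.u64()?;              // blocks minted this epoch
        blocks_by_pool.insert(pool_id, count);
    }
}
```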


// Extract rewards from rewards_and_deposit (first element of tuple)
let rewards = match &account.rewards_and_deposit {
StrictMaybe::Just((reward, _deposit)) => *reward,
Collaborator


Interesting, we don't track per-account deposit at the moment - this makes me wonder whether there's some case that requires us to do so.

stake_address,
address_state: StakeAddressState {
registered: false, // Accounts are registered by SPOState
utxo_value: 0, // Not available in DState, would need to aggregate from UTxOs
Collaborator


AccountsState module will do this for you if you just feed the UTXOState raw UTXOs

- Implement optimized streaming with 64MB parse buffer and 16MB read chunks
- Add batch processing for multiple UTXOs per buffer cycle

Performance Results:
- Memory: 256 MB metadata + 64 MB peak (vs 2.1 GB original)
- Speed: 2.4M UTXOs/sec in release build (254K in debug)
- Total time: 4.7s for 11.2M UTXOs (release build)
…d the order of its elements to match Haskell Node CBOR format
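
For context, a hedged sketch of the chunked-refill pattern the performance commit above describes; buffer sizes follow the commit message, while decode_utxo_entry and callbacks are hypothetical names:

```rust
use std::io::Read;

const READ_CHUNK: usize = 16 * 1024 * 1024; // 16 MB reads, per the commit

let mut buffer: Vec<u8> = Vec::new();
loop {
    // Top the parse buffer up with the next chunk from the file.
    let mut chunk = vec![0u8; READ_CHUNK];
    let n = file.read(&mut chunk)?;
    if n == 0 && buffer.is_empty() {
        break; // file exhausted and nothing left to decode
    }
    buffer.extend_from_slice(&chunk[..n]);

    // Decode as many complete UTXO entries as the buffer holds.
    let mut decoder = minicbor::Decoder::new(&buffer);
    let mut consumed = 0;
    loop {
        match decode_utxo_entry(&mut decoder) {
            Ok(utxo) => {
                consumed = decoder.position();
                callbacks.on_utxo(&utxo);
            }
            // A truncated entry at the buffer tail: refill and retry
            // (a real impl must error here if the file is already exhausted).
            Err(e) if e.is_end_of_input() => break,
            Err(e) => return Err(e.into()),
        }
    }
    buffer.drain(..consumed); // keep only the partial tail for the next pass
}
```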
@buddhisthead
Collaborator Author

@sandtreader I'm inclined to merge this now so I can start working on snapshot booting. I think I've addressed all the main concerns except perhaps the question about utxo_value. Perhaps I'll just remove that element when I work on sending UTXOs on the message bus.

@buddhisthead buddhisthead merged commit 9cd06a7 into main Oct 21, 2025
2 checks passed
@SupernaviX SupernaviX deleted the cet-snap-parser branch October 28, 2025 17:17