@treff7es (Contributor) commented Nov 21, 2025

Problem

DataHub users saw inconsistent lineage: tables appeared in column-level lineage but were missing from
table-level lineage. The inconsistency arises in two ways:

  1. Incomplete Snowflake metadata: Snowflake's ACCESS_HISTORY.DIRECT_OBJECTS_ACCESSED sometimes omits
    tables that appear in BASE_OBJECTS_ACCESSED, leading to missing tables in table-level lineage
  2. Aggregation gaps: During SQL parsing and aggregation, column-level lineage may reference tables not
    present in the aggregated table-level upstream list

This resulted in incomplete lineage graphs and confusion for users expecting to see all upstream
dependencies.

Solution

Implemented a two-layer defense strategy to guarantee lineage consistency:

Layer 1: Snowflake Source Fix (snowflake_lineage_v2.py)

  • Extracts unique tables from column-level lineage after processing UPSTREAM_COLUMNS
  • Compares with tables from UPSTREAM_TABLES (via directSources)
  • Automatically adds missing tables to the upstream list
  • Logs the specific missing tables whenever Snowflake metadata is incomplete
  • Tracks metrics: num_tables_added_from_column_lineage
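
The Layer 1 steps above can be sketched roughly as follows. This is a minimal illustration, not the code in snowflake_lineage_v2.py: the function name, the argument names, and the dict shape used for column entries are all hypothetical stand-ins for whatever the source builds from UPSTREAM_TABLES and UPSTREAM_COLUMNS.

```python
import logging
from typing import Dict, List

logger = logging.getLogger(__name__)


def add_tables_missing_from_upstreams(
    target_table: str,
    upstream_tables: List[str],
    upstream_columns: List[Dict[str, str]],
) -> List[str]:
    """Append tables seen only in column lineage to upstream_tables (in place).

    Each entry of upstream_columns is a hypothetical dict such as
    {"table": "db.schema.table1", "column": "id"}.
    Returns the tables that were added, in first-seen order.
    """
    known = set(upstream_tables)
    missing: List[str] = []
    for col in upstream_columns:
        table = col["table"]
        if table not in known:
            known.add(table)  # also de-duplicates repeated column entries
            missing.append(table)

    if missing:
        upstream_tables.extend(missing)
        logger.info(
            "Found %d table(s) in column lineage but not in table lineage for %s. "
            "Missing tables: %s",
            len(missing),
            target_table,
            missing,
        )
    return missing


upstreams = ["db.schema.table1"]
columns = [
    {"table": "db.schema.table1", "column": "id"},
    {"table": "db.schema.table2", "column": "amount"},
    {"table": "db.schema.table2", "column": "currency"},
]
added = add_tables_missing_from_upstreams("db.schema.target_table", upstreams, columns)
```

In this sketch the length of the returned list is what a counter like num_tables_added_from_column_lineage would be incremented by.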

Layer 2: SQL Aggregator Fix (sql_parsing_aggregator.py)

  • During lineage aggregation, extracts all unique table URNs from column-level lineage (cll)
  • Checks for tables present in column lineage but missing from table-level (upstreams)
  • Adds missing tables inline during processing for efficiency
  • Logs each occurrence at debug level for visibility
  • Tracks metrics: num_tables_added_from_column_lineage, num_queries_with_lineage_inconsistencies_fixed
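
A rough sketch of the Layer 2 check, assuming simplified stand-in types. ColumnRef, ColumnLineage, Report, and reconcile_query below are hypothetical names defined only for this example; they are not the classes used in sql_parsing_aggregator.py. Only the two counter names come from the description above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ColumnRef:
    table: str  # dataset URN of the table owning the column
    column: str


@dataclass
class ColumnLineage:
    downstream: ColumnRef
    upstreams: List[ColumnRef]


@dataclass
class Report:
    num_tables_added_from_column_lineage: int = 0
    num_queries_with_lineage_inconsistencies_fixed: int = 0


def reconcile_query(
    upstreams: List[str], cll: List[ColumnLineage], report: Report
) -> None:
    """Add tables referenced by column-level lineage but absent from upstreams."""
    cll_tables = {ref.table for info in cll for ref in info.upstreams}
    missing = sorted(cll_tables - set(upstreams))  # sorted for stable output
    if not missing:
        return  # already consistent, so counters stay untouched
    upstreams.extend(missing)
    report.num_tables_added_from_column_lineage += len(missing)
    report.num_queries_with_lineage_inconsistencies_fixed += 1


def urn(name: str) -> str:
    return f"urn:li:dataset:(urn:li:dataPlatform:snowflake,{name},PROD)"


report = Report()
query_upstreams = [urn("db.schema.a")]
query_cll = [
    ColumnLineage(
        downstream=ColumnRef(urn("db.schema.target"), "total"),
        upstreams=[
            ColumnRef(urn("db.schema.a"), "x"),
            ColumnRef(urn("db.schema.b"), "y"),
        ],
    )
]
reconcile_query(query_upstreams, query_cll, report)
```

The per-query counter is bumped once no matter how many tables were missing, which is why the two metrics can diverge (for example, 4 tables added across 1 affected query).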

Why Both Layers?

  1. Snowflake-specific issues: Caught at the source layer before aggregation
  2. SQL parser additions: Caught at the aggregator layer when multiple sources are combined
  3. Defense in depth: Guarantees consistency regardless of where the gap originates
  4. Better observability: Separate logs and metrics help identify the root cause

Changes

Modified Files

  • metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_lineage_v2.py (+56/-12)
    • Added consistency fix in get_known_query_lineage() method (lines 269-306)
  • metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_report.py (+4/-0)
    • Added metric: num_tables_added_from_column_lineage
    • Added metric: num_queries_with_empty_directsources
  • metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py (+25/-1)
    • Added consistency fix during lineage aggregation (lines 1379-1401)
    • Added metric: num_tables_added_from_column_lineage
    • Added metric: num_queries_with_lineage_inconsistencies_fixed

Tests Added

  • metadata-ingestion/tests/unit/sql_parsing/test_sql_aggregator.py (+214/-0)
    • test_lineage_consistency_fix_tables_added_from_column_lineage() - Verifies single missing table is
      added
    • test_lineage_consistency_no_fix_needed() - Verifies no changes when consistent
    • test_lineage_consistency_multiple_missing_tables() - Verifies multiple missing tables are added
    • All tests validate both functional correctness and metric tracking

Testing

All 3 new unit tests passing

  • Verified missing tables are correctly added to table-level lineage
  • Verified metrics (num_tables_added_from_column_lineage,
    num_queries_with_lineage_inconsistencies_fixed) are accurately tracked
  • Verified no changes occur when lineage is already consistent
  • Verified handling of multiple missing tables in a single query

Syntax validation: All modified files compile successfully

Logic verification: Set-based algorithm tested with real-world data patterns

Expected Impact

Before Fix

Table-level lineage: 22 tables ❌
Column-level lineage: 26 tables
Missing from table-level lineage: 4 tables (visible in columns, missing from graph)

After Fix

Table-level lineage: 26 tables ✅
Column-level lineage: 26 tables ✅
Consistency: GUARANTEED
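
The before/after counts above follow from plain set arithmetic. A small check, using placeholder table names invented for this example (only the counts 22, 26, and 4 come from the description):

```python
# Placeholder names; only the sizes mirror the numbers quoted above.
table_level = {f"db.schema.table{i}" for i in range(1, 23)}   # 22 tables
column_level = {f"db.schema.table{i}" for i in range(1, 27)}  # 26 tables

missing = column_level - table_level  # in column lineage, absent from the graph
fixed = table_level | missing         # table-level lineage after the fix

print(len(missing), len(fixed))       # 4 26
print(column_level <= fixed)          # True: every column-level table is now upstream
```

The subset test on the last line is the invariant the fix is meant to restore.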

Log Examples

Snowflake Source (INFO level):
Found 4 table(s) in column lineage but not in table lineage for target_table.
This indicates Snowflake's directSources metadata was incomplete.
Adding missing tables to table lineage to ensure consistency.
Missing tables: ['db.schema.table1', 'db.schema.table2', ...]

SQL Aggregator (DEBUG level per occurrence):
Found missing table urn urn:li:dataset:(...) in cll. The query_id was: abc123...

Aggregator Metrics Summary (INFO level):
Added 4 tables from column-level to table-level lineage
Affected queries: 1

Backward Compatibility

Fully backward compatible

  • Only adds missing data, never removes existing data
  • No breaking changes to APIs or data structures
  • No schema changes required

Risk Assessment

Risk Level: Very Low

Why Safe:

  • Defensive fixes only add missing data
  • Well-tested set operations (extraction, comparison, addition)
  • Comprehensive logging for debugging
  • Easy rollback (simple git revert)
  • All tests passing

Related Issues

Fixes lineage inconsistency where users see incomplete upstream dependencies in lineage graphs despite
column-level lineage referencing those tables.

Checklist

  • The PR conforms to DataHub's Contributing Guideline
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated
  • Changes are backward compatible
  • No breaking changes or downtime expected

@github-actions bot added the "ingestion" label (PR or Issue related to the ingestion of metadata) on Nov 21, 2025

codecov bot commented Nov 21, 2025

Codecov Report

❌ Patch coverage is 83.87097% with 5 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines                                | Patch % | Lines
...ingestion/source/snowflake/snowflake_lineage_v2.py   | 72.22%  | 5 Missing ⚠️


alwaysmeticulous bot commented Nov 21, 2025

✅ Meticulous spotted 0 visual differences across 1017 screens tested.

Meticulous evaluated ~8 hours of user flows against your PR.

Last updated for commit b74b27c.

@datahub-cyborg bot added the "needs-review" label (Label for PRs that need review from a maintainer) on Nov 21, 2025
@treff7es changed the title from "fix(ingest/snowflake): Add workaround for mismatch in discovered column level lineage tables and upstream tables" to "fix(ingest/snowflake/sqlparser): Ensure table-column lineage consistency across Snowflake source and SQL aggregator" on Nov 22, 2025