fix(ingest/snowflake/sqlparser): Ensure table-column lineage consistency across Snowflake source and SQL aggregator #15377
+299
−13
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Problem
DataHub users were experiencing inconsistent lineage where tables appeared in column-level lineage but
were missing from table-level lineage. This inconsistency can be manifested in two ways:
ACCESS_HISTORY.DIRECT_OBJECTS_ACCESSEDsometimes omitstables that appear in
BASE_OBJECTS_ACCESSED, leading to missing tables in table-level lineagepresent in the aggregated table-level upstream list
This resulted in incomplete lineage graphs and confusion for users expecting to see all upstream
dependencies.
Solution
Implemented a two-layer defense strategy to guarantee lineage consistency:
Layer 1: Snowflake Source Fix (
snowflake_lineage_v2.py)UPSTREAM_COLUMNSUPSTREAM_TABLES(viadirectSources)num_tables_added_from_column_lineageLayer 2: SQL Aggregator Fix (
sql_parsing_aggregator.py)cll)upstreams)num_tables_added_from_column_lineage,num_queries_with_lineage_inconsistencies_fixedWhy Both Layers?
Changes
Modified Files
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_lineage_v2.py(+56/-12)get_known_query_lineage()method (lines 269-306)metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_report.py(+4/-0)num_tables_added_from_column_lineagenum_queries_with_empty_directsourcesmetadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py(+25/-1)num_tables_added_from_column_lineagenum_queries_with_lineage_inconsistencies_fixedTests Added
metadata-ingestion/tests/unit/sql_parsing/test_sql_aggregator.py(+214/-0)test_lineage_consistency_fix_tables_added_from_column_lineage()- Verifies single missing table isadded
test_lineage_consistency_no_fix_needed()- Verifies no changes when consistenttest_lineage_consistency_multiple_missing_tables()- Verifies multiple missing tables are addedTesting
✅ All 3 new unit tests passing
num_tables_added_from_column_lineage,num_queries_with_lineage_inconsistencies_fixed) are accurately tracked✅ Syntax validation: All modified files compile successfully
✅ Logic verification: Set-based algorithm tested with real-world data patterns
Expected Impact
Before Fix
Table-level lineage: 22 tables ❌
Column-level lineage: 26 tables
Missing from table: 4 tables (visible in columns, missing from graph)
After Fix
Table-level lineage: 26 tables ✅
Column-level lineage: 26 tables ✅
Consistency: GUARANTEED
Log Examples
Snowflake Source (INFO level):
Found 4 table(s) in column lineage but not in table lineage for target_table.
This indicates Snowflake's directSources metadata was incomplete.
Adding missing tables to table lineage to ensure consistency.
Missing tables: ['db.schema.table1', 'db.schema.table2', ...]
SQL Aggregator (DEBUG level per occurrence):
Found missing table urn urn:li:dataset:(...) in cll. The query_id was: abc123...
Aggregator Metrics Summary (INFO level):
Added 4 tables from column-level to table-level lineage
Affected queries: 1
Backward Compatibility
✅ Fully backward compatible
Risk Assessment
Risk Level: Very Low
Why Safe:
Related Issues
Fixes lineage inconsistency where users see incomplete upstream dependencies in lineage graphs despite
column-level lineage referencing those tables.
Checklist
Guideline