Add Printable Dependency Tree for FlowGraph #101

@Edwardvaneechoud

Summary

Add methods to FlowGraph and FlowFrame to print human-readable dependency trees and pipeline summaries, making it easier to inspect and debug data pipelines without opening the visual editor.

Problem

Currently, inspecting a FlowGraph's structure requires either:

  1. Opening the visual editor with ff.open_graph_in_editor()
  2. Reading the default string representation, which only lists node IDs and node types

# Current output is not very useful
print(large_pipeline)
# FlowGraph(Nodes: {13: Node id: 13 (manual_input), 14: Node id: 14 (polars_code), 15: Node id: 15 (select), 16: Node id: 16 (group_by)})

This makes it hard to understand a pipeline's structure quickly during development and debugging, or in environments where the visual editor isn't practical (remote servers, CI/CD, or shared code examples).

Proposed Solution

Add several methods to provide text-based pipeline visualization:

1. FlowGraph.print_tree() - Dependency Tree View

pipeline.flow_graph.print_tree()
# Manual Input (id=13)
# └── Filter (id=14): quality_score > 0.9
#     └── Select (id=15): ["id", "value", "category"]  
#         └── Group By (id=16): group_by=["category"], agg=[mean(value)]

# With descriptions
pipeline.flow_graph.print_tree(show_descriptions=True)
# Manual Input (id=13): "Load simulated data"
# └── Filter (id=14): "Reduces data early"
#     └── Select (id=15): "Reduces columns early"
#         └── Group By (id=16): "Aggregate by category"

# With schema information
pipeline.flow_graph.print_tree(show_schema=True)
# Manual Input (id=13) → [id: i64, quality_score: f64, value: i64, category: str]
# └── Filter (id=14) → [id: i64, quality_score: f64, value: i64, category: str]
#     └── Select (id=15) → [id: i64, value: i64, category: str]
#         └── Group By (id=16) → [category: str, value: f64]
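
To make the tree format concrete, here is a minimal sketch of how the rendering could work. It does not use FlowGraph internals: `render_tree`, `labels`, and `children` are hypothetical names, and the graph is assumed to be available as a plain mapping from node id to downstream node ids.

```python
from typing import Dict, List


def render_tree(labels: Dict[int, str], children: Dict[int, List[int]], root: int) -> List[str]:
    """Return the lines of a box-drawing dependency tree rooted at `root`.

    `labels` maps node id -> display name; `children` maps node id -> downstream ids.
    Both are assumed inputs for this sketch, not FlowGraph attributes.
    """
    lines = [f"{labels[root]} (id={root})"]

    def walk(node_id: int, prefix: str) -> None:
        kids = children.get(node_id, [])
        for i, kid in enumerate(kids):
            last = i == len(kids) - 1
            connector = "└── " if last else "├── "
            lines.append(f"{prefix}{connector}{labels[kid]} (id={kid})")
            # Children of a last sibling get blank padding; others keep a vertical rule.
            walk(kid, prefix + ("    " if last else "│   "))

    walk(root, "")
    return lines


labels = {13: "Manual Input", 14: "Filter", 15: "Select", 16: "Group By"}
children = {13: [14], 14: [15], 15: [16]}
tree_lines = render_tree(labels, children, 13)
print("\n".join(tree_lines))
```

In this sketch a node with several inputs (a join, for example) would appear once under each parent, so a real implementation would need to decide how to display shared nodes.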

2. FlowFrame.print_lineage() - Current Node's Path

pipeline.print_lineage()
# Manual Input → Filter → Select → Group By (current)
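
One possible way to build the lineage string is to walk upstream from the current node. The sketch below assumes a hypothetical `parents` mapping (node id to input node ids) and follows only the first input of each node; how nodes with multiple inputs should be shown is left open.

```python
from typing import Dict, List, Optional


def lineage(labels: Dict[int, str], parents: Dict[int, List[int]], current: int) -> str:
    """Build a 'A → B → C (current)' string by following first inputs upstream.

    Assumes the graph is acyclic; `labels` and `parents` are illustrative inputs.
    """
    path: List[str] = []
    node: Optional[int] = current
    while node is not None:
        path.append(labels[node])
        inputs = parents.get(node, [])
        node = inputs[0] if inputs else None
    path.reverse()  # collected from current node back to the source
    return " → ".join(path) + " (current)"


labels = {13: "Manual Input", 14: "Filter", 15: "Select", 16: "Group By"}
parents = {14: [13], 15: [14], 16: [15]}
lineage_text = lineage(labels, parents, 16)
print(lineage_text)
```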

3. FlowGraph.print_execution_order() - Execution Sequence

pipeline.flow_graph.print_execution_order()
# Execution order:
# 1. Manual Input (id=13)
# 2. Filter (id=14) 
# 3. Select (id=15)
# 4. Group By (id=16)

# Compact format
pipeline.flow_graph.print_execution_order(compact=True)
# Execution order: [13→14→15→16]

# With estimated timing (if available)
pipeline.flow_graph.print_execution_order(show_timing=True)
# Execution order:
# 1. Manual Input (id=13) - ~0.1s
# 2. Filter (id=14) - ~0.3s
# 3. Select (id=15) - ~0.1s  
# 4. Group By (id=16) - ~0.5s
# Total estimated time: ~1.0s
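
The execution order is a topological ordering of the graph. As a rough sketch, it could be derived with the standard library's `graphlib.TopologicalSorter` (Python 3.9+); the `parents` mapping below is an assumed stand-in for whatever FlowGraph stores internally.

```python
from graphlib import TopologicalSorter

# Hypothetical input: node id -> ids of the nodes it depends on.
parents = {13: [], 14: [13], 15: [14], 16: [15]}
labels = {13: "Manual Input", 14: "Filter", 15: "Select", 16: "Group By"}

# static_order() yields each node only after all of its predecessors.
order = list(TopologicalSorter(parents).static_order())

compact = "Execution order: [" + "→".join(str(n) for n in order) + "]"
verbose = ["Execution order:"] + [
    f"{i}. {labels[n]} (id={n})" for i, n in enumerate(order, start=1)
]

print(compact)
print("\n".join(verbose))
```

For graphs with independent branches several valid orders exist, so the printed order should match whatever the actual executor uses rather than an arbitrary topological sort.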

4. FlowGraph.print_summary() - Node Details Summary

pipeline.flow_graph.print_summary()
# Pipeline Summary (4 nodes):
# ┌─────────────────────────────────────────────────────────────────┐
# │ Node 13 (manual_input): Manual Input                           │
# │   Description: Load simulated data                             │
# │   Output: 4 columns, 10000 rows                               │
# │   Schema: [id: i64, quality_score: f64, value: i64, ...]      │
# ├─────────────────────────────────────────────────────────────────┤
# │ Node 14 (filter): Filter                                       │
# │   Description: Reduces data early                              │
# │   Condition: quality_score > 0.9                              │
# │   Output: 4 columns, ~2500 rows (estimated)                   │
# ├─────────────────────────────────────────────────────────────────┤
# │ Node 15 (select): Select                                       │
# │   Description: Reduces columns early                           │
# │   Columns: ["id", "value", "category"]                        │
# │   Output: 3 columns, ~2500 rows                               │
# ├─────────────────────────────────────────────────────────────────┤
# │ Node 16 (group_by): Group By                                   │
# │   Description: Aggregate by category                           │
# │   Group by: ["category"], Agg: [mean(value)]                  │
# │   Output: 2 columns, 4 rows                                   │
# └─────────────────────────────────────────────────────────────────┘

# Compact summary
pipeline.flow_graph.print_summary(compact=True)
# Node 13: Manual Input → 4 cols, 10000 rows
# Node 14: Filter (quality_score > 0.9) → 4 cols, ~2500 rows  
# Node 15: Select (3 columns) → 3 cols, ~2500 rows
# Node 16: Group By (category) → 2 cols, 4 rows
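
The compact summary is mostly a formatting exercise once per-node metadata is available. A minimal sketch, assuming a hypothetical `NodeInfo` record (not an existing Flowfile class) that carries the column count, row count, and whether the row count is an estimate:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeInfo:
    """Illustrative per-node metadata; field names are assumptions."""
    node_id: int
    name: str
    detail: str          # e.g. the filter condition; empty string when none
    n_cols: int
    n_rows: int
    rows_estimated: bool = False


def compact_summary(nodes: List[NodeInfo]) -> List[str]:
    """Format one line per node, marking estimated row counts with '~'."""
    lines = []
    for n in nodes:
        detail = f" ({n.detail})" if n.detail else ""
        rows = f"~{n.n_rows}" if n.rows_estimated else str(n.n_rows)
        lines.append(f"Node {n.node_id}: {n.name}{detail} → {n.n_cols} cols, {rows} rows")
    return lines


nodes = [
    NodeInfo(13, "Manual Input", "", 4, 10000),
    NodeInfo(14, "Filter", "quality_score > 0.9", 4, 2500, rows_estimated=True),
]
summary_lines = compact_summary(nodes)
print("\n".join(summary_lines))
```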
