
Data GraphQL Agent

MCP (Model Context Protocol) agent that generates production-ready Apollo GraphQL servers from BigQuery SQL queries with Dataplex lineage tracking.

Features

  • 🚀 Auto-generate Apollo GraphQL Servers from BigQuery queries
  • 📊 BigQuery Integration with type inference from SQL schemas
  • 📝 Dataplex Lineage Tracking for end-to-end data governance
  • 🐳 Docker Support for containerized deployments
  • 🧪 Test Client Generation for API validation
  • 🔌 MCP Protocol for seamless integration with Cursor and other AI assistants

How It Works

End-to-End Flow

1. Input         →  2. Schema Inference  →  3. Code Generation  →  4. Validation  →  5. Output
BigQuery SQL        Dry-run Analysis        Jinja2 Templates       Multi-level      GCS/Local
Queries             Type Mapping            Apollo Server v4       Checks           Files

Detailed Steps:

  1. Input: You provide BigQuery SQL queries via the MCP tool
  2. Schema Inference: The agent runs a BigQuery dry-run to infer result types
  3. Code Generation: Generates a complete Apollo Server project from Jinja2 templates
  4. Validation (optional): Validates the generated code at the selected level
  5. Output: Writes validated code to GCS or the local filesystem
  6. Deployment: You run the generated Node.js application
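
To illustrate step 2, here is a minimal sketch (not the agent's actual code) of turning inferred column types into GraphQL SDL; the `fields` list stands in for what a BigQuery dry-run job's result schema would yield after type mapping:

```python
def schema_to_sdl(query_name: str, fields: list[tuple[str, str]]) -> str:
    """Build a minimal GraphQL type plus Query field from (column, GraphQL type)
    pairs, as a BigQuery dry-run's result schema would yield after type mapping."""
    type_name = query_name[0].upper() + query_name[1:]
    body = "\n".join(f"  {name}: {gql_type}" for name, gql_type in fields)
    return (
        f"type {type_name} {{\n{body}\n}}\n\n"
        f"type Query {{\n  {query_name}: [{type_name}]\n}}"
    )

print(schema_to_sdl("trendingItems", [("item", "String"), ("total", "Int")]))
```

The real generator emits full resolver and server code via templates; this only shows the shape of the schema-inference output.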

Validation Levels

Choose validation thoroughness based on your needs:

Level     Time  Coverage  Checks                                                 Use Case
Quick     ~1s   80%       GraphQL syntax, SQL dry-run, file structure            Rapid iteration, development
Standard  ~10s  95%       Quick + TypeScript compilation, imports                Default, balanced approach
Full      ~60s  99%       Standard + Docker build, server startup, health check  Pre-production, CI/CD
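
The levels are cumulative: each one runs everything below it plus its own checks. A sketch of that selection logic (check names here are illustrative, taken from the table above, not the agent's internal identifiers):

```python
VALIDATION_CHECKS = {
    "quick": ["graphql_syntax", "sql_dry_run", "file_structure"],
    "standard": ["typescript_compile", "import_resolution"],
    "full": ["docker_build", "server_startup", "health_check"],
}

def checks_for(level: str) -> list[str]:
    """Return the cumulative list of checks for a validation level:
    standard includes quick, and full includes standard."""
    order = ["quick", "standard", "full"]
    if level not in order:
        raise ValueError(f"unknown validation level: {level}")
    selected: list[str] = []
    for name in order[: order.index(level) + 1]:
        selected.extend(VALIDATION_CHECKS[name])
    return selected
```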

Architecture

The agent generates a complete TypeScript/Node.js project with:

  • Apollo Server v4 - GraphQL API server with plugins and context
  • Type-safe resolvers - Auto-generated from BigQuery schemas
  • Dataplex integration - Runtime lineage event tracking
  • Error handling - Production-safe error formatting
  • Docker configuration - Multi-stage builds for production
  • Test suite - Integration tests and test client

Installation

Prerequisites

  • Python 3.10-3.12
  • Poetry (Python dependency management)
  • Google Cloud account with BigQuery access

Setup

# Clone the repository
git clone https://github.com/opendedup/data-graphql-agent.git
cd data-graphql-agent

# Install dependencies
poetry install

# Configure environment variables
cp .env.example .env
# Edit .env with your GCP credentials

Configuration

Create a .env file or set environment variables:

# GCP Configuration
GCP_PROJECT_ID=your-project-id
GCP_LOCATION=us-central1

# Output Configuration
GRAPHQL_OUTPUT_DIR=gs://your-bucket/graphql-server
# Or local path: GRAPHQL_OUTPUT_DIR=/path/to/output

# MCP Server Configuration
MCP_TRANSPORT=stdio  # or http
MCP_HOST=0.0.0.0
MCP_PORT=8080
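
The agent reads these variables at startup. An equivalent loader might look like this (variable names come from the sample above; the defaults are assumptions, and the agent's actual config code may differ):

```python
import os

def load_mcp_config() -> dict:
    # Defaults mirror the sample .env above; unset GCP values stay None.
    return {
        "transport": os.environ.get("MCP_TRANSPORT", "stdio"),
        "host": os.environ.get("MCP_HOST", "0.0.0.0"),
        "port": int(os.environ.get("MCP_PORT", "8080")),
        "project_id": os.environ.get("GCP_PROJECT_ID"),
        "output_dir": os.environ.get("GRAPHQL_OUTPUT_DIR"),
    }
```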

Usage

As MCP Server (Recommended)

Configure in Cursor's mcp.json:

{
  "mcpServers": {
    "data-graphql-agent": {
      "command": "poetry",
      "args": ["run", "python", "-m", "data_graphql_agent.mcp"],
      "cwd": "/path/to/data-graphql-agent",
      "env": {
        "GCP_PROJECT_ID": "your-project",
        "GRAPHQL_OUTPUT_DIR": "gs://your-bucket/graphql-server"
      }
    }
  }
}

Direct Python Usage

from data_graphql_agent.generation import ProjectGenerator
from data_graphql_agent.clients import StorageClient
from data_graphql_agent.models import QueryInput

# Define queries
queries = [
    QueryInput(
        query_name="trendingItems",
        sql="SELECT item, SUM(sales) as total FROM `project.dataset.sales` GROUP BY item",
        source_tables=["project.dataset.sales"]
    )
]

# Generate project
generator = ProjectGenerator(project_id="your-project")
files = generator.generate_project("my-project", queries)

# Write to storage
storage = StorageClient(project_id="your-project")
manifests = storage.write_files("gs://bucket/output", files)

Running as HTTP Server

# Set transport to HTTP
export MCP_TRANSPORT=http
export MCP_PORT=8080

# Start server
poetry run python -m data_graphql_agent.mcp

Then call tools via HTTP:

curl -X POST http://localhost:8080/mcp/call-tool \
  -H "Content-Type: application/json" \
  -d '{
    "name": "generate_graphql_api",
    "arguments": {
      "queries": [...],
      "project_name": "my-project"
    }
  }'

MCP Tools

generate_graphql_api

Generates a complete Apollo GraphQL Server project with validation.

Input:

  • queries: Array of query objects with queryName, sql, and source_tables
  • project_name: Project name for lineage tracking
  • output_path: Optional output location (defaults to GRAPHQL_OUTPUT_DIR)
  • validation_level: Optional validation thoroughness - "quick", "standard" (default), or "full"
  • auto_fix: Optional boolean to attempt automatic error fixes (default: false)

Output:

  • Complete TypeScript/Node.js project
  • Docker configuration
  • Test client
  • Integration tests
  • Validation results with checks passed and warnings

Example with Validation:

result = await handle_generate_graphql_api({
    "queries": [
        {
            "queryName": "salesByRegion",
            "sql": "SELECT region, SUM(amount) as total FROM `project.dataset.sales` GROUP BY region",
            "source_tables": ["project.dataset.sales"]
        }
    ],
    "project_name": "analytics-api",
    "output_path": "./output",
    "validation_level": "standard",  # Quick validation for speed
    "auto_fix": false
})

Success Response:

{
  "success": true,
  "output_path": "./output",
  "files_generated": [...],
  "message": "Successfully generated and validated Apollo GraphQL Server with 1 queries. Generated 15 files at ./output. Validation: 5 checks passed in 8.2s"
}

Validation Failure Response:

{
  "success": false,
  "output_path": "./output",
  "files_generated": [],
  "message": "Code validation failed at standard level",
  "error": "Validation errors: Invalid SQL in query 'salesByRegion': Table not found; TypeScript compilation failed"
}
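
A caller can branch on the success flag in these responses; here is a small helper (hypothetical, mirroring the response shapes above) that turns a result into a one-line summary:

```python
def summarize_result(result: dict) -> str:
    """Summarize a generate_graphql_api response; result follows the
    success/failure shapes shown above."""
    if result.get("success"):
        count = len(result.get("files_generated", []))
        return f"OK: {count} file(s) at {result['output_path']}"
    detail = result.get("error") or result.get("message", "unknown error")
    return f"Generation failed: {detail}"
```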

validate_graphql_schema

Validates a GraphQL schema file.

Input:

  • schema_path: Path to schema file

Output:

  • Validation results with errors and warnings

Generated Project Structure

graphql-server/
├── src/
│   ├── server.ts          # Main Apollo Server
│   ├── typeDefs.ts        # GraphQL schema
│   ├── resolvers.ts       # Query resolvers
│   └── lineage.ts         # Dataplex integration
├── test-client/           # Test client
├── tests/                 # Integration tests
├── package.json
├── tsconfig.json
├── Dockerfile
└── docker-compose.yml

Running Generated Server

cd output/graphql-server

# Install dependencies
npm install

# Development mode
npm run dev

# Production build
npm run build
npm start

# Docker
docker-compose up --build

Development

Running Tests

# Run all tests
poetry run pytest

# Run unit tests only
poetry run pytest tests/unit

# Run with coverage
poetry run pytest --cov=data_graphql_agent

Code Formatting

# Format with Black
poetry run black src tests

# Lint with Ruff
poetry run ruff check src tests

BigQuery Type Mapping

The agent automatically maps BigQuery types to GraphQL types:

BigQuery Type    GraphQL Type
STRING           String
INT64            Int
FLOAT64          Float
BOOL             Boolean
TIMESTAMP/DATE   String (ISO 8601)
STRUCT           Custom Object Type
ARRAY            [Type]

Nested structures (STRUCTs and ARRAYs) are fully supported with automatic type generation.
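
A sketch of that mapping, assuming BigQuery's usual field representation where ARRAYs appear as `REPEATED` mode and STRUCTs as `STRUCT`/`RECORD` field types (the agent's real mapper also generates the nested object types themselves):

```python
BQ_TO_GRAPHQL = {
    "STRING": "String",
    "INT64": "Int",
    "FLOAT64": "Float",
    "BOOL": "Boolean",
    "TIMESTAMP": "String",  # serialized as ISO 8601
    "DATE": "String",       # serialized as ISO 8601
}

def graphql_type(field_type: str, mode: str = "NULLABLE",
                 struct_name: str = "CustomType") -> str:
    """Map one BigQuery field to a GraphQL type. REPEATED fields (ARRAYs)
    become lists; STRUCT/RECORD fields map to a generated object type."""
    if mode == "REPEATED":
        return f"[{graphql_type(field_type, 'NULLABLE', struct_name)}]"
    if field_type in ("STRUCT", "RECORD"):
        return struct_name
    return BQ_TO_GRAPHQL.get(field_type, "String")
```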

Validation Benefits

Why Validate Before Writing?

  1. Catch errors early - Invalid SQL, type mismatches, and syntax errors detected before deployment
  2. Faster iteration - No manual debugging of generated code
  3. Confidence - Know your code will work before running npm install
  4. Cost savings - Avoid wasted GCS writes and Docker builds for broken code
  5. CI/CD friendly - Use full validation in pipelines for guaranteed deployments

When to Use Which Level?

Quick Validation (~1s)

  • ✅ Rapid prototyping and experimentation
  • ✅ Iterating on SQL queries
  • ✅ Testing query-to-schema mappings
  • ❌ Not for production deployments

Standard Validation (~10s) - Recommended Default

  • ✅ Normal development workflow
  • ✅ Before committing to version control
  • ✅ Balanced speed and thoroughness
  • ✅ Most common use case

Full Validation (~60s)

  • ✅ Pre-production deployments
  • ✅ CI/CD pipelines
  • ✅ Critical production updates
  • ✅ When Docker compatibility is essential
  • ❌ Too slow for rapid iteration

Data Lineage

The generated GraphQL server automatically tracks data lineage in Google Cloud Dataplex:

  • Process: Each resolver is registered as a process
  • Run: Each query execution creates a run (with unique request ID)
  • Lineage Events: Link BigQuery sources to BI report targets
  • Cleanup: Graceful shutdown removes lineage processes

Lineage operations are asynchronous (fire-and-forget) and don't block API responses.
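
The same fire-and-forget pattern, sketched here in Python with asyncio for illustration (the generated server implements it in TypeScript; `emit_lineage_event` is a stand-in for the Dataplex call):

```python
import asyncio

async def emit_lineage_event(run_id: str) -> None:
    # Stand-in for the Dataplex lineage API call (network I/O in reality).
    await asyncio.sleep(0)

def _log_failure(task: asyncio.Task) -> None:
    # Log lineage failures instead of raising into the request path.
    if not task.cancelled() and task.exception():
        print(f"lineage emit failed: {task.exception()!r}")

async def resolver() -> dict:
    # Schedule the lineage event without awaiting it, so the GraphQL
    # response is returned immediately.
    task = asyncio.create_task(emit_lineage_event("req-123"))
    task.add_done_callback(_log_failure)
    return {"data": "query result"}
```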

License

Apache 2.0 - See LICENSE for details

Contributing

Contributions are welcome! Please submit pull requests or open issues for bugs and feature requests.
