This repository demonstrates how to extract data from the GitHub REST API using dlt, and load it into Google BigQuery for analysis.
- dlt REST API Source: Connects to the GitHub API and extracts repository metadata for a specified user (a minimal sketch follows this list).
- BigQuery Integration: Stores all extracted data in a BigQuery dataset for fast analytics.
- Secrets Management: Uses `.dlt/secrets.toml` to securely store your GitHub access token, owner name, and Google Cloud service account credentials.
- Automated Extraction: Can be scheduled to run periodically (e.g., via GitHub Actions or cron).
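As a rough illustration of how these pieces can fit together, here is a minimal sketch using dlt's built-in `rest_api` source and the names used elsewhere in this README (`github_pipeline`, `github_data`, `repositories`). The actual `github_pipeline.py` in this repository may be structured differently.

```python
import dlt
from dlt.sources.rest_api import rest_api_source

# Values resolved from .dlt/secrets.toml
owner = dlt.secrets["owner"]

# A REST API source pulling repository metadata for one owner
source = rest_api_source({
    "client": {
        "base_url": "https://api.github.com",
        "auth": {"type": "bearer", "token": dlt.secrets["access_token"]},
    },
    "resources": [
        {
            "name": "repositories",
            "endpoint": {"path": f"users/{owner}/repos"},
        },
    ],
})

pipeline = dlt.pipeline(
    pipeline_name="github_pipeline",  # assumed pipeline name
    destination="bigquery",
    dataset_name="github_data",
)
load_info = pipeline.run(source)
print(load_info)
```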
- Configure Secrets: Add your GitHub personal access token, the target owner, and your Google Cloud service account credentials to `.dlt/secrets.toml`.
- Run the Pipeline: Execute `python github_pipeline.py` to extract all repository data for the owner and load it into BigQuery.
- Analyze Data: Query the `github_data.repositories` table in BigQuery to explore repository metadata (see the example query after this list).
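For example, you can query the loaded table from Python with the `google-cloud-bigquery` client (an extra dependency, not installed by this project). The column names below come from the GitHub API repository response and are assumptions about how the data lands after loading:

```python
from google.cloud import bigquery

# Assumes credentials are configured, e.g. via GOOGLE_APPLICATION_CREDENTIALS
# pointing at your service account key file.
client = bigquery.Client(project="<your_gcp_project_id>")

query = """
    SELECT name, stargazers_count, language
    FROM `github_data.repositories`
    ORDER BY stargazers_count DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row["name"], row["stargazers_count"], row["language"])
```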
- Install dependencies: `pip install 'dlt[bigquery]'`
- Fill in `.dlt/secrets.toml`:

```toml
access_token = "<your_github_pat>"
owner = "<github_username>"

[credentials]
project_id = "<your_gcp_project_id>"
private_key = """
-----BEGIN PRIVATE KEY-----
...your private key...
-----END PRIVATE KEY-----
"""
client_email = "<your_service_account_email>"
```
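Alternatively, dlt can resolve the same values from environment variables instead of the secrets file: TOML keys map to upper-case names, with sections joined by double underscores. A sketch of the equivalents (values are placeholders):

```python
import os

# Equivalent to the secrets.toml entries above
os.environ["ACCESS_TOKEN"] = "<your_github_pat>"
os.environ["OWNER"] = "<github_username>"
os.environ["CREDENTIALS__PROJECT_ID"] = "<your_gcp_project_id>"
os.environ["CREDENTIALS__PRIVATE_KEY"] = "<your_private_key>"
os.environ["CREDENTIALS__CLIENT_EMAIL"] = "<your_service_account_email>"
```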
- Run the pipeline: `python github_pipeline.py`
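After a run, you can inspect what was loaded by attaching to the pipeline by name (the name `github_pipeline` is an assumption about how `github_pipeline.py` names it):

```python
import dlt

# Attach to the existing pipeline and print the most recent load trace
pipeline = dlt.attach(pipeline_name="github_pipeline")
print(pipeline.last_trace)
```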
- Data is loaded into your BigQuery project under the dataset `github_data`, table `repositories`.
- All available fields from the GitHub API response are included.
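Note that dlt normalizes nested JSON before loading; assuming its default naming convention, nested objects become double-underscore columns and nested lists become child tables:

```python
# A repository record like this from the GitHub API...
api_row = {"name": "dlt", "owner": {"login": "<github_username>", "id": 12345}}
# ...lands in BigQuery as the columns: name, owner__login, owner__id
```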
- Keep your personal access token and GCP credentials secret. Do not commit `.dlt/secrets.toml` with real credentials to public repositories.
License: MIT