Sonnet Scripts is a collection of pre-built data architecture patterns that you can quickly spin up on a local machine, along with real-world example datasets to use with them.
One of the challenges of creating content and tutorials about data is the lack of established data infrastructure and real-world datasets. We found ourselves rebuilding the same setup over and over, so we created this open-source repo to expedite the process.
According to the Academy of American Poets, a "...sonnet is a fourteen-line poem written in iambic pentameter, employing one of several rhyme schemes, and adhering to a tightly structured thematic organization." Through the constraints of a particular sonnet format, poets across the centuries have pushed their creativity to express themselves, William Shakespeare being one of the most well-known. Data architectures fill a similar role, with their specific patterns pushing data practitioners to find creative ways to solve business problems.
Welcome to Sonnet Scripts: a fully containerized environment designed for data analysts, analytics engineers, and data engineers to experiment with databases, queries, and ETL pipelines. This repository provides a pre-configured sandbox where you can ingest data, transform it using SQL and Python, and test integrations with PostgreSQL, DuckDB, MinIO, and more!
This project is ideal for:
- Data Engineers who want a lightweight environment for testing data pipelines.
- Analytics Engineers experimenting with dbt and SQL transformations.
- Data Analysts looking for a structured PostgreSQL + DuckDB setup.
- Developers working on data APIs using Python.
## Prerequisites

Before setting up the environment, ensure you have the following installed (a quick sanity check follows the list):

- Docker & Docker Compose
- Make (for automation)
  - Linux/macOS: comes pre-installed
  - Windows: install via Chocolatey with `choco install make`
- Python (3.12+)
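If you want to confirm everything is in place before continuing, a small Python snippet like this can check for the tools. This is an illustrative helper, not part of the repo:

```python
# Quick prerequisite check. Illustrative only, not part of the repo.
import shutil
import sys

for tool in ("docker", "make"):
    if shutil.which(tool) is None:
        sys.exit(f"{tool} not found on PATH")

if sys.version_info < (3, 12):
    sys.exit("Python 3.12+ is required")

print("All prerequisites found.")
```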
## Setup

Clone the repository and run the setup target:

```bash
git clone https://github.com/onthemarkdata/sonnet-scripts.git
cd sonnet-scripts
make setup
```

This will:
- Build the Docker images
- Start the PostgreSQL, DuckDB, and other containers
- Ensure dependencies are installed
## Usage

Load the sample data into PostgreSQL:

```bash
make load-db
```

Verify the database loaded correctly:

```bash
make verify-db
```

Run the test suite:

```bash
make test
```

Open a shell inside a specific container:

```bash
make exec-pythonbase
make exec-postgres
make exec-duckdb
make exec-pipelinebase
```

Move data from PostgreSQL to MinIO:

```bash
make load-db-postgres-to-minio
```

This command (a conceptual Python sketch follows the list):
- Exports a sample of data from PostgreSQL to CSV
- Transfers the CSV to the pipelinebase container
- Converts the CSV to Parquet and uploads to MinIO
- Cleans up temporary files
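To make the shape of this step concrete, here is a minimal Python sketch of the same flow, assuming pandas, pyarrow, SQLAlchemy, and the `minio` client are available. The credentials, table, and bucket names are placeholders, not the repo's actual configuration:

```python
# Illustrative sketch of the PostgreSQL -> CSV -> Parquet -> MinIO flow.
# Credentials, table, and bucket names are placeholders, not repo values.
import os

import pandas as pd
from minio import Minio
from sqlalchemy import create_engine

# 1. Export a sample of data from PostgreSQL to CSV.
engine = create_engine("postgresql://user:password@localhost:5432/sonnet")
df = pd.read_sql("SELECT * FROM sample_table LIMIT 1000", engine)
df.to_csv("/tmp/sample.csv", index=False)

# 2. Convert the CSV to Parquet (columnar and compressed; needs pyarrow).
pd.read_csv("/tmp/sample.csv").to_parquet("/tmp/sample.parquet", index=False)

# 3. Upload the Parquet file to MinIO through its S3-compatible API.
client = Minio("localhost:9000", access_key="minioadmin",
               secret_key="minioadmin", secure=False)
client.fput_object("data-bucket", "sample.parquet", "/tmp/sample.parquet")

# 4. Clean up the temporary files.
os.remove("/tmp/sample.csv")
os.remove("/tmp/sample.parquet")
```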
Move data from MinIO to DuckDB:

```bash
make load-db-minio-to-duckdb
```

Inspect what landed in MinIO and DuckDB:

```bash
make check-minio
make check-duckdb
```

Run every pipeline end to end:

```bash
make run-all-data-pipelines
```

This runs the entire ETL process from PostgreSQL to MinIO to DuckDB.
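For reference, DuckDB can read Parquet straight out of MinIO through its S3-compatible API via the `httpfs` extension. A minimal sketch, where the endpoint, credentials, and bucket are again illustrative rather than the repo's actual values:

```python
# Illustrative sketch: DuckDB reading Parquet directly from MinIO.
# Endpoint, credentials, and bucket are placeholders, not repo values.
import duckdb

con = duckdb.connect("analytics.duckdb")

# httpfs lets DuckDB talk to S3-compatible object stores such as MinIO.
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET s3_endpoint='localhost:9000';")
con.execute("SET s3_access_key_id='minioadmin';")
con.execute("SET s3_secret_access_key='minioadmin';")
con.execute("SET s3_use_ssl=false;")
con.execute("SET s3_url_style='path';")  # MinIO serves path-style URLs

# Materialize the Parquet object as a local DuckDB table.
con.execute("""
    CREATE OR REPLACE TABLE sample AS
    SELECT * FROM read_parquet('s3://data-bucket/sample.parquet')
""")
print(con.execute("SELECT count(*) FROM sample").fetchone())
```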
Stop all containers:

```bash
make stop
```

Rebuild the environment:

```bash
make rebuild
```

Rebuild from scratch:

```bash
make rebuild-clean
```

This removes all containers, volumes, and images before rebuilding from scratch.
Check container status:

```bash
make status
```

View logs for all containers:

```bash
make logs
```

For a specific container:

```bash
make logs c=container_name
```
## Project Structure

```
sonnet-scripts
├── pythonbase/          # Python-based processing container
├── pipelinebase/        # ETL pipeline and data ingest container
├── linuxbase/           # Base container for Linux dependencies
├── jupyterbase/         # Jupyter container for analytics and data science
├── docker-compose.yml   # Container orchestration
├── Makefile             # Automation commands
└── README.md            # You are here!
```

## CI/CD

GitHub Actions automates builds, tests, and environment validation. The pipeline:
- Builds the Docker images (`pythonbase`, `linuxbase`)
- Starts all services using `docker compose`
- Runs unit & integration tests (`make test`)
- Shuts down containers after the tests pass
It triggers on:

- Pushes to `main` or `feature/*`
- Pull requests to `main`
## Contributing

Want to improve Sonnet Scripts? Here's how:
- Fork the repository
- Make your changes and test them locally
- Submit a pull request (PR) for review
For major changes, please open an issue first to discuss your proposal.
We follow [Conventional Commits](https://www.conventionalcommits.org/) for all commit messages (e.g., `feat: add duckdb loader` or `fix: handle empty csv exports`).
## Maintainers

- **Juan Pablo Urrutia**: GitHub [jpurrutia](https://github.com/jpurrutia), LinkedIn: Juan Pablo Urrutia
- **Mark Freeman**: GitHub [onthemarkdata](https://github.com/onthemarkdata), LinkedIn: Mark Freeman II
## Support

If you have questions or encounter issues, feel free to:
- Open a GitHub issue
- Contact us directly via LinkedIn
- COMING SOON: Join our Discord community
Happy data wrangling!
