Automated Service Level Objective (SLO) testing for YDB database SDKs with chaos engineering and performance monitoring built-in.
YDB SLO Action helps you test your YDB SDK's reliability under real-world conditions. Instead of just running tests against a perfect database, this action:
- 🚀 Deploys a full YDB cluster (1 storage + 3 database nodes)
- 💥 Introduces chaos (random node failures, network issues, etc.)
- 📊 Collects metrics via Prometheus during your tests
- 📈 Generates reports comparing performance with your base branch
- 💬 Posts results directly to your PR for easy review
Think of it as a way to answer: "Will my SDK handle production issues gracefully?"
Add this to your GitHub Actions workflow:

```yaml
name: SLO Test

on: pull_request

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      # Deploy YDB cluster with chaos testing
      - uses: ydb-platform/ydb-slo-action/init@v1
        with:
          workload_name: my-sdk-test
          github_token: ${{ secrets.GITHUB_TOKEN }}

      # Run your SDK tests
      - name: Run workload
        run: ./scripts/slo-test.sh

  report:
    needs: test
    runs-on: ubuntu-latest
    steps:
      # Generate and post performance report
      - uses: ydb-platform/ydb-slo-action/report@v1
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          github_run_id: ${{ github.run_id }}
```

That's it! The action handles infrastructure, chaos injection, metrics collection, and reporting automatically.
1. init action (runs before your tests):
- Deploys YDB cluster using Docker Compose
- Starts Prometheus for metrics collection
- Launches chaos monkey that randomly introduces failures
- Saves state for later cleanup
2. report action (runs after your tests):
- Collects metrics from Prometheus
- Fetches metrics from your base branch for comparison
- Renders a beautiful report with ASCII charts
- Updates PR comment with results (one comment per workload)
While your SDK tests run, the chaos monkey randomly:
- Stops nodes gracefully or with SIGKILL
- Pauses containers (simulating freezes)
- Introduces network black holes
- Performs rolling restarts
Your tests should handle these scenarios gracefully. The metrics show how well your SDK copes with failures.
Want to track your own Prometheus queries? Provide custom metrics:

```yaml
- uses: ydb-platform/ydb-slo-action/init@v1
  with:
    workload_name: my-test
    github_token: ${{ secrets.GITHUB_TOKEN }}
    metrics_yaml: |
      - name: my_custom_metric
        query: rate(http_requests_total[5m])
        step: 15s
```

Fork this repo and add your own chaos scripts to `deploy/chaos/scenarios/`. See existing scenarios for examples.
Welcome! Here's how to start contributing to this project.
- Bun (package manager) — install from bun.sh
- Docker (for local testing)
- Basic understanding of TypeScript and GitHub Actions
```shell
# Clone and install dependencies
git clone https://github.com/ydb-platform/ydb-slo-action.git
cd ydb-slo-action
bun install
```

- Make your changes in the `init/` or `report/` directories
- Build the action to verify everything works:

  ```shell
  bun run bundle
  ```

- Commit your changes — husky will automatically:
  - Run linting and formatting
  - Rebuild the `dist/` directory
  - Stage the rebuilt files
Important: Never edit files in `dist/` manually! They're auto-generated.
You can test the infrastructure locally:
```shell
cd deploy
docker compose up -d
```

This starts:

- YDB cluster (1 storage + 3 database nodes)
- Prometheus on port 9090
- Chaos monkey injecting faults

Stop everything with:

```shell
docker compose down
```

We use automated formatting, so you don't need to worry about style. Just follow these conventions:
- Import with `.js` extensions: `import { x } from './module.js'` (ESM requirement)
- Use the `node:` prefix: `import * as fs from 'node:fs'`
- Prefer `let` over `const` (project convention)

Run linting and formatting manually:

```shell
bun run lint    # Fix linting issues
bun run format  # Format code
```

We use emoji-based commit messages for easy scanning:
```
✨ Add custom metrics support

Users can now provide custom Prometheus queries via the metrics_yaml
input parameter. This allows tracking SDK-specific metrics without
forking the action.
```
Emoji guide:
- ✨ New feature
- 🐛 Bug fix
- 📝 Documentation
- ♻️ Refactoring
- 🔧 Configuration/build changes
- 🐳 Docker-related changes
- 🧪 Tests
- 🚀 CI/CD changes
Rules:
- Use imperative mood ("Add" not "Added")
- Capitalize after emoji
- No period at end of subject line
- Explain WHAT and WHY in the body (not HOW)
Understanding the project structure will help you contribute effectively.
Actions are split into lifecycle files (main.ts, post.ts) that orchestrate, and utility modules (lib/) that do the heavy lifting. This prevents monolithic files and makes testing easier.
Everything is defined declaratively:
- Docker Compose for services
- YAML for metrics
- Shell scripts for chaos scenarios
This means users can extend functionality without understanding TypeScript.
The init action saves metrics as GitHub Artifacts, and the report action downloads them later. This decouples the actions and allows flexible workflow design.
Users customize behavior through inputs and config files, not code changes. This lowers the barrier to adoption.
```
init/
├── main.ts        # Entry point (deploys infrastructure)
├── post.ts        # Cleanup (collects metrics, uploads artifacts)
└── lib/           # Utility modules (docker, prometheus, github, etc.)

report/
├── main.ts        # Entry point (generates and posts report)
└── lib/           # Utility modules (workflow, metrics, charts, etc.)

deploy/
├── compose.yml    # Docker Compose definition
├── metrics.yaml   # Default Prometheus queries
├── ydb/
│   ├── Dockerfile # YDB node image
│   └── rootfs/    # Files copied to container root (/)
└── chaos/
    ├── Dockerfile # Chaos monkey image
    └── rootfs/    # Files copied to container root (/)

dist/              # Auto-generated (don't edit!)
```
We use the rootfs pattern for organizing Docker images (inspired by Bitnami containers):
- Each service directory (e.g., `ydb/`, `chaos/`) contains:
  - `Dockerfile` — the image definition
  - `rootfs/` — the directory structure as it will appear in the container
- In the Dockerfile, `COPY rootfs /` copies the entire `rootfs/` content to the container's root filesystem
- Example: `deploy/chaos/rootfs/opt/ydb.tech/scripts/chaos/libchaos.sh` becomes `/opt/ydb.tech/scripts/chaos/libchaos.sh` in the container
Why this pattern? It makes the file structure explicit and easy to navigate: you can see exactly which files will land in the container just by browsing the `rootfs/` directory.
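As a concrete illustration, a service image following this pattern is often little more than the following sketch (the base image here is a placeholder, not the project's actual Dockerfile):

```dockerfile
# Placeholder base image — each service chooses its own
FROM alpine:3.20

# Overlay the entire rootfs/ tree onto the container's root filesystem
COPY rootfs /
```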
```
┌──────────────┐
│ init action  │
│  (main.ts)   │ ← Deploys YDB cluster, starts chaos, saves state
└──────┬───────┘
       │
       ↓
┌──────────────┐
│  User tests  │ ← Your SDK tests run here
└──────┬───────┘
       │
       ↓
┌──────────────┐
│ init action  │
│  (post.ts)   │ ← Collects metrics, uploads as artifacts
└──────┬───────┘
       │
       ↓
┌──────────────┐
│report action │ ← Downloads artifacts, generates report, posts to PR
└──────────────┘
```
The init action uses GitHub Actions' pre/post pattern:
- `main.ts` runs before the user workload
- The user's test scripts run
- `post.ts` runs after (even if tests fail)
This ensures cleanup and metrics collection always happen.
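Concretely, the pre/post hook-up lives in the action's metadata file. A sketch of what the relevant part of `init/action.yml` might look like (the file paths and Node version are illustrative, not copied from the repo):

```yaml
runs:
  using: node20
  main: dist/main.js
  post: dist/post.js
```

The runner executes `post` even when intermediate steps fail, which is what guarantees cleanup and metrics collection.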
Data flows from `main.ts` to `post.ts` using GitHub Actions' `saveState()` and `getState()` APIs. We save:
- Working directory path
- Workload name
- PR number
- Start timestamp
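Conceptually, `saveState()` appends `name=value` entries to the file referenced by the `GITHUB_STATE` environment variable, and the runner re-exposes each entry to the post step as a `STATE_<name>` environment variable. Here is a self-contained sketch of that round trip (the helper names are illustrative, not the action's actual code):

```typescript
import * as fs from 'node:fs'
import * as os from 'node:os'
import * as path from 'node:path'

// main.ts side: persist a value for the post step
function saveState(stateFile: string, name: string, value: string): void {
	fs.appendFileSync(stateFile, `${name}=${value}${os.EOL}`)
}

// post.ts side: the runner exposes each saved entry as STATE_<name>
function getState(env: Record<string, string | undefined>, name: string): string {
	return env[`STATE_${name}`] ?? ''
}

// Simulate one main → post round trip
let stateFile = path.join(os.tmpdir(), 'github_state_demo')
fs.writeFileSync(stateFile, '')
saveState(stateFile, 'workload_name', 'my-sdk-test')

// Parse the state file the way the runner would, building the post step's env
let env: Record<string, string> = {}
for (let line of fs.readFileSync(stateFile, 'utf8').split(os.EOL)) {
	if (!line) continue
	let [name, value] = line.split('=')
	env[`STATE_${name}`] = value
}

console.log(getState(env, 'workload_name')) // my-sdk-test
```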
- Define metrics as YAML (name, PromQL query, step)
- Parse YAML at runtime
- Query Prometheus API
- Serialize as JSONL (one JSON object per line)
Why JSONL? Easier to append, process line-by-line, and less memory-intensive than JSON arrays.
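The serialization step can be sketched as follows (the record shape is illustrative — the action's actual fields may differ):

```typescript
// One metric sample as it might appear in the JSONL file
type MetricSample = { name: string; timestamp: number; value: number }

// Serialize: one JSON object per line — appendable and streamable
function toJsonl(samples: MetricSample[]): string {
	return samples.map((s) => JSON.stringify(s)).join('\n')
}

// Parse line-by-line without materializing one large JSON array
function fromJsonl(text: string): MetricSample[] {
	return text
		.split('\n')
		.filter((line) => line.length > 0)
		.map((line) => JSON.parse(line) as MetricSample)
}

let samples: MetricSample[] = [
	{ name: 'oks', timestamp: 1700000000, value: 42 },
	{ name: 'oks', timestamp: 1700000015, value: 45 },
]

let jsonl = toJsonl(samples)
console.log(jsonl.split('\n').length) // 2
console.log(fromJsonl(jsonl)[1].value) // 45
```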
- Download current run's metrics from artifacts
- Fetch latest successful base branch run
- Download base branch metrics
- Merge both datasets (current first, base second)
- Render comparison with ASCII charts
Why not use a database? Keeps the action stateless and doesn't require external services.
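A minimal sketch of the merge step, assuming a simplified record shape (the field and function names are illustrative, not the action's actual code):

```typescript
type Sample = { name: string; value: number }
type Tagged = Sample & { source: 'current' | 'base' }

// Current first, base second — preserving the order the comparison relies on
function mergeDatasets(current: Sample[], base: Sample[]): Tagged[] {
	return [
		...current.map((s) => ({ ...s, source: 'current' as const })),
		...base.map((s) => ({ ...s, source: 'base' as const })),
	]
}

let merged = mergeDatasets(
	[{ name: 'latency_p99', value: 12 }], // this run's metrics
	[{ name: 'latency_p99', value: 15 }], // base branch metrics
)
console.log(merged.length) // 2
console.log(merged[0].source) // current
```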
Chaos scenarios are simple shell scripts. Here's a template:

```shell
#!/bin/sh
set -e # Fail fast

# Load helper functions
. /opt/ydb.tech/scripts/chaos/libchaos.sh

echo "Scenario: Your description"

# Select a random target
nodeForChaos=$(get_random_database_node)
echo "Selected node: ${nodeForChaos}"

# Your chaos logic (e.g., docker stop, pause, network manipulation)
docker stop "${nodeForChaos}" -t 30
sleep 5
docker start "${nodeForChaos}"

echo "Scenario completed"
```

Naming convention: `NN-descriptive-name.sh` (e.g., `01-graceful-stop.sh`)
Helper functions available:
- `get_random_database_node` — pick a random database node
- `get_random_storage_node` — pick a random storage node
- `get_random_node` — pick any random YDB node
- `log "message"` — timestamped logging
Golden rules:
- Always restore to healthy state — don't leave the system broken
- Use randomization — avoid predictable patterns
- Add logging — use `echo` statements for observability
Check out existing scenarios in deploy/chaos/scenarios/:
- `01-graceful-stop.sh` — stops a node gracefully, then restarts it
- `03-sigkill.sh` — sends SIGKILL to a node
- `06-ip-blackhole.sh` — simulates DNS cache poisoning
Never edit dist/ manually! It's auto-generated by the bundler. When you commit source changes, husky automatically rebuilds dist/ and stages it for you.
Why? GitHub Actions can only run JavaScript, not TypeScript. We bundle TypeScript into optimized JavaScript in dist/.
TypeScript ESM requires `.js` extensions in import paths, even though the source files are `.ts`:

```typescript
// ✅ Correct
import { func } from './module.js'

// ❌ Wrong (will fail at runtime)
import { func } from './module'
```

This trips up many developers! It's a TypeScript ESM requirement, not our choice.
Always run Docker Compose commands with cwd set to the directory containing compose.yml. Docker resolves relative paths based on working directory.
Use the pattern `{workload}-{type}.{extension}`:

- `my-workload-metrics.jsonl`
- `my-workload-logs.txt`
- `my-workload-pull.txt`
This prevents conflicts when multiple workloads run in the same workflow.
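A tiny illustrative helper for this pattern (not the action's actual code):

```typescript
// Build an artifact name following {workload}-{type}.{extension}
function artifactName(workload: string, type: string, extension: string): string {
	return `${workload}-${type}.${extension}`
}

console.log(artifactName('my-workload', 'metrics', 'jsonl')) // my-workload-metrics.jsonl
```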
The action only needs these permissions:
- Read PR information
- Upload/download artifacts
- Post PR comments
Always use secrets.GITHUB_TOKEN provided by GitHub Actions (minimum permissions).
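If you prefer to pin permissions explicitly, a minimal `permissions` block along these lines should suffice (the scope names are standard GitHub Actions permissions — adjust to your workflow):

```yaml
permissions:
  contents: read        # check out the repository
  pull-requests: write  # post/update PR comments
  actions: read         # download artifacts from other runs
```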
The chaos monkey has privileged access to the Docker socket. This means chaos scripts can manipulate any container. Review scripts carefully before adding them.
Artifacts may contain sensitive logs and metrics. Ensure your repository access controls match your data sensitivity.
Set this in your workflow to see debug logs:
```yaml
env:
  ACTIONS_STEP_DEBUG: true
```

The action copies `deploy/` to `.slo/` in the working directory:

```shell
cd .slo
docker compose logs
```

Get the Prometheus container IP and query it directly:

```shell
docker inspect prometheus | grep IPAddress
curl "http://<ip>:9090/api/v1/query?query=up"
```

Download artifacts from the GitHub Actions UI to inspect raw data:
- Metrics: JSONL format (one JSON object per line)
- Logs: Plain text
- Events: JSONL format
We welcome contributions! Before submitting a PR, please:
- Read this README thoroughly
- Check out `CONTRIBUTING.txt` for the Yandex CLA details
- Make sure your changes follow our code style
- Test locally with `docker compose up`
- Ensure `bun run bundle` completes without errors
External contributors must agree to the Yandex CLA before we can merge PRs.
This project is licensed under the Apache License 2.0. See LICENSE for details.
Questions? Open an issue or reach out to the maintainers. We're happy to help!