8 changes: 4 additions & 4 deletions .typos.toml
@@ -1,9 +1,9 @@
[default]
locale = "en-au"
locale = "en-us"

[default.extend-identifiers]
center = "center" # Due to CSS usage
authorization = "authorization" # due to rbac.authorization.k8s.io usage
[default.extend-words]
# Pydantic uses British spelling for this method name
customise = "customise"

[files]
ignore-vcs = true
169 changes: 169 additions & 0 deletions AGENTS.md
@@ -0,0 +1,169 @@
## Project Overview

Netchecks is a cloud native tool for testing network conditions and asserting that they meet expectations. The repository contains two main components:

1. **Netcheck CLI/Python Library** - Command line tool and Python library for running network checks (DNS, HTTP) with customizable validation rules written in CEL (Common Expression Language)
2. **Netchecks Operator** - Kubernetes Operator that schedules and runs network checks, reporting results as PolicyReport resources

## Architecture

### Netcheck CLI (Python Library)
- **Entry point**: `netcheck/cli.py` - Uses Typer for the CLI interface, providing the `run`, `http`, and `dns` commands
- **Core logic**: `netcheck/runner.py` - Contains `run_from_config()` for running multiple assertions and `check_individual_assertion()` for individual tests
- **Check implementations**: `netcheck/checks/` - Separate modules for DNS (`dns.py`), HTTP (`http.py`), and internal checks
- **Validation**: `netcheck/validation.py` - CEL expression evaluation for custom validation rules
- **Context system**: `netcheck/context.py` - Template replacement using external data from files, inline data, or directories (with lazy loading via `LazyFileLoadingDict`)

Test results include `spec` (test configuration), `data` (results), and `status` (pass/fail). Custom validation rules can reference both `data` and `spec` in CEL expressions.
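
To make that concrete, here is a hedged sketch of the shape a single DNS result might take (the `spec`/`data`/`status` split is from above; the nested field names are assumptions informed by the default CEL rules later in this document):

```python
# Illustrative sketch only; the exact fields vary by check type and netcheck version.
example_dns_result = {
    "spec": {                 # the test configuration that was requested
        "type": "dns",
        "host": "github.com",             # assumed example value
    },
    "data": {                 # what the probe actually observed
        "response-code": "NOERROR",
        "A": ["140.82.112.3"],              # assumed example value
        "startTimestamp": "2024-01-01T00:00:00.000000",
        "endTimestamp": "2024-01-01T00:00:00.150000",
    },
    "status": "pass",         # pass/fail after the validation rule is applied
}
```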

### Netchecks Operator (Kubernetes)
- **Main operator**: `operator/netchecks_operator/main.py` - Kopf-based operator with handlers for NetworkAssertion CRD lifecycle
- **Configuration**: `operator/netchecks_operator/config.py` - Settings loaded from environment variables
- **Flow**:
1. NetworkAssertion CRD created → operator creates ConfigMap with rules + Job/CronJob with probe pod
2. Probe pod runs netcheck CLI with mounted config
3. Operator daemon monitors probe pod completion
4. Results extracted from pod logs and transformed into PolicyReport CRD
5. Prometheus metrics updated with test duration and results
- **Helm chart**: `operator/charts/netchecks/` - Includes NetworkAssertion and PolicyReport CRDs

Key transformation: K8s concepts (ConfigMap/Secret contexts) are mapped to the CLI format (directory/file/inline contexts) via `transform_context_for_config_file()` at `operator/netchecks_operator/main.py:208`.
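
A rough illustration of that mapping (the Kubernetes-side field names here are assumptions, not the operator's verified schema; the CLI-side shape follows the context handling in `netcheck/runner.py`):

```python
# Hypothetical example: a NetworkAssertion context backed by a ConfigMap...
k8s_context = {"name": "my-context", "configMap": {"name": "some-config-map"}}  # assumed shape

# ...is rewritten into the CLI context format, where the mounted ConfigMap
# becomes a "directory" context whose files netcheck loads on demand.
cli_context = {"name": "my-context", "type": "directory", "path": "/mnt/context/my-context"}  # assumed mount path
```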

## Development Setup

The CLI/library uses **uv** for Python dependency management (not pip or Poetry).

### CLI/Library Development

Install dependencies:
```bash
uv sync
```

Run tests with coverage:
```bash
uv run pytest tests --cov netcheck --cov-report=lcov --cov-report=term
```

Run a single test:
```bash
uv run pytest tests/test_cli.py::test_name -v
```

Run the CLI locally:
```bash
uv run netcheck dns --host github.com -v
uv run netcheck http --url https://github.com/status -v
uv run netcheck run --config example-config.json -v
```
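
The file passed to `--config` is JSON; the sketch below (written as a Python dict for illustration) shows the general shape, but the field names are assumptions and should be checked against the docs and `tests/` rather than taken from here:

```python
# Hypothetical config sketch; verify field names against the real examples in the repo.
example_config = {
    "assertions": [
        {
            "name": "github-should-resolve",
            "rules": [
                {"type": "dns", "host": "github.com"},   # assumed rule shape
            ],
        }
    ],
}
```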

### Operator Development

The operator uses **Poetry** for dependency management (separate from the CLI).

Install operator dependencies:
```bash
cd operator
poetry install --with dev
```

Run operator tests (requires a running Kubernetes cluster):
```bash
cd operator
pytest -v
```

Integration tests require:
- Kind cluster with Cilium CNI (see `.github/workflows/ci.yaml:269-282`)
- PolicyReport CRD installed
- Netcheck operator and probe images loaded into cluster

Run operator locally (outside cluster):
```bash
cd operator
poetry run kopf run netchecks_operator/main.py --liveness=http://0.0.0.0:8080/healthz
```

### Docker Build

Build probe image:
```bash
docker build -t ghcr.io/hardbyte/netchecks:main .
```

Build operator image:
```bash
docker build -t ghcr.io/hardbyte/netchecks-operator:main operator/
```

### Code Quality

Format code with ruff (CLI uses uv, operator uses poetry):
```bash
# CLI
uv run ruff format .

# Operator
cd operator
poetry run ruff format .
```

Lint with ruff:
```bash
# CLI
uv run ruff check .

# Operator
cd operator
poetry run ruff check .
```

## Testing Philosophy

- **Unit tests** in `tests/` test the CLI and library functions
- **Integration tests** in `operator/tests/` deploy NetworkAssertion resources to a real Kubernetes cluster and verify PolicyReport results
- CI runs tests on ubuntu-latest, windows-latest, and macos-13 with Python 3.11 and 3.12

## Key Configuration Files

- `pyproject.toml` - CLI package metadata and dependencies (using uv)
- `operator/pyproject.toml` - Operator package metadata and dependencies (using poetry)
- `operator/charts/netchecks/values.yaml` - Helm chart configuration
- `operator/manifests/deploy.yaml` - Static Kubernetes manifests

## Release Process

1. Update version in `pyproject.toml` (CLI) and `operator/pyproject.toml` (operator)
2. Push to main branch
3. Create GitHub release
4. CI automatically:
- Publishes package to PyPI
- Builds and pushes Docker images to ghcr.io
- Runs integration tests with Kind + Cilium

## CEL Validation Examples

Default DNS validation rule:
```cel
data['response-code'] == 'NOERROR' &&
size(data['A']) >= 1 &&
(timestamp(data['endTimestamp']) - timestamp(data['startTimestamp']) < duration('10s'))
```

Default HTTP validation rule:
```cel
data['status-code'] in [200, 201]
```

Custom validation with JSON parsing:
```cel
parse_json(data.body).headers['X-Header'] == 'expected-value'
```
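
These rules are evaluated by `evaluate_cel_with_context()` in `netcheck/validation.py`, which also registers the helper functions `parse_json`, `parse_yaml`, `b64decode`, and `b64encode`. A hedged sketch of calling it directly (the context values are made up for illustration):

```python
from netcheck.validation import evaluate_cel_with_context

# Made-up context; in real runs this is built from the check's spec and data.
context = {
    "data": {
        "status-code": 200,
        "body": '{"headers": {"X-Header": "expected-value"}}',
    }
}

assert evaluate_cel_with_context(context, "data['status-code'] in [200, 201]")
assert evaluate_cel_with_context(
    context, "parse_json(data.body).headers['X-Header'] == 'expected-value'"
)
```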

## Important Notes

- The operator uses the Kopf framework to handle Kubernetes CRD lifecycle events
- Template strings in NetworkAssertion specs use `{{ variable }}` syntax and are replaced via `replace_template()` in `netcheck/context.py` (see the sketch after this list)
- Sensitive fields (headers) are redacted from output unless the `--disable-redaction` flag is used
- The PolicyReport CRD must be installed before the operator (it comes from the wg-policy-prototypes project)
- Operator metrics are exposed on port 9090 (configurable) using OpenTelemetry + Prometheus
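
A hedged sketch of the templating behaviour mentioned above (example values are invented; `replace_template()` is called with `(data, context)` in `netcheck/runner.py`, but confirm the supported input types in `netcheck/context.py`):

```python
from netcheck.context import replace_template

# Invented template and context for illustration.
template = {"url": "https://{{ host }}/status", "headers": {"Authorization": "Bearer {{ token }}"}}
context = {"host": "github.com", "token": "example-token"}

rendered = replace_template(template, context)
# Expected (assuming dict inputs are supported, as suggested by run_from_config):
# {"url": "https://github.com/status", "headers": {"Authorization": "Bearer example-token"}}
```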
2 changes: 1 addition & 1 deletion Dockerfile
@@ -12,7 +12,7 @@
uv sync --frozen --no-dev --no-install-project

# Copy the application code to the build stage
ADD . /app

Check failure on line 15 in Dockerfile (GitHub Actions / check-linters): DL3020 error: Use COPY instead of ADD for files and folders

RUN --mount=type=cache,target=/root/.cache \
uv sync --frozen --no-dev
@@ -24,7 +24,7 @@
# Set environment variables for Python
ENV PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
# VIRTUAL_ENV=/app/.venv \
VIRTUAL_ENV=/app/.venv \
PATH="/app/.venv/bin:$PATH" \
USERNAME=netchecks \
USER_UID=1000 \
15 changes: 3 additions & 12 deletions README.md
@@ -254,25 +254,16 @@ Kubernetes operator to inject data.
Update version in `pyproject.toml`, push to `main` and create a release on GitHub. The PyPI release will be carried
out by GitHub Actions.

Install dev dependencies with Poetry:
Install dev dependencies with `uv`:

```shell
poetry install --with dev
```

### Manual Release
To release manually, use Poetry:

```shell
poetry version patch
poetry build
poetry publish
uv sync
```

### Testing

Pytest is used for testing.

```shell
poetry run pytest
uv run pytest
```
2 changes: 1 addition & 1 deletion docs/README.md
@@ -19,7 +19,7 @@ npm run dev

Finally, open [http://localhost:3000](http://localhost:3000) in your browser to view the website.

## Customising
## Customizing

You can start editing this template by modifying the files in the `/src` folder. The site will auto-update as you edit these files.

2 changes: 1 addition & 1 deletion docs/src/pages/docs/alerting.md
@@ -51,7 +51,7 @@ policy reports a failure status. The alert includes details about the failing po

## Configure Alerting in Grafana

Grafana can visualise the alerts generated by Prometheus and also send notifications through various channels such as email, Slack, or PagerDuty. Grafana alert can be configured manually via the UI, or via a configuration file.
Grafana can visualize the alerts generated by Prometheus and also send notifications through various channels such as email, Slack, or PagerDuty. Grafana alerts can be configured manually via the UI, or via a configuration file.

### Configure Alerting via UI

2 changes: 1 addition & 1 deletion docs/src/pages/docs/dns.md
@@ -133,7 +133,7 @@ metadata:
name: cluster-dns-should-work
namespace: default
annotations:
description: Check cluster dns behaviour
description: Check cluster dns behavior
spec:
# Every 20 minutes
schedule: "*/20 * * * *"
2 changes: 1 addition & 1 deletion docs/src/pages/index.md
@@ -87,7 +87,7 @@ The `PolicyReport` contains information about the test run and the results of th

{% quick-link title="Architecture guide" icon="presets" href="/" description="Learn how the internals work and contribute." /%}

{% quick-link title="API reference" icon="theming" href="/" description="Learn to easily customise and modify your app's visual design to fit your brand." /%}
{% quick-link title="API reference" icon="theming" href="/" description="Learn to easily customize and modify your app's visual design to fit your brand." /%}

{% quick-link title="Examples" icon="plugins" href="/" description="See how others are using the library in their projects." /%}

29 changes: 26 additions & 3 deletions netcheck/context.py
@@ -80,14 +80,30 @@ def __init__(self, directory, *args, **kwargs):
self.directory = directory
super().__init__(*args, **kwargs)
# Pre-populate the dictionary with keys for each file in the directory
# Skip Kubernetes ConfigMap metadata files (symlinks starting with '..')
for filename in os.listdir(directory):
# Skip hidden files and Kubernetes ConfigMap symlinks (..data, ..2025_*, etc.)
if filename.startswith('.'):
continue
# We'll use None as a placeholder for the file contents
# We could strip filename extensions, but I think it is clearer not to
# os.path.splitext(filename)[0]
self[filename] = None

def __getitem__(self, key):
# Prevent path traversal attacks by checking if key contains path separators
if os.path.sep in key or (os.altsep and os.altsep in key) or key.startswith('.'):
raise KeyError(f"Invalid key: {key}. Path separators and relative paths are not allowed.")

filepath = os.path.join(self.directory, key)

# Additional safety check: ensure the resolved path is within the directory
try:
filepath = os.path.realpath(filepath)
directory = os.path.realpath(self.directory)
if not filepath.startswith(directory + os.path.sep):
raise KeyError(f"Path traversal detected: {key}")
except (OSError, ValueError) as e:
raise KeyError(f"Invalid path: {key}") from e

if super().__getitem__(key) is None and os.path.isfile(filepath):
# If the value is None (our placeholder), replace it with the actual file contents
with open(filepath, "rt") as f:
@@ -96,5 +96,12 @@ def __getitem__(self, key):

def items(self):
# Override items() to call __getitem__ for each key
# Required because CEL calls items() when converting to CEL Map type.
return [(key, self[key]) for key in self]

def materialize(self):
"""
Force load all lazy-loaded file contents and return a regular dict.
This is needed for compatibility with the Rust CEL library which doesn't
properly handle dict subclasses.
"""
return {key: self[key] for key in self}
8 changes: 5 additions & 3 deletions netcheck/runner.py
@@ -49,9 +49,11 @@ def run_from_config(
inline_context = replace_template(inline_context, context)
context[c["name"]] = inline_context
elif c["type"] == "directory":
# Return a Dict like object that lazy loads individual files
# from the directory (with caching) and add them to the context
context[c["name"]] = LazyFileLoadingDict(c["path"])
# Load a LazyFileLoadingDict and immediately materialize it to a regular dict
# The Rust CEL library doesn't properly handle dict subclasses, so we need
# to convert it to a regular dict before using it in CEL expressions
lazy_dict = LazyFileLoadingDict(c["path"])
context[c["name"]] = lazy_dict.materialize()
else:
logger.warning(f"Unknown context type '{c['type']}'")

53 changes: 27 additions & 26 deletions netcheck/validation.py
@@ -2,10 +2,10 @@
import json
import logging
from typing import Dict

import celpy
import yaml
from celpy import CELParseError, CELEvalError, json_to_cel

from cel import cel


logger = logging.getLogger("netcheck.validation")

@@ -31,35 +31,36 @@ def evaluate_cel_with_context(context: Dict, validation_rule: str):
Raises:
ValueError: If the CEL expression is invalid.
"""

env = celpy.Environment()

# Validate the CEL validation rule and compile to ast
try:
ast = env.compile(validation_rule)
except CELParseError:
print("Invalid CEL expression. Treating as error.")
raise ValueError("Invalid CEL expression")

# create the CEL program
functions = {
"parse_json": lambda s: json_to_cel(json.loads(s)),
"parse_yaml": lambda s: json_to_cel(yaml.safe_load(s)),
"parse_json": lambda s: json.loads(s),
"parse_yaml": lambda s: yaml.safe_load(s),
"b64decode": lambda s: base64.b64decode(s).decode("utf-8"),
"b64encode": lambda s: base64.b64encode(s.encode()).decode(),
}
prgm = env.program(ast, functions=functions)

# Set up the context
activation = celpy.json_to_cel(context)
env = cel.Context(
variables=context,
functions=functions,
)

# Evaluate the CEL expression
try:
context = prgm.evaluate(activation)
except CELEvalError:
# Note this can fail if the context is missing a key e.g. the probe
# failed to return a value for a key that the validation rule expects

result = cel.evaluate(validation_rule, env)
except ValueError as e:
error_msg = str(e)
# Distinguish between parse errors (config bugs) and execution errors (validation failures)
if "Failed to parse" in error_msg:
# Parse/syntax errors indicate invalid CEL configuration - raise to surface
logger.error(f"Invalid CEL expression syntax: {e}")
raise ValueError(f"Invalid CEL expression: {e}") from e
else:
# Execution errors (type mismatches, etc.) indicate validation failure
# These can happen with valid expressions that fail at runtime
logger.debug(f"CEL execution failed: {e}")
return False
except RuntimeError as e:
# Runtime errors (undefined variables) indicate validation failure
# This can happen if the probe failed to return expected values
logger.debug(f"CEL evaluation failed: {e}")
return False

return context
return result