|
| 1 | +# SSH Authentication Failure Bug - #001 |
| 2 | + |
| 3 | +**Date Resolved:** July 4, 2025 |
| 4 | +**Status:** ✅ Resolved |
| 5 | +**Impact:** High - Blocked VM access completely |
| 6 | +**Root Cause:** YAML document start marker (`---`) breaking cloud-init parsing |
| 7 | + |
| 8 | +## Problem Summary |
| 9 | + |
| 10 | +The full cloud-init configuration (`user-data.yaml.tpl`) for the Torrust Tracker |
| 11 | +Demo VM was causing SSH authentication failures for both SSH key and password |
| 12 | +authentication, preventing users from accessing deployed VMs. |
| 13 | + |
| 14 | +## Root Cause |
| 15 | + |
| 16 | +The issue was caused by using the YAML document start marker (`---`) at the |
| 17 | +beginning of the cloud-init configuration file instead of the required |
| 18 | +`#cloud-config` header. This caused cloud-init to misprocess the entire |
| 19 | +configuration, resulting in: |
| 20 | + |
| 21 | +- Empty SSH authorized_keys (SSH key variable not templated) |
| 22 | +- Broken password authentication setup |
| 23 | +- Schema validation errors in cloud-init |
| 24 | + |
| 25 | +## The Fix |
| 26 | + |
| 27 | +**Simple but Critical Change:** |
| 28 | + |
| 29 | +```yaml |
| 30 | +# BEFORE (BROKEN): |
| 31 | +--- |
| 32 | +# cloud-config |
| 33 | + |
| 34 | +# AFTER (FIXED): |
| 35 | +#cloud-config |
| 36 | +``` |
| 37 | + |
| 38 | +**File Changed:** `infrastructure/cloud-init/user-data.yaml.tpl` |
| 39 | + |
| 40 | +## Investigation Process |
| 41 | + |
| 42 | +This bug was resolved through systematic incremental testing: |
| 43 | + |
| 44 | +1. **Incremental Testing**: Created 15+ test configurations, adding features one by one |
| 45 | +2. **Root Cause Isolation**: Compared working vs. broken configurations using diff analysis |
| 46 | +3. **Hypothesis Formation**: Identified YAML header as the key difference |
| 47 | +4. **Validation**: Deployed fresh VM with corrected header and confirmed fix |
| 48 | + |
| 49 | +## Validation Results |
| 50 | + |
| 51 | +After applying the fix: |
| 52 | + |
| 53 | +- ✅ SSH Key Authentication: Works perfectly |
| 54 | +- ✅ Password Authentication: Works perfectly |
| 55 | +- ✅ All Cloud-Init Features: Docker, UFW, packages, etc. - ALL WORKING |
| 56 | +- ✅ Integration Tests: Complete test suite passes |
| 57 | +- ✅ Make Commands: Standard workflow (`make init`, `make plan`, `make apply`) works |
| 58 | + |
| 59 | +## Files in This Directory |
| 60 | + |
| 61 | +### Core Documentation |
| 62 | + |
| 63 | +- `SSH_BUG_ANALYSIS.md` - Initial analysis and hypothesis formation |
| 64 | +- `SSH_BUG_SUMMARY.md` - Complete investigation summary with detailed timeline |
| 65 | + |
| 66 | +### Test Artifacts |
| 67 | + |
| 68 | +- `test-configs/` - All 16 test configurations used during incremental testing |
| 69 | + - `user-data-test-1.1.yaml.tpl` through `user-data-test-15.1.yaml.tpl` |
| 70 | + - `user-data-test-header.yaml.tpl` - Final test that confirmed the fix |
| 71 | + |
| 72 | +### Validation |
| 73 | + |
| 74 | +- `validation/` - (Currently empty, reserved for future validation scripts) |
| 75 | + |
| 76 | +## Lessons Learned |
| 77 | + |
| 78 | +1. **Cloud-init requires specific headers**: `#cloud-config` is mandatory, not `---` |
| 79 | +2. **Incremental testing is powerful**: Systematic approach isolated the issue effectively |
| 80 | +3. **Template variable validation**: Always verify that template variables are being substituted correctly |
| 81 | +4. **Integration testing is crucial**: End-to-end testing revealed the full scope of the issue |
| 82 | + |
| 83 | +## Prevention |
| 84 | + |
| 85 | +To prevent similar issues: |
| 86 | + |
| 87 | +- Always use `#cloud-config` as the first line in cloud-init files |
| 88 | +- Test template variable substitution in terraform plans |
| 89 | +- Run integration tests after any cloud-init configuration changes |
| 90 | +- Use the documented make workflow for deployments |
| 91 | + |
| 92 | +## Related Issues |
| 93 | + |
| 94 | +This fix resolves SSH access problems that were preventing users from following |
| 95 | +the integration testing guide and deploying the Torrust Tracker Demo |
| 96 | +successfully. |
| 97 | + |
| 98 | +## Technical Details |
| 99 | + |
| 100 | +For complete technical details, debugging methodology, and step-by-step |
| 101 | +investigation process, see: |
| 102 | + |
| 103 | +- [SSH_BUG_ANALYSIS.md](SSH_BUG_ANALYSIS.md) - Initial investigation |
| 104 | +- [SSH_BUG_SUMMARY.md](SSH_BUG_SUMMARY.md) - Comprehensive analysis with timeline |
0 commit comments