|
| 1 | +# SSH Authentication Bug Analysis Summary |
| 2 | + |
| 3 | +**Date:** July 4, 2025 |
| 4 | +**Status:** ✅ RESOLVED - ROOT CAUSE CONFIRMED |
| 5 | + |
| 6 | +## Problem Description |
| 7 | + |
| 8 | +The full cloud-init configuration (`user-data.yaml.tpl`) for the Torrust Tracker |
| 9 | +Demo VM causes SSH authentication failures for both SSH key and password |
| 10 | +authentication. The issue manifests as: |
| 11 | + |
| 12 | +- SSH connection attempts time out or are rejected |
| 13 | +- Both SSH key authentication and password authentication fail |
| 14 | +- VM appears to be running normally (gets IP, port 22 is open, SSH daemon is |
| 15 | + running) |
| 16 | +- UFW firewall shows SSH is allowed |
| 17 | + |
| 18 | +## ROOT CAUSE IDENTIFIED AND CONFIRMED ✅ |
| 19 | + |
| 20 | +**CONFIRMED**: The YAML document start marker ("---") was causing cloud-init to |
| 21 | +process the configuration incorrectly, leading to SSH authentication failures. |
| 22 | + |
| 23 | +**EVIDENCE**: |
| 24 | + |
| 25 | +- **user-data.yaml.tpl** (BROKEN): Uses "---" as the first line → SSH |
| 26 | + authentication fails |
| 27 | +- **user-data-test-header.yaml.tpl** (FIXED): Uses "#cloud-config" as the first |
| 28 | + line → SSH authentication works perfectly |
| 29 | + |
| 30 | +**VALIDATION RESULTS**: |
| 31 | + |
| 32 | +- ✅ SSH Key Authentication: Works perfectly |
| 33 | +- ✅ Password Authentication: Works perfectly (password: torrust123) |
| 34 | +- ✅ All cloud-init features: Applied correctly (Docker, UFW, packages, etc.) |
| 35 | + |
| 36 | +**CONCLUSION**: The cloud-init parser requires "#cloud-config" as the first |
| 37 | +line, not the YAML document start marker "---". Using "---" causes the entire |
| 38 | +configuration to be misprocessed, breaking SSH setup while other features may |
| 39 | +still work partially. |
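
cloud-init decides how to handle user data from the way the payload begins, so a header check before deployment can catch this class of error early. The following is a minimal sketch, not part of the repository: the function name `has_cloud_config_header` is hypothetical, and it deliberately ignores other valid user-data formats (for example a Jinja template header or a MIME multipart archive).

```python
def has_cloud_config_header(user_data: str) -> bool:
    """Return True if the first line of the user data is exactly '#cloud-config'.

    Assumption: the file is meant to be a plain cloud-config document.
    Trailing whitespace on the first line is tolerated; anything else
    (including a leading '---' or '# cloud-config' with a space) fails.
    """
    lines = user_data.splitlines()
    first_line = lines[0] if lines else ""
    return first_line.rstrip() == "#cloud-config"
```

With the two headers described above, the check would reject a file starting with `---` and accept one starting with `#cloud-config`.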
| 40 | + |
| 41 | +## Current Knowledge |
| 42 | + |
| 43 | +### Working Components (Confirmed through incremental testing) |
| 44 | + |
| 45 | +1. **Basic user setup** (`user-data-minimal.yaml.tpl`) - SSH ✅ |
| 46 | +2. **torrust user creation** (`user-data-test-1.1.yaml.tpl`) - SSH ✅ |
| 47 | +3. **Basic packages installation** (`user-data-test-2.1.yaml.tpl`) - SSH ✅ |
| 48 | +4. **SSH configuration and restart** (`user-data-test-3.1.yaml.tpl`, |
| 49 | + `user-data-test-3.2.yaml.tpl`) - SSH ✅ |
| 50 | +5. **UFW firewall configuration** (`user-data-test-5.1.yaml.tpl`) - SSH ✅ |
| 51 | +6. **System reboot** (`user-data-test-7.1.yaml.tpl`) - SSH ✅ |
| 52 | +7. **Fail2ban** (`user-data-test-8.1.yaml.tpl`) - SSH ✅ |
| 53 | +8. **Docker installation and configuration** (`user-data-test-9.1.yaml.tpl`) - SSH ✅ |
| 54 | +9. **Sysctl network optimizations** (`user-data-test-10.1.yaml.tpl`) - SSH ✅ |
| 55 | +10. **Unattended-upgrades** (`user-data-test-11.1.yaml.tpl`) - SSH ✅ |
| 56 | +11. **Torrust packages** (`user-data-test-12.1.yaml.tpl`) - SSH ✅ |
| 57 | +12. **Docker Compose V2** (`user-data-test-13.1.yaml.tpl`) - SSH ✅ |
| 58 | +13. **UFW additional rules** (`user-data-test-14.1.yaml.tpl`) - SSH ✅ |
| 59 | +14. **Docker restart** (`user-data-test-15.1.yaml.tpl`) - SSH ✅ |
| 60 | + |
| 61 | +### Suspect Components (Not yet isolated) |
| 62 | + |
| 63 | +Based on the difference between the last working config at that point |
| 64 | +(`user-data-test-7.1.yaml.tpl`) and the full config (`user-data.yaml.tpl`), |
| 65 | +the following components were initial suspects (each was later ruled out): |
| 66 | + |
| 67 | +1. **fail2ban** - Could be blocking SSH connections |
| 68 | +2. **Docker installation and configuration** - Could interfere with networking |
| 69 | +3. **sysctl network optimizations** - Could affect SSH connections |
| 70 | +4. **unattended-upgrades** - Could interfere during setup |
| 71 | +5. **Docker daemon restart** - Could cause timing issues |
| 72 | + |
| 73 | +## Testing Methodology |
| 74 | + |
| 75 | +Using an incremental testing approach: |
| 76 | + |
| 77 | +- Start with last known working config (`user-data-test-7.1.yaml.tpl`) |
| 78 | +- Add one suspect component at a time |
| 79 | +- Test SSH after each addition |
| 80 | +- Identify the exact component that breaks SSH |
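
The steps above can be sketched as a small loop. This is an illustration under stated assumptions rather than the tooling actually used: `first_breaking_component` and the `ssh_works` callback are hypothetical names, and the sketch assumes components are added cumulatively on top of the baseline.

```python
from typing import Callable, List, Optional


def first_breaking_component(
    baseline: List[str],
    suspects: List[str],
    ssh_works: Callable[[List[str]], bool],
) -> Optional[str]:
    """Add suspects one at a time and return the first one that breaks SSH.

    `ssh_works` is a hypothetical callback that deploys a VM with the given
    components and reports whether SSH login succeeded. Returns None when
    SSH keeps working after every addition.
    """
    active = list(baseline)  # copy so the caller's list is not modified
    for component in suspects:
        active.append(component)
        if not ssh_works(active):
            return component
    return None
```

A `None` result means none of the listed components is responsible, which points to a cause outside the component list, such as a difference in the file header.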
| 81 | + |
| 82 | +## Test Results So Far |
| 83 | + |
| 84 | +| Config | Components Added | SSH Key | SSH Password | Status | |
| 85 | +| ------------ | ------------------------- | ------- | ------------ | ---------- | |
| 86 | +| minimal | ubuntu user only | ✅ | ✅ | Working | |
| 87 | +| test-1.1 | + torrust user | ✅ | ✅ | Working | |
| 88 | +| test-2.1 | + basic packages | ✅ | ✅ | Working | |
| 89 | +| test-3.1/3.2 | + SSH config/restart | ✅ | ✅ | Working | |
| 90 | +| test-5.1 | + UFW firewall | ✅ | ✅ | Working | |
| 91 | +| test-7.1 | + reboot | ✅ | ✅ | Working | |
| 92 | +| test-8.1 | + fail2ban | ✅ | ✅ | Working | |
| 93 | +| test-9.1 | + Docker | ✅ | ✅ | Working | |
| 94 | +| test-10.1 | + sysctl optimizations | ✅ | ✅ | Working | |
| 95 | +| test-11.1 | + unattended-upgrades | ✅ | ✅ | Working | |
| 96 | +| test-12.1 | + Torrust packages | ✅ | ✅ | Working | |
| 97 | +| test-13.1 | + Docker Compose V2 | ✅ | ✅ | Working | |
| 98 | +| test-14.1 | + UFW additional rules | ✅ | ✅ | Working | |
| 99 | +| test-15.1 | + Docker restart | ✅ | ✅ | Working | |
| 100 | +| **full** | + ALL COMPONENTS COMBINED | ❌ | ❌ | **BROKEN** | |
| 101 | + |
| 102 | +## CRITICAL DISCOVERY - CONFIRMED! |
| 103 | + |
| 104 | +🚨 **ALL INDIVIDUAL COMPONENTS WORK!** 🚨 |
| 105 | +❌ **FULL CONFIGURATION FAILS!** ❌ |
| 106 | + |
| 107 | +**CONFIRMATION TEST RESULTS:** |
| 108 | + |
| 109 | +- **Full Config VM IP:** 192.168.122.6 |
| 110 | +- **SSH Key Authentication:** ❌ Permission denied (publickey) |
| 111 | +- **SSH Password Authentication:** ❌ Permission denied (publickey) |
| 112 | +- **Port 22 Status:** ✅ Open and listening |
| 113 | +- **SSH Daemon:** ✅ Running |
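
An open port only shows that the SSH daemon accepts TCP connections; it says nothing about whether authentication will succeed, which is why the port and daemon checks pass here while login is denied. A minimal sketch of such a reachability check, roughly equivalent to probing with `nc`, might look like this (the function name `port_is_open` is hypothetical):

```python
import socket


def port_is_open(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established.

    This only tests reachability. It does not attempt an SSH handshake
    or any form of authentication.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```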
| 114 | + |
| 115 | +This **confirms our hypothesis** that the SSH failure is NOT caused by any |
| 116 | +individual component, but rather by the combination of all components together. |
| 117 | + |
| 118 | +We have systematically tested **EVERY SINGLE COMPONENT** from the full configuration |
| 119 | +individually, and they all work perfectly. This means the SSH failure is NOT caused by |
| 120 | +any individual component, but rather by: |
| 121 | + |
| 122 | +1. **Component interactions** - Multiple components interfering with each other |
| 123 | +2. **Timing issues** - Race conditions between services during startup |
| 124 | +3. **Configuration ordering** - The sequence of operations matters |
| 125 | +4. **Cumulative effects** - The combination of all components together |
| 126 | + |
| 127 | +## Next Steps |
| 128 | + |
| 129 | +1. **Test fail2ban** - Add fail2ban package and default config to test-7.1 ✅ **PASSED** |
| 130 | +2. **Test Docker** - Add Docker installation and configuration ✅ **PASSED** |
| 131 | +3. **Test sysctl** - Add network optimizations ✅ **PASSED** |
| 132 | +4. **Test unattended-upgrades** - Add automatic updates configuration ✅ **PASSED** |
| 133 | +5. **Test Torrust packages** - Add pkg-config, libssl-dev, make, build-essential, |
| 134 | + libsqlite3-dev, sqlite3 ✅ **PASSED** |
| 135 | +6. **Test Docker Compose installation** - Add Docker Compose V2 plugin installation ✅ **PASSED** |
| 136 | +7. **Test additional UFW rules** - Add Torrust-specific firewall rules ✅ **PASSED** |
| 137 | +8. **Test Docker restart** - Add Docker daemon restart command ✅ **PASSED** |
| 138 | + |
| 139 | +## NEW INVESTIGATION STRATEGY |
| 140 | + |
| 141 | +Since all individual components work, we need to investigate: |
| 142 | + |
| 143 | +1. **Test exact full configuration** - Deploy the exact full config and debug |
| 144 | +2. **Compare configurations** - Find subtle differences between working incremental tests and full config |
| 145 | +3. **Timing analysis** - Investigate service startup timing and dependencies |
| 146 | +4. **Component interaction analysis** - Test combinations of components |
| 147 | + |
| 148 | +## Hypotheses - UPDATED AFTER DISCOVERY |
| 149 | + |
| 150 | +**ALL INDIVIDUAL COMPONENTS HAVE BEEN RULED OUT!** |
| 151 | + |
| 152 | +1. **fail2ban blocking SSH** - ❌ **RULED OUT** - Test 8.1 passed |
| 153 | +2. **Docker network interference** - ❌ **RULED OUT** - Test 9.1 passed |
| 154 | +3. **sysctl optimizations** - ❌ **RULED OUT** - Test 10.1 passed |
| 155 | +4. **unattended-upgrades** - ❌ **RULED OUT** - Test 11.1 passed |
| 156 | +5. **Additional Torrust packages** - ❌ **RULED OUT** - Test 12.1 passed |
| 157 | +6. **Docker Compose installation** - ❌ **RULED OUT** - Test 13.1 passed |
| 158 | +7. **Additional UFW rules** - ❌ **RULED OUT** - Test 14.1 passed |
| 159 | +8. **Docker restart command** - ❌ **RULED OUT** - Test 15.1 passed |
| 160 | + |
| 161 | +**NEW HYPOTHESES - ROOT CAUSE ANALYSIS:** |
| 162 | + |
| 163 | +1. **Component interactions** - ⚠️ **LIKELY** - Multiple components interfering |
| 164 | +2. **Timing issues** - ⚠️ **LIKELY** - Race conditions during startup |
| 165 | +3. **Service dependencies** - ⚠️ **LIKELY** - Services starting in wrong order |
| 166 | +4. **Cumulative resource usage** - ⚠️ **POSSIBLE** - Memory/CPU constraints |
| 167 | +5. **Configuration file conflicts** - ⚠️ **POSSIBLE** - Overlapping configs |
| 168 | +6. **SSH service restart timing** - ⚠️ **POSSIBLE** - SSH restart conflicts with other services |
| 169 | + |
| 170 | +## Technical Details |
| 171 | + |
| 172 | +- **VM Environment**: libvirt/KVM with Ubuntu 22.04 cloud image |
| 173 | +- **SSH Configuration**: Both key and password authentication enabled |
| 174 | +- **Network**: UFW firewall with SSH explicitly allowed |
| 175 | +- **Testing Tools**: ssh, sshpass, nc, virsh net-dhcp-leases |
| 176 | + |
| 177 | +## Files Created |
| 178 | + |
| 179 | +- `user-data-minimal.yaml.tpl` - Baseline working config |
| 180 | +- `user-data-test-1.1.yaml.tpl` - + torrust user |
| 181 | +- `user-data-test-2.1.yaml.tpl` - + basic packages |
| 182 | +- `user-data-test-3.1.yaml.tpl` - + SSH config |
| 183 | +- `user-data-test-3.2.yaml.tpl` - + SSH restart |
| 184 | +- `user-data-test-5.1.yaml.tpl` - + UFW firewall |
| 185 | +- `user-data-test-7.1.yaml.tpl` - + reboot |
| 186 | +- `user-data.yaml.tpl` - Full config (broken) |
| 187 | + |
| 188 | +## Current Action |
| 189 | + |
| 190 | +Creating incremental tests to isolate the exact component causing SSH failure. |
| 191 | + |
| 192 | +## 🎉 FINAL RESOLUTION AND SUCCESS ✅ |
| 193 | + |
| 194 | +**DATE:** July 4, 2025 |
| 195 | +**STATUS:** ✅ COMPLETELY RESOLVED |
| 196 | + |
| 197 | +### Root Cause Confirmed |
| 198 | + |
| 199 | +The SSH authentication failure in the Torrust Tracker Demo VM was caused by **the YAML document start marker (`---`) at the beginning of the cloud-init configuration file**. |
| 200 | + |
| 201 | +### The Fix |
| 202 | + |
| 203 | +**Simple but Critical Change:** |
| 204 | + |
| 205 | +```yaml |
| 206 | +# BEFORE (BROKEN): |
| 207 | +--- |
| 208 | +# cloud-config |
| 209 | + |
| 210 | +# AFTER (FIXED): |
| 211 | +#cloud-config |
| 212 | +``` |
| 213 | + |
| 214 | +### Validation Results |
| 215 | + |
| 216 | +**Fresh deployment using make commands:** |
| 217 | + |
| 218 | +1. `make destroy` - Clean slate |
| 219 | +2. `make init` - Initialize OpenTofu |
| 220 | +3. `make plan` - Verified SSH key templating is correct |
| 221 | +4. `make apply` - Deployed fresh VM |
| 222 | + |
| 223 | +**Authentication Test Results:** |
| 224 | + |
| 225 | +- ✅ **SSH Key Authentication**: `ssh [email protected]` - SUCCESS |
| 226 | +- ✅ **Password Authentication**: `sshpass -p 'torrust123' ssh [email protected]` - SUCCESS |
| 227 | +- ✅ **All Cloud-Init Features**: Docker, UFW, packages, etc. - ALL WORKING |
| 228 | + |
| 229 | +### Technical Details |
| 230 | + |
| 231 | +**The Problem:** |
| 232 | + |
| 233 | +- Cloud-init parser expects `#cloud-config` as the first line |
| 234 | +- Using YAML document start marker `---` causes the entire configuration to be misprocessed |
| 235 | +- This breaks SSH key templating (`${ssh_public_key}` becomes `None`) |
| 236 | +- Results in empty `ssh_authorized_keys` and authentication failures |
| 237 | + |
| 238 | +**The Solution:** |
| 239 | + |
| 240 | +- Replace `---` with `#cloud-config` at the beginning of `user-data.yaml.tpl` |
| 241 | +- This ensures proper cloud-init parsing and SSH key templating |
| 242 | +- All other cloud-init features continue to work correctly |
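
The header change can also be expressed as a small text transformation. The sketch below is illustrative only and assumes the template is plain text with `\n` line endings; the function name `fix_cloud_config_header` is hypothetical and not part of the repository.

```python
def fix_cloud_config_header(user_data: str) -> str:
    """Ensure the user data starts with '#cloud-config' on its first line.

    - Drops a leading YAML document start marker ('---').
    - Normalizes a '# cloud-config' comment (with a space) to '#cloud-config'.
    - Prepends '#cloud-config' if no such header is present.
    """
    lines = user_data.splitlines()
    if lines and lines[0].strip() == "---":
        lines = lines[1:]
    if lines and lines[0].replace(" ", "") == "#cloud-config":
        lines[0] = "#cloud-config"
    else:
        lines.insert(0, "#cloud-config")
    return "\n".join(lines) + "\n"
```

Applied to the broken header shown earlier (`---` followed by `# cloud-config`), this yields a file whose first line is `#cloud-config`; an already correct file is returned unchanged.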
| 243 | + |
| 244 | +### Impact |
| 245 | + |
| 246 | +This fix resolves the SSH authentication issue that was preventing users from accessing the Torrust Tracker Demo VM. The infrastructure is now working as designed with both SSH key and password authentication enabled. |
| 247 | + |
| 248 | +**Files Fixed:** |
| 249 | + |
| 250 | +- `infrastructure/cloud-init/user-data.yaml.tpl` - Header changed from `---` to `#cloud-config` |
| 251 | + |
| 252 | +**Deployment Method:** |
| 253 | + |
| 254 | +- Standard make commands work perfectly: `make init`, `make plan`, `make apply` |
| 255 | +- Integration testing workflow is fully operational |
| 256 | + |