@developerisnow developerisnow commented Nov 12, 2025

Problem

Strongly recommended: start by reading issue #25.
When running CodeMachine on remote servers (VPS, SSH), users encounter two issues:

  1. Process termination on SSH disconnect
  2. UI disappearance after reconnect (confusing: it looks like a crash)

These issues make long-running workflows on remote servers problematic.

Solution

This PR adds pragmatic deployment utilities with clear observability:

📦 What's Included

Monitoring Scripts (scripts/deployment/):

  • codemachine-persistent.sh - Launch with SSH disconnect survival (nohup)
  • codemachine-status.sh - Enhanced monitoring showing:
    • Running/completed/failed agents
    • Recent log activity
    • Process status and uptime
  • codemachine-stop.sh - Graceful shutdown
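
As a rough illustration of the nohup-based launch described above, here is a minimal sketch. It is not the PR's actual codemachine-persistent.sh: the `start_persistent` function, the PID file name, and the `sleep` stand-in command are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: NOT the PR's actual codemachine-persistent.sh.
set -u

# start_persistent LOG_FILE PID_FILE COMMAND [ARGS...]
# Runs COMMAND detached from the terminal so the SIGHUP sent on SSH
# disconnect does not terminate it, and records its PID for later checks.
start_persistent() {
  local log_file=$1 pid_file=$2
  shift 2
  # nohup makes the command ignore SIGHUP. Redirecting stdin, stdout and
  # stderr means the process keeps no handle on the SSH session's TTY.
  nohup "$@" >"$log_file" 2>&1 </dev/null &
  echo "$!" >"$pid_file"
}

# Demo with a harmless stand-in command (assumption: a real launcher
# would start the codemachine CLI here instead of sleep).
demo_dir=$(mktemp -d)
start_persistent "$demo_dir/persistent-output.log" "$demo_dir/codemachine.pid" sleep 30
echo "Started PID $(cat "$demo_dir/codemachine.pid")"
```

Writing the PID to a file is one simple way to let separate status and stop scripts find the process after a reconnect.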

Documentation:

  • docs/vps-deployment.md - Complete deployment guide
  • scripts/deployment/README.md - Script usage and troubleshooting
  • README.md - Quick start section

✅ Key Features

  1. SSH Disconnect Survival: Process continues running using nohup
  2. Clear Status Visibility: Know if workflow is actually working
  3. Understanding UI Detachment: Documentation explaining this is expected (Ink framework limitation)
  4. Production-Ready: Generic scripts, no hardcoded paths

🎯 Value Proposition

Before:

  • Users think CodeMachine crashed (UI gone = broken?)
  • No way to verify process is working
  • Manual nohup setup needed

After:

  • Clear status: "Process running, UI detached (expected)"
  • One command to check: ./scripts/deployment/codemachine-status.sh
  • Ready-to-use scripts with documentation
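
The kind of check behind such a status command can be sketched as follows. This is an illustration only, not the PR's codemachine-status.sh: the function names `process_status` and `recent_logs` and the PID file layout are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: NOT the PR's actual codemachine-status.sh.

# process_status PID_FILE
# Prints a RUNNING line when the recorded process is alive, otherwise
# a STOPPED line. Returns 0 for running, 1 for stopped.
process_status() {
  local pid_file=$1 pid
  if [ -r "$pid_file" ]; then
    pid=$(cat "$pid_file")
    # kill -0 delivers no signal; it only tests that the PID exists
    # and that we are permitted to signal it.
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
      echo "Process: RUNNING (PID: $pid), UI detached (expected)"
      return 0
    fi
  fi
  echo "Process: STOPPED"
  return 1
}

# recent_logs LOG_DIR [COUNT]
# Lists the most recently modified files in LOG_DIR, newest first.
recent_logs() {
  local log_dir=$1 count=${2:-5}
  ls -t "$log_dir" 2>/dev/null | head -n "$count"
}

# Demo: use this shell's own PID as a stand-in for the workflow process.
demo_dir=$(mktemp -d)
echo "$$" >"$demo_dir/codemachine.pid"
process_status "$demo_dir/codemachine.pid"
recent_logs "$demo_dir"
```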

📊 Testing

Tested on Ubuntu 20.04 VPS:

  • ✅ Process survives SSH disconnect (60+ minutes)
  • ✅ Status script shows accurate agent progress
  • ✅ Logs update correctly in headless mode
  • ✅ Graceful shutdown works
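
For readers unfamiliar with the pattern, a graceful shutdown usually means sending SIGTERM first and escalating to SIGKILL only after a timeout. The sketch below shows that pattern under stated assumptions: `stop_gracefully` and its default timeout are hypothetical, and the PR's codemachine-stop.sh may work differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: NOT the PR's actual codemachine-stop.sh.

# stop_gracefully PID [TIMEOUT_SECONDS]
# Sends SIGTERM and waits up to TIMEOUT_SECONDS for the process to exit.
# Returns 0 if it exited on its own (or was already gone), and 1 if it
# had to be killed with SIGKILL.
stop_gracefully() {
  local pid=$1 timeout=${2:-10} waited=0
  kill -0 "$pid" 2>/dev/null || return 0   # nothing to stop
  kill -TERM "$pid" 2>/dev/null
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      kill -KILL "$pid" 2>/dev/null         # last resort
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Demo with a stand-in background process.
sleep 60 &
stop_gracefully "$!" 5 && echo "stopped cleanly"
```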

🔗 Related

Closes #25

📝 Technical Notes

Why UI detaches:

  • CodeMachine uses Ink (React for CLI)
  • Ink requires a persistent TTY connection
  • SSH reconnect = new TTY = UI can't reconnect
  • This is an architectural limitation, not a bug
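
To make the TTY dependency concrete, a wrapper script can test whether it is attached to a terminal before deciding how to launch. This is a minimal sketch with a hypothetical `has_tty` helper; it is not part of the PR and says nothing about how Ink itself detects a terminal.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: illustrates TTY detection in a wrapper script only.

# has_tty
# Succeeds only when both stdin and stdout are attached to a terminal,
# which is the situation an interactive terminal UI expects.
has_tty() {
  [ -t 0 ] && [ -t 1 ]
}

if has_tty; then
  echo "Terminal attached: interactive UI can render"
else
  echo "No terminal attached: run headless and follow the logs"
fi
```

A process started under nohup with redirected output falls into the second branch, which is why monitoring goes through logs and a status script rather than the UI.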

Our approach:

  • Don't fix the UI (would require major refactoring)
  • Provide observability tools
  • Document expected behavior clearly

🚀 Usage Example

# Start persistently
./scripts/deployment/codemachine-persistent.sh

# Close laptop, go to coffee shop, reconnect

# Check status
./scripts/deployment/codemachine-status.sh

Output:

✅ Process: RUNNING (PID: 123456)
   Uptime: 02:34:56

📊 Agent Status:
   🔄 code-generation-step-9: Code Generation Agent
   ✅ context-manager-step-8: Context Manager Agent

📝 Recently Updated Files (last 5):
   Nov 12 07:04: persistent-output.log
   Nov 12 07:04: agent-67-context-manager.log

Impact: Makes CodeMachine viable for VPS/remote deployment with clear observability.

Approach: Pragmatic - provide tools and documentation, not architectural changes.

moazbuilds and others added 30 commits October 10, 2025 03:05
updating definition
…d executions

The profile parameter was redundant as the agentId is sufficient for authentication purposes. This simplifies the API surface and reduces unnecessary configuration options across all engine providers (Claude, Cursor, Codex) and their execution flows.

Fixing json codex starting bug "can't find 'git-commit' profile"
…chitecture output paths

- Split tracking functions into template.ts and steps.ts for better organization
- Add new functions for tracking not completed steps
- Update all references from .codemachine/plan/ to .codemachine/artifacts/
- Add new output format templates for architecture and plan generation
- Implement glob pattern support in placeholder resolution
implement fallback agent execution when a step is marked as incomplete
add notCompletedFallback field to workflow step types
introduce fallback execution logic before original step runs
- Add plan-fallback and task-fallback agents to handle incomplete executions
- Increase default timeout from 10 to 30 minutes across all providers
- Refactor prompt placeholder processing into modular components
- Implement structured task output with manifest files
- Add support for optional placeholders with fallback behavior
Update placeholder configuration and template files to use consistent naming for task fallback files and variables. This ensures proper file resolution and variable interpolation in the task breakdown process.
Add markdown templates for codemachine workflows, agents, output formats, and fallback agents. These templates define the structure and behavior for various components in the codemachine system including task verification, git commit workflows, and agent-specific instructions.

The templates provide consistent formatting and guidelines for:
- Workflow definitions
- Agent roles and responsibilities
- Output format specifications
- Fallback agent behavior
…uration

- Standardize agent prompt file naming and paths
- Add context manager agent to workflow
- Update task verification workflow with detailed signaling
- Consolidate output format placeholders
- Adjust workflow loop steps for new context manager
refactor(validation): update task verification to use behavior.json
docs: update agent prompts to use context and code_fallback placeholders
build: add new placeholder paths for context and code_fallback files
…tructions

- Clarify the issues detected section with more specific error details
- Improve fix instructions with concrete action steps
- Update success workflow to use task list file instead of direct JSON output
Enable resuming long workflows by starting from the last
incomplete step tracked in .codemachine/template.json when
resumeFromLastStep is true.

Add getResumeStartIndex to compute the starting index based on
notCompletedSteps. Update workflow runner to begin iteration from
the computed index and log a resume notice. Expose the helper via
shared workflows index.

This improves reliability after interruptions without breaking
existing behavior (feature disabled by default).
…ete step

- Change resume behavior to use first incomplete step instead of last
- Add default values for tracking fields in template and step functions
- Add reasoning level validation and spinner logging to step command
- Bump package version to 0.3.0
- Replace 'plan' directory with 'artifacts' in workspace initialization
- Add promptPath validation for fallback agents
- Improve tool call handling in cursor engine runner
- Update workflow templates to use cursor engine by default
- Refactor loop behavior to use behavior.json instead of triggers
- Update README with new installation instructions and features
- Convert synchronous loop handling functions to async/await pattern
- Add detailed error logging for template tracking operations
- Improve file path handling with path.join() for cross-platform compatibility
- Enhance documentation for timestamp format in tracking interfaces
- Update tests to handle async loop evaluation
- Add update-notifier package to check for updates daily
- Include dim color in chalk palette for CLI styling
- Update version from 0.1.0 to 0.3.0
- Enhance error messages with resume workflow instructions
- Add comprehensive CONTRIBUTING.md with contribution guidelines and workflow documentation
…ings

- Rename architecture agent to standard naming convention
- Streamline workflow engine configurations by removing redundant model specifications
Update README with improved formatting, updated statistics, and added comparison table
Remove docs directory from gitignore as documentation is now tracked
Update documentation templates to reflect the new plan artifacts directory path from `.codemachine/artifacts/` to `.codemachine/artifacts/plan/` for consistency
…oopTrigger

remove outdated engine authentication by removing profile-specific configuration
Remove deprecated loopTrigger from module metadata and workflow configs
- Simplify start command to run workflow queue directly in headless mode
- improve DIM and update npm notifier
- make CLI path in main menu dynamic
…tion

- Introduce trigger behavior type and evaluation logic
- Implement trigger controller and execution handling
- Add iteration-checker module with trigger capability
- Support dynamic agent triggering during workflow execution
moazbuilds and others added 24 commits November 5, 2025 18:47
The previous type assertion was incorrect and could cause runtime errors. Using NodeJS.EventEmitter ensures proper type safety when emitting events.
Add more specific patterns to exclude from version control including cache files, logs and temporary files
Update workflow steps to use 'cursor' engine with 'grok' model for
plan-agent, task-breakdown, code-generation, runtime-prep, task-sanity-check,
and git-commit steps. Change context-manager to use 'ccr' engine.
Remove engine and fallback from plan-agent.
…ption

Proper-lockfile requires the target file to exist. Added file creation logic and changed error handling to throw instead of degrade silently to prevent data corruption from concurrent access.
Replace async file operations with sync ones during registry file creation to ensure atomicity
Add validation for empty/invalid registry files to handle edge cases
Switch engine assignments for various steps to improve consistency and reliability
   - Add loopReason field to display task progress (e.g., "Cycle 2 - Task 5/50")
   - Update task verification workflow to report progress as "Task X/Y"
   - Pass loop reason from workflow execution to UI state
   - Fix bug where loopReason was cleared during agent reset
…d rules

Expand the anchor insertion documentation with specific formatting requirements, placement rules, and workflow instructions. Add verification steps and example to ensure compliance.
- Increment index before accessing next argument in publish script for clarity
- Replace tseslint.config with defineConfig for better type safety in eslint config
Integrate OpenCode CLI as a native engine provider alongside Claude, Codex, Cursor, and CCR. This enables users to run `codemachine opencode run` and use `--engine opencode` in workflow steps.

Key features:
- JSON event streaming with consistent log markers
- Non-interactive execution via environment guardrails (OPENCODE_PERMISSION, OPENCODE_DISABLE_LSP_DOWNLOAD)
- Registry-based provider pattern matching existing engines
- Comprehensive documentation updates across README, CLI reference, and workflow guides

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Refactor auth system to use XDG-compliant paths under OPENCODE_HOME
- Add support for multiple authentication providers in OpenCode CLI
- Improve user guidance during auth flow with interactive prompts
- Centralize all OpenCode data under ~/.codemachine/opencode
- Update README to credit contributor for OpenCode integration
Add process group killing for Unix systems to handle Node.js wrapper scripts
Improve logging and error handling for process termination
Handle race conditions between abort signals and process events
feat: add OpenCode CLI as first-class engine integration
Move telemetry parsing logic into dedicated provider modules for better maintainability
Improve telemetry capture by handling provider-specific formats consistently
Ensure environment variables from process.env are included when resolving runner environment to prevent missing variables
Update CCR engine name to 'Claude Code Router' and Claude engine name to 'Claude Code' to better reflect their purpose and functionality
Remove special logout handling for CCR in CLI command
Replace .credentials.json with simple .enable marker file
Update related auth functions to reflect new simpler approach
- Replace 'any' with 'unknown' and add proper type guards
- Update eslint config to use tseslint.config
- Rename CCR engine display name to 'Claude Code Router'
Add deployment utilities for running CodeMachine on remote servers:

Scripts:
- codemachine-persistent.sh: Launch with SSH disconnect survival (nohup)
- codemachine-status.sh: Enhanced monitoring (agents, logs, errors)
- codemachine-stop.sh: Graceful shutdown with cleanup

Documentation:
- docs/vps-deployment.md: Complete VPS deployment guide
- scripts/deployment/README.md: Script usage and troubleshooting
- README.md: Quick start section for remote servers

Addresses common issues:
- Process termination on SSH disconnect
- UI detachment after reconnect (Ink framework limitation)
- Monitoring headless workflows

Closes moazbuilds#25
@moazbuilds
Owner

Thanks for adding the VPS deployment scripts. We're planning to move to the Bun runtime and make some major adjustments. Once that transition is complete, we can revisit this PR as a strong foundation for long-running VPS deployments.

@moazbuilds moazbuilds removed their request for review November 13, 2025 20:38
@moazbuilds moazbuilds added the enhancement New feature or request label Nov 18, 2025
Development

Successfully merging this pull request may close these issues.

codemachine crashes on SSH disconnect on Ubuntu VPS