Hardware-aware inference platform with intelligent backend selection
Production-grade server for local AI inference with automatic hardware detection, optimal configuration, and robust process management.
- **Auto Hardware Detection** - Automatically detects CPUs, GPUs (NVIDIA/AMD/Intel), VRAM, and acceleration capabilities
- **Smart Backend Selection** - Intelligently selects the optimal backend based on available hardware
- **VRAM-Aware Configuration** - Calculates optimal GPU layer offloading based on available memory
- **Multi-Backend Support** - BitNet, LM Studio, and an extensible architecture
- **Robust Process Management** - Health checking, graceful shutdown, and port conflict resolution
- **Curated Model Library** - 8 pre-configured models with HuggingFace integration
- **100% Strong Typing** - 35+ enums, 20+ Pydantic models, zero magic strings
- **CLI Tools** - Built-in testing and administration commands
```bash
cd Server/
pip install -r requirements.txt

# Show system information
python cli.py info

# List available backends
python cli.py backends

# Test backend selection
python cli.py test bitnet_1.58 --size 3.5

# JSON output
python cli.py info --format json
```

```python
from core import (
    create_hardware_detector,
    BackendSelector,
    ModelLibrary,
    ModelType,  # needed for select_backend below
)

# Hardware detection
detector = create_hardware_detector()
hw_info = detector.get_hardware_info()
print(f"GPUs: {hw_info.nvidia_gpus}")

# Backend selection
selector = BackendSelector()
result = selector.select_backend(ModelType.BITNET_158, model_size_gb=3.5)
print(f"Selected: {result.backend}, ngl: {result.ngl}")

# Model library
library = ModelLibrary()
models = library.get_recommended_models()
for model in models:
    print(f"{model.name} - {model.size_gb}GB")
```

core/ - Types, enums, and configuration (foundation layer)
hardware/ - Hardware detection and backend selection
server_mgmt/ - Port allocation and process lifecycle
models/ - Model library and download management
backends/ - Inference backend implementations
See: docs/ARCHITECTURE.md for details
| OS | Status | Detection |
|---|---|---|
| Windows | Complete | WMI + nvidia-smi |
| Linux | Planned | lspci + nvidia-smi |
| macOS | Planned | system_profiler |
- CPUs - Name, cores, threads, clock speed
- NVIDIA GPUs - With VRAM via nvidia-smi
- AMD GPUs - Discrete/integrated classification
- Intel GPUs - Arc and integrated
- Acceleration - CUDA, Vulkan, ROCm, Metal, DirectML
- NPUs - AMD Ryzen AI (planned)
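The nvidia-smi-based VRAM detection mentioned above can be sketched as follows. This is an illustrative sketch, not the project's actual API: `parse_nvidia_smi` and `detect_nvidia_gpus` are hypothetical helper names.

```python
import subprocess

def parse_nvidia_smi(output: str) -> list:
    """Parse `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits`
    output into (name, vram_mib) pairs. Illustrative helper, not the project's API."""
    gpus = []
    for line in output.strip().splitlines():
        name, mem = (part.strip() for part in line.split(","))
        gpus.append((name, int(mem)))
    return gpus

def detect_nvidia_gpus() -> list:
    """Run nvidia-smi and return detected GPUs; empty list if no NVIDIA driver."""
    try:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=name,memory.total",
             "--format=csv,noheader,nounits"],
            text=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return []  # nvidia-smi missing or failed: treat as "no NVIDIA GPU"
    return parse_nvidia_smi(out)

print(parse_nvidia_smi("NVIDIA GeForce RTX 4090, 24564"))
```

Keeping the parsing separate from the subprocess call makes the parser testable on machines without a GPU.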
- Detect all available hardware
- Check acceleration capabilities (CUDA, Vulkan, etc)
- Parse GPU VRAM amounts
- Calculate optimal layer offloading (ngl)
- Select best backend with confidence score
```
available_vram = total_vram - 2 GB   # reserved for the system
if available_vram >= model_size:
    ngl = total_layers               # full GPU offload
else:
    ngl = int((available_vram / model_size) * total_layers * 0.9)
```

Example: a 7B model (5GB) on an 8GB GPU gets all 32 layers offloaded.
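The formula above as a minimal runnable sketch; `calc_ngl` is a hypothetical helper name, not the project's API:

```python
def calc_ngl(total_vram_gb: float, model_size_gb: float, total_layers: int,
             reserved_gb: float = 2.0) -> int:
    """VRAM-aware layer offload: full offload if the model fits after
    reserving memory for the system, otherwise a proportional share
    with a 10% safety margin."""
    available = total_vram_gb - reserved_gb
    if available >= model_size_gb:
        return total_layers  # full GPU offload
    return int((available / model_size_gb) * total_layers * 0.9)

# Worked example from above: 7B model (5 GB) on an 8 GB GPU
print(calc_ngl(8.0, 5.0, 32))  # -> 32 (full offload)
```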
- Llama 3.2 (1B, 3B) - Fast instruction models
- Phi-4 (14B) - Microsoft's reasoning model
- Qwen 2.5 Coder (7B) - Best coding model
- Qwen 2.5 (14B) - General purpose
- Gemma 2 (2B) - Google's efficient model
- BitNet 3B - 1.58-bit quantized
Each model includes:
- HuggingFace repository
- Available quantization variants
- Size, context length, use cases
- License information
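A library entry carrying that metadata might look like the following hypothetical sketch; the field names and values are assumptions for illustration, not the project's actual schema:

```json
{
  "qwen2.5-coder-7b": {
    "name": "Qwen 2.5 Coder (7B)",
    "repo": "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF",
    "quantizations": ["Q4_K_M", "Q8_0"],
    "size_gb": 4.7,
    "context_length": 32768,
    "use_cases": ["coding"],
    "license": "Apache-2.0"
  }
}
```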
- **35+ enums** defined for all constants
- **20+ Pydantic models** for data validation
- **Zero magic strings** - all strings are enum values
- **Zero magic numbers** - numeric constants live in enums
- **100% type coverage** - complete type hints
```python
# Strong typing everywhere
from core import BackendType, ServerType, ModelType, DefaultPort

backend = BackendType.BITNET_CPU      # Not "bitnet_cpu"
port = DefaultPort.BITNET_CPU.value   # Not 8081
model = ModelType.BITNET_158          # Not "bitnet_1.58"
```

- Architecture - System design and components
- API Reference - Complete API documentation
- Features Analysis - Implementation status and roadmap
- Project Structure - File organization
- BitNet Integration - BitNet backend details
**Adding a backend:**

1. Create a folder in `backends/`
2. Implement `manager.py` with the standard interface
3. Update the backend routing
4. Done!

**Adding a model:**

1. Edit `models/models_library.json`
2. Add an entry with metadata
3. Done!
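The "standard interface" a backend's `manager.py` implements is not spelled out here; one plausible shape, as a hedged sketch (the class and method names are assumptions, not the project's actual contract):

```python
from abc import ABC, abstractmethod

class BackendManager(ABC):
    """Hypothetical sketch of a backend manager interface covering the
    lifecycle the README describes: start, health check, graceful stop."""

    @abstractmethod
    def start(self, model_path: str, port: int) -> None:
        """Launch the inference process serving model_path on port."""

    @abstractmethod
    def health_check(self) -> bool:
        """Return True if the backend process is up and responding."""

    @abstractmethod
    def stop(self) -> None:
        """Shut the backend down gracefully."""
```

An abstract base class lets the backend routing treat every backend uniformly while each folder supplies its own implementation.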
```bash
# Run CLI tests
python cli.py info
python cli.py backends
python cli.py test bitnet_1.58 --size 3.5

# Python tests
python -m pytest tests/
```

- Python 3.9+
- `pydantic` - Data validation
- `requests` - HTTP client
- `torch` - For CUDA detection
- `wmi` - For Windows hardware detection (Windows only)
- `huggingface-hub` - For model downloads

```bash
pip install -r requirements.txt
```

| Configuration | First Token | Throughput | VRAM |
|---|---|---|---|
| BitNet GPU (3B) | 50ms | 45 tok/s | 4GB |
| CUDA (7B Q4) | 80ms | 35 tok/s | 6GB |
| CPU (7B Q4) | 500ms | 8 tok/s | 0GB |
Benchmark system: RTX 4090 + i9-12900K

- **Modular** - Clean separation of concerns
- **Typed** - 100% strong typing
- **Robust** - Comprehensive error handling
- **Documented** - Complete documentation
- **Extensible** - Easy to add features
- **Clean** - 0 lint errors
Apache 2.0 - See LICENSE file.