A high-performance computer vision system that performs real-time image segmentation using K-Means clustering with multiple parallel backends for optimal performance across different hardware configurations.
 Real-time parallel K-means clustering: original vs segmented frames with dynamic K-value control via slider
Real-time parallel K-means clustering: original vs segmented frames with dynamic K-value control via slider
- ๐ Real-Time Performance: Up to 55+ FPS with CUDA backend on live webcam feeds
- ๐ง Multiple Parallel Backends: Sequential, Multi-threaded, MPI + OpenMP, and CUDA implementations
- ๐ณ RCC Tree Optimization: Recursive Cached Coreset tree for efficient streaming segmentation
- ๐ฏ 5D Feature Space: Combines color (BGR) and spatial (x,y) features for coherent segmentation
- โก Dynamic Backend Switching: Switch between backends in real-time with keyboard shortcuts
- ๐ Performance Monitoring: Live FPS tracking with min/max statistics
- ๐ผ๏ธ Interactive Controls: Adjustable K-value slider and side-by-side visualization
This project implements an advanced K-Means clustering system optimized for real-time image segmentation with multiple parallelization strategies:
- Core Algorithm: K-Means clustering adapted for image segmentation
- Feature Engineering: 5D vectors combining color similarity and spatial proximity
- Coreset Sampling: Reduces computational complexity from O(nยทkยทt) to O(sยทkยทt), where n = total pixels, s = coreset size (s โช n)
- 
RCC Tree Structure: Maintains $(1 \pm \epsilon)$ -approximation with bounded memory
- Hardware Optimization: Leverages multi-core CPUs, distributed systems, and GPUs
| Backend | K=2 (Min FPS) | K=2 (Max FPS) | K=12 (Min FPS) | K=12 (Max FPS) | Performance Ratio | 
|---|---|---|---|---|---|
| Sequential | 15 | 17 | 5 | 6 | 1.0ร (baseline) | 
| Multi-threaded | 14 | 44 | 10 | 22 | 2.4ร average | 
| MPI | 17 | 44 | 13 | 21 | 2.6ร average | 
| CUDA | 14 | 55 | 15 | 44 | 3.2ร average | 
Performance Improvement Factor (vs Sequential):
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ CUDA:          โโโโโโโโโโโโโโโโ 3.2ร            โ
โ MPI:           โโโโโโโโโโโโโโ 2.6ร              โ
โ Multi-thread:  โโโโโโโโโโโโ 2.4ร                โ
โ Sequential:    โโโโ 1.0ร (baseline)             โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
| Component | Complexity | Notes | 
|---|---|---|
| Sequential K-Means | Baseline implementation | |
| Coreset K-Means | Reduced complexity with | |
| RCC Tree Insertion | Streaming update per frame | |
| RCC Tree Merging | Weighted merge operation | |
| GPU Assignment | Massive parallelization | 
- Strategy: Row-based work distribution across CPU threads
- Synchronization: Lock-free design with barrier synchronization
- Memory Safety: Read-only sharing with exclusive write regions
- Best For: Multi-core desktop systems
- Strategy: Master-Worker pattern combining process-level (MPI) distribution of rows with thread-level (OpenMP) parallelization within each process.
- Communication: Uses MPI_Bcast to distribute cluster centers, frame data, and control parameters (k, dimensions, stop flag) and MPI_Gatherv for efficient, variable-sized result aggregation.
- Hybrid Approach: MPI collective calls act as synchronization barriers, ensuring data consistency across the network.
- Best For: HPC clusters and large distributed systems
- Strategy: One thread per pixel for maximum throughput
- Memory Management: Efficient host-device transfers
- Synchronization: cudaDeviceSynchronize() for completion barriers
- Best For: High-throughput applications with GPU hardware
# System Requirements
- C++17 compatible compiler (GCC/MSVC/Clang)
- CMake 3.18+
- OpenCV 4.0+
- CUDA Toolkit (for GPU backend)
- MPI implementation (e.g., OpenMPI/MS-MPI for distributed backend)# Clone the repository
git clone https://github.com/dosqas/realtime-parallel-kmeans-segmentation.git
cd realtime-parallel-kmeans-segmentation
# Create build directory
mkdir build && cd build
# Configure with CMake
cmake ..
# Build the project
cmake --build . --config Release
# Change to the output directory
cd out/build/x64-Debug
# Run the application ()
mpiexec -n <number_of_processes> realtime_parallel_kmeans_segmentation.exe| Key | Action | 
|---|---|
| '1' | Switch to Sequential backend | 
| '2' | Switch to Multi-threaded backend | 
| '3' | Switch to MPI backend | 
| '4' | Switch to CUDA backend | 
| ESC | Exit application | 
| K Slider | Adjust cluster count (2-12) | 
// Default configuration values
const int k_min = 1;           // Minimum clusters
const int k_max = 12;          // Maximum clusters
const int sample_size = 2000;  // Coreset size
const float color_scale = 1.0f;   // Color feature scaling
const float spatial_scale = 0.5f; // Spatial feature scalingconst int max_levels = 8;      // Maximum tree height
const int default_sample = 2000; // Default coreset size| FPS Range | Classification | User Experience | Recommended Backends | 
|---|---|---|---|
| 30+ FPS | Excellent | Smooth real-time | CUDA, Multi-threaded, MPI Hybrid (Kโค8) | 
| 20-30 FPS | Good | Acceptable real-time | Multi-threaded, MPI (Kโค6) | 
| 10-20 FPS | Fair | Noticeable lag | All backends (Kโค4) | 
| <10 FPS | Poor | Choppy playback | Sequential only (high K) | 
- Real-time video segmentation for streaming applications
- Live broadcast effects and background replacement
- Content creation and video editing workflows
- Baseline implementation for segmentation algorithms
- Performance benchmarking across different hardware
- Educational demonstrations of parallel computing concepts
- Real-time analysis of medical imagery
- Interactive segmentation for diagnostic applications
- High-throughput batch processing of medical data
- Real-time augmented reality applications
- Interactive art installations
- Gaming and entertainment systems
// In clustering_backends.hpp
enum Backend { 
    BACKEND_SEQ = 0, 
    BACKEND_CUDA = 1, 
    BACKEND_THR = 2, 
    BACKEND_MPI = 3,
    BACKEND_CUSTOM = 4  // Your custom backend
};
// Implement your backend function
cv::Mat segmentFrameWithKMeans_custom(
    const cv::Mat& frame, int k, int sample_size,
    float color_scale, float spatial_scale);// Adjust coreset sampling for speed vs quality trade-off
const int fast_sample_size = 1000;    // Faster, lower quality
const int quality_sample_size = 5000; // Slower, higher quality
// Modify feature scaling for different segmentation characteristics
const float color_emphasis = 2.0f;    // Emphasize color similarity
const float spatial_emphasis = 0.1f;  // De-emphasize spatial proximityrealtime-parallel-kmeans-segmentation/
โโโ ๐ include/                      # Header files
โ   โโโ clustering.hpp               # Main clustering interface
โ   โโโ clustering_backends.hpp      # Backend implementations
โ   โโโ coreset.hpp                  # Coreset data structures
โ   โโโ rcc.hpp                      # RCC tree implementation
โ   โโโ utils.hpp                    # Utility functions
โ   โโโ video_io.hpp                 # Video I/O interface
โโโ ๐ src/                          # Source files
โ   โโโ ๐ clustering/               # Backend implementations
โ   โ   โโโ clustering_cuda.cu       # CUDA GPU backend
โ   โ   โโโ clustering_entry.cpp     # Backend dispatcher
โ   โ   โโโ clustering_mpi.cpp       # MPI distributed backend
โ   โ   โโโ clustering_seq.cpp       # Sequential CPU backend
โ   โ   โโโ clustering_thr.cpp       # Multi-threaded backend
โ   โโโ coreset.cpp                  # Coreset algorithms
โ   โโโ main.cpp                     # Application entry point
โ   โโโ rcc.cpp                      # RCC tree implementation
โ   โโโ utils.cpp                    # Utility functions
โ   โโโ video_io.cpp                 # Video I/O implementation
โโโ ๐ docs/                         # Documentation
โ   โโโ project__demo.gif            # Program demonstration GIF
โโโ ๐ docs/                         # Documentation
โ   โโโ algorithms.md                # Algorithm descriptions
โ   โโโ parallelization.md           # Synchronization details
โ   โโโ performance.md               # Performance analysis
โโโ ๐ tests/                        # Test files
โ   โโโ test_clustering.cpp          # Clustering tests
โ   โโโ test_coreset.cpp             # Coreset tests
โ   โโโ test_rcc_.cpp                # RCC tree tests
โ   โโโ test_utils.cpp               # Utility tests
โ   โโโ test_video_io_.cpp           # Video I/O tests
โโโ CMakeLists.txt                   # Build configuration
โโโ LICENSE                          # MIT License
โโโ README.md                        # This file
The RCC tree enables efficient streaming K-means by:
- Leaf Insertion: New frame coresets inserted with carry propagation
- Node Merging: Weighted coreset combination with bounded size
- Root Computation: Dynamic merging of all levels for comprehensive representation
- Memory Bounds: Tree height limited to prevent unbounded growth
- Multi-threaded: Lock-free design with const references and exclusive write regions
- MPI: Collective operations (MPI_Bcast, MPI_Gatherv) with hybrid OpenMP parallelization
- CUDA: Host-device synchronization with cudaDeviceSynchronize() barriers
- Memory Requirements: CUDA backend requires sufficient GPU memory for large images
- Network Dependency: MPI performance varies with network latency and bandwidth
- K-value Scaling: All backends show performance degradation with very high cluster counts
- Hardware Specific: Optimal performance depends on specific hardware configuration
- Adaptive Coreset Sizing: Dynamic adjustment based on image complexity
- Additional Color Spaces: Support for HSV, LAB, and other color representations
- Temporal Coherence: Frame-to-frame consistency improvements
- Mobile Optimization: ARM NEON and mobile GPU backend support
- Cloud Integration: Distributed processing across cloud instances
- OpenCV Team โ For comprehensive computer vision library and excellent documentation
- NVIDIA CUDA โ For GPU computing platform and development tools
- Open MPI Project โ For high-performance message passing interface
- CMake Community โ For cross-platform build system
- Research Community โ For foundational work on coreset algorithms and RCC trees
- Feldman, D., Schmidt, M., & Sohler, C. (2013). Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering
- Bachem, O., Lucic, M., & Krause, A. (2017). Practical coreset constructions for machine learning
This project is licensed under the MIT License. See the LICENSE file for details.
Questions, feedback, or ideas? Reach out anytime at [email protected].