Skip to content

Real-time C++ K-means image segmentation on live video streams, using OpenCV, RCC trees, and 5D features, optimized for consumer hardware with Sequential, Multi-threaded, MPI, and CUDA backends.

License

Notifications You must be signed in to change notification settings

dosqas/realtime-parallel-kmeans-segmentation

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

30 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

๐ŸŽจ Real-Time Parallel K-Means Image Segmentation

A high-performance computer vision system that performs real-time image segmentation using K-Means clustering with multiple parallel backends for optimal performance across different hardware configurations.

Project demo Real-time parallel K-means clustering: original vs segmented frames with dynamic K-value control via slider


โœจ Key Features

  • ๐Ÿš€ Real-Time Performance: Up to 55+ FPS with CUDA backend on live webcam feeds
  • ๐Ÿ”ง Multiple Parallel Backends: Sequential, Multi-threaded, MPI + OpenMP, and CUDA implementations
  • ๐ŸŒณ RCC Tree Optimization: Recursive Cached Coreset tree for efficient streaming segmentation
  • ๐ŸŽฏ 5D Feature Space: Combines color (BGR) and spatial (x,y) features for coherent segmentation
  • โšก Dynamic Backend Switching: Switch between backends in real-time with keyboard shortcuts
  • ๐Ÿ“Š Performance Monitoring: Live FPS tracking with min/max statistics
  • ๐Ÿ–ผ๏ธ Interactive Controls: Adjustable K-value slider and side-by-side visualization

๐Ÿง  Technical Overview

This project implements an advanced K-Means clustering system optimized for real-time image segmentation with multiple parallelization strategies:

  • Core Algorithm: K-Means clustering adapted for image segmentation
  • Feature Engineering: 5D vectors combining color similarity and spatial proximity
  • Coreset Sampling: Reduces computational complexity from O(nยทkยทt) to O(sยทkยทt), where n = total pixels, s = coreset size (s โ‰ช n)
  • RCC Tree Structure: Maintains $(1 \pm \epsilon)$-approximation with bounded memory
  • Hardware Optimization: Leverages multi-core CPUs, distributed systems, and GPUs

๐Ÿ“Š Performance Benchmarks

FPS Performance by Backend and K-value

Backend K=2 (Min FPS) K=2 (Max FPS) K=12 (Min FPS) K=12 (Max FPS) Performance Ratio
Sequential 15 17 5 6 1.0ร— (baseline)
Multi-threaded 14 44 10 22 2.4ร— average
MPI 17 44 13 21 2.6ร— average
CUDA 14 55 15 44 3.2ร— average

Performance Characteristics

Performance Improvement Factor (vs Sequential):
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ CUDA:          โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 3.2ร—            โ”‚
โ”‚ MPI:           โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 2.6ร—              โ”‚
โ”‚ Multi-thread:  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 2.4ร—                โ”‚
โ”‚ Sequential:    โ–ˆโ–ˆโ–ˆโ–ˆ 1.0ร— (baseline)             โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

๐Ÿ—๏ธ Architecture Details

Algorithmic Complexity

Component Complexity Notes
Sequential K-Means $O(n \cdot k \cdot t)$ Baseline implementation
Coreset K-Means $O(s \cdot k \cdot t)$ Reduced complexity with $s \ll n$
RCC Tree Insertion $O(s \cdot \log N)$ Streaming update per frame
RCC Tree Merging $O(s)$ Weighted merge operation
GPU Assignment $O(n / \text{cores})$ Massive parallelization

Backend-Specific Implementations

๐Ÿ”„ Multi-threaded (std::thread)

  • Strategy: Row-based work distribution across CPU threads
  • Synchronization: Lock-free design with barrier synchronization
  • Memory Safety: Read-only sharing with exclusive write regions
  • Best For: Multi-core desktop systems

๐ŸŒ Distributed (MPI + OpenMP)

  • Strategy: Master-Worker pattern combining process-level (MPI) distribution of rows with thread-level (OpenMP) parallelization within each process.
  • Communication: Uses MPI_Bcast to distribute cluster centers, frame data, and control parameters (k, dimensions, stop flag) and MPI_Gatherv for efficient, variable-sized result aggregation.
  • Hybrid Approach: MPI collective calls act as synchronization barriers, ensuring data consistency across the network.
  • Best For: HPC clusters and large distributed systems

๐ŸŽฎ GPU-accelerated (CUDA)

  • Strategy: One thread per pixel for maximum throughput
  • Memory Management: Efficient host-device transfers
  • Synchronization: cudaDeviceSynchronize() for completion barriers
  • Best For: High-throughput applications with GPU hardware

๐Ÿš€ Getting Started

Prerequisites

# System Requirements
- C++17 compatible compiler (GCC/MSVC/Clang)
- CMake 3.18+
- OpenCV 4.0+
- CUDA Toolkit (for GPU backend)
- MPI implementation (e.g., OpenMPI/MS-MPI for distributed backend)

Building the Project

# Clone the repository
git clone https://github.com/dosqas/realtime-parallel-kmeans-segmentation.git
cd realtime-parallel-kmeans-segmentation

# Create build directory
mkdir build && cd build

# Configure with CMake
cmake ..

# Build the project
cmake --build . --config Release

# Change to the output directory
cd out/build/x64-Debug

# Run the application ()
mpiexec -n <number_of_processes> realtime_parallel_kmeans_segmentation.exe

Runtime Controls

Key Action
'1' Switch to Sequential backend
'2' Switch to Multi-threaded backend
'3' Switch to MPI backend
'4' Switch to CUDA backend
ESC Exit application
K Slider Adjust cluster count (2-12)

๐Ÿ”ง Configuration

Algorithm Parameters

// Default configuration values
const int k_min = 1;           // Minimum clusters
const int k_max = 12;          // Maximum clusters
const int sample_size = 2000;  // Coreset size
const float color_scale = 1.0f;   // Color feature scaling
const float spatial_scale = 0.5f; // Spatial feature scaling

RCC Tree Settings

const int max_levels = 8;      // Maximum tree height
const int default_sample = 2000; // Default coreset size

๐Ÿ“ˆ Real-Time Performance Thresholds

FPS Range Classification User Experience Recommended Backends
30+ FPS Excellent Smooth real-time CUDA, Multi-threaded, MPI Hybrid (Kโ‰ค8)
20-30 FPS Good Acceptable real-time Multi-threaded, MPI (Kโ‰ค6)
10-20 FPS Fair Noticeable lag All backends (Kโ‰ค4)
<10 FPS Poor Choppy playback Sequential only (high K)

๐ŸŽฏ Use Cases

๐ŸŽฌ Video Processing

  • Real-time video segmentation for streaming applications
  • Live broadcast effects and background replacement
  • Content creation and video editing workflows

๐Ÿค– Computer Vision Research

  • Baseline implementation for segmentation algorithms
  • Performance benchmarking across different hardware
  • Educational demonstrations of parallel computing concepts

๐Ÿฅ Medical Imaging

  • Real-time analysis of medical imagery
  • Interactive segmentation for diagnostic applications
  • High-throughput batch processing of medical data

๐ŸŽฎ Interactive Applications

  • Real-time augmented reality applications
  • Interactive art installations
  • Gaming and entertainment systems

๐Ÿ› ๏ธ Customization

Adding New Backends

// In clustering_backends.hpp
enum Backend { 
    BACKEND_SEQ = 0, 
    BACKEND_CUDA = 1, 
    BACKEND_THR = 2, 
    BACKEND_MPI = 3,
    BACKEND_CUSTOM = 4  // Your custom backend
};

// Implement your backend function
cv::Mat segmentFrameWithKMeans_custom(
    const cv::Mat& frame, int k, int sample_size,
    float color_scale, float spatial_scale);

Tuning Performance

// Adjust coreset sampling for speed vs quality trade-off
const int fast_sample_size = 1000;    // Faster, lower quality
const int quality_sample_size = 5000; // Slower, higher quality

// Modify feature scaling for different segmentation characteristics
const float color_emphasis = 2.0f;    // Emphasize color similarity
const float spatial_emphasis = 0.1f;  // De-emphasize spatial proximity

๐Ÿ“‚ Project Structure

realtime-parallel-kmeans-segmentation/
โ”œโ”€โ”€ ๐Ÿ“ include/                      # Header files
โ”‚   โ”œโ”€โ”€ clustering.hpp               # Main clustering interface
โ”‚   โ”œโ”€โ”€ clustering_backends.hpp      # Backend implementations
โ”‚   โ”œโ”€โ”€ coreset.hpp                  # Coreset data structures
โ”‚   โ”œโ”€โ”€ rcc.hpp                      # RCC tree implementation
โ”‚   โ”œโ”€โ”€ utils.hpp                    # Utility functions
โ”‚   โ””โ”€โ”€ video_io.hpp                 # Video I/O interface
โ”œโ”€โ”€ ๐Ÿ“ src/                          # Source files
โ”‚   โ”œโ”€โ”€ ๐Ÿ“ clustering/               # Backend implementations
โ”‚   โ”‚   โ”œโ”€โ”€ clustering_cuda.cu       # CUDA GPU backend
โ”‚   โ”‚   โ”œโ”€โ”€ clustering_entry.cpp     # Backend dispatcher
โ”‚   โ”‚   โ”œโ”€โ”€ clustering_mpi.cpp       # MPI distributed backend
โ”‚   โ”‚   โ”œโ”€โ”€ clustering_seq.cpp       # Sequential CPU backend
โ”‚   โ”‚   โ””โ”€โ”€ clustering_thr.cpp       # Multi-threaded backend
โ”‚   โ”œโ”€โ”€ coreset.cpp                  # Coreset algorithms
โ”‚   โ”œโ”€โ”€ main.cpp                     # Application entry point
โ”‚   โ”œโ”€โ”€ rcc.cpp                      # RCC tree implementation
โ”‚   โ”œโ”€โ”€ utils.cpp                    # Utility functions
โ”‚   โ””โ”€โ”€ video_io.cpp                 # Video I/O implementation
โ”œโ”€โ”€ ๐Ÿ“ docs/                         # Documentation
โ”‚   โ”œโ”€โ”€ project__demo.gif            # Program demonstration GIF
โ”œโ”€โ”€ ๐Ÿ“ docs/                         # Documentation
โ”‚   โ”œโ”€โ”€ algorithms.md                # Algorithm descriptions
โ”‚   โ”œโ”€โ”€ parallelization.md           # Synchronization details
โ”‚   โ””โ”€โ”€ performance.md               # Performance analysis
โ”œโ”€โ”€ ๐Ÿ“ tests/                        # Test files
โ”‚   โ”œโ”€โ”€ test_clustering.cpp          # Clustering tests
โ”‚   โ”œโ”€โ”€ test_coreset.cpp             # Coreset tests
โ”‚   โ”œโ”€โ”€ test_rcc_.cpp                # RCC tree tests
โ”‚   โ”œโ”€โ”€ test_utils.cpp               # Utility tests
โ”‚   โ””โ”€โ”€ test_video_io_.cpp           # Video I/O tests
โ”œโ”€โ”€ CMakeLists.txt                   # Build configuration
โ”œโ”€โ”€ LICENSE                          # MIT License
โ””โ”€โ”€ README.md                        # This file

๐Ÿ”ฌ Technical Deep Dive

Recursive Cached Coreset (RCC) Tree

The RCC tree enables efficient streaming K-means by:

  1. Leaf Insertion: New frame coresets inserted with carry propagation
  2. Node Merging: Weighted coreset combination with bounded size
  3. Root Computation: Dynamic merging of all levels for comprehensive representation
  4. Memory Bounds: Tree height limited to prevent unbounded growth

Synchronization Strategies

  • Multi-threaded: Lock-free design with const references and exclusive write regions
  • MPI: Collective operations (MPI_Bcast, MPI_Gatherv) with hybrid OpenMP parallelization
  • CUDA: Host-device synchronization with cudaDeviceSynchronize() barriers

๐Ÿงช Known Limitations

  1. Memory Requirements: CUDA backend requires sufficient GPU memory for large images
  2. Network Dependency: MPI performance varies with network latency and bandwidth
  3. K-value Scaling: All backends show performance degradation with very high cluster counts
  4. Hardware Specific: Optimal performance depends on specific hardware configuration

๐Ÿ”ฎ Possible Future Enhancements

  • Adaptive Coreset Sizing: Dynamic adjustment based on image complexity
  • Additional Color Spaces: Support for HSV, LAB, and other color representations
  • Temporal Coherence: Frame-to-frame consistency improvements
  • Mobile Optimization: ARM NEON and mobile GPU backend support
  • Cloud Integration: Distributed processing across cloud instances

๐Ÿ™ Acknowledgments

  • OpenCV Team โ€“ For comprehensive computer vision library and excellent documentation
  • NVIDIA CUDA โ€“ For GPU computing platform and development tools
  • Open MPI Project โ€“ For high-performance message passing interface
  • CMake Community โ€“ For cross-platform build system
  • Research Community โ€“ For foundational work on coreset algorithms and RCC trees

Key References

  • Feldman, D., Schmidt, M., & Sohler, C. (2013). Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering
  • Bachem, O., Lucic, M., & Krause, A. (2017). Practical coreset constructions for machine learning

๐Ÿ“„ License

This project is licensed under the MIT License. See the LICENSE file for details.


๐Ÿ’ก Contact

Questions, feedback, or ideas? Reach out anytime at [email protected].

About

Real-time C++ K-means image segmentation on live video streams, using OpenCV, RCC trees, and 5D features, optimized for consumer hardware with Sequential, Multi-threaded, MPI, and CUDA backends.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published