diff --git a/README.md b/README.md
index e517ef6..47f8631 100644
--- a/README.md
+++ b/README.md
@@ -10,10 +10,32 @@
How can we develop zero-shot detection of machine generated codes? Welcome to the repository for the research paper: **"Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers."** Our paper has been accepted to the 47th International Conference on Software Engineering (**ICSE 2025**).
+## NEW: C++ Edition with GUI
+
+We've created an enhanced version of DetectCodeGPT specifically for analyzing C++ projects! Features include:
+
+- 🚀 **GPU Acceleration**: Automatic GPU detection (CUDA/MPS) for faster processing
+- 🖥️ **User-Friendly GUI**: Intuitive interface for configuration and monitoring
+- 📁 **Multi-Project Support**: Scan entire directories with multiple C++ projects
+- 📊 **Comprehensive Reporting**: JSON output with detailed scores and human-readable reports
+- ⚡ **CLI Mode**: Command-line interface for automation
+
+**Quick Start:**
+
+```bash
+cd cpp-detector
+python main_app.py # Launch GUI
+# OR
+python main_app.py --cli --directory /path/to/cpp/projects # CLI mode
+```
+
+See [cpp-detector/README.md](cpp-detector/README.md) for detailed documentation.
+
## Table of Contents
- [DetectCodeGPT](#detectcodegpt)
- [Table of Contents](#table-of-contents)
+ - [NEW: C++ Edition with GUI](#new-c-edition-with-gui)
- [Getting Started](#getting-started)
- [Prerequisites](#prerequisites)
- [Data Preparation](#data-preparation)
diff --git a/cpp-detector/README.md b/cpp-detector/README.md
new file mode 100644
index 0000000..db2063b
--- /dev/null
+++ b/cpp-detector/README.md
@@ -0,0 +1,302 @@
+# DetectCodeGPT C++ Edition
+
+An enhanced version of DetectCodeGPT specifically designed to detect AI-generated C++ code across multiple projects. Features GPU acceleration, a user-friendly GUI, and comprehensive JSON reporting.
+
+## Features
+
+- **Multi-Project Support**: Scan entire directories containing multiple C++ projects
+- **GPU Acceleration**: Automatically detects and uses available GPUs (CUDA, MPS) for faster processing
+- **User-Friendly GUI**: Intuitive interface for configuration and real-time progress monitoring
+- **CLI Mode**: Command-line interface for automation and scripting
+- **Comprehensive Output**: JSON results with detailed scores and human-readable reports
+- **Flexible Configuration**: Customizable detection parameters and model selection
+
+## Installation
+
+### Prerequisites
+
+- Python 3.9.7 or higher
+- CUDA-capable GPU (optional, but recommended for better performance)
+
+### Install Dependencies
+
+From the repository root:
+
+```bash
+pip install -r requirements.txt
+```
+
+Or install specific requirements for the C++ detector:
+
+```bash
+cd cpp-detector
+pip install -r requirements.txt
+```
+
+## Usage
+
+### GUI Mode (Recommended)
+
+Launch the graphical interface:
+
+```bash
+cd cpp-detector
+python main_app.py
+```
+
+The GUI provides:
+- **Configuration Tab**: Set up directories, models, and detection parameters
+- **Processing Tab**: Monitor real-time progress and logs
+- **Results Tab**: View detection results and export reports
+
+### CLI Mode
+
+For automation or headless environments:
+
+```bash
+# Basic usage
+python main_app.py --cli --directory /path/to/cpp/projects
+
+# With custom settings
+python main_app.py --cli \
+ --directory /path/to/cpp/projects \
+ --output results.json \
+ --report report.txt \
+ --n-perturbations 100 \
+ --device cuda
+
+# Show available devices
+python main_app.py --device-info
+```
+
+### CLI Options
+
+```
+--cli Run in CLI mode instead of GUI
+--directory, -d Directory to scan for C++ files (required for CLI)
+--output, -o Output JSON file path
+--report, -r Generate text report at specified path
+--base-model Base model for scoring (default: codellama/CodeLlama-7b-hf)
+--mask-model Mask filling model (default: Salesforce/codet5p-770m)
+--device Device to use: cuda, cpu, or mps (default: auto-detect)
+--n-perturbations Number of perturbations per sample (default: 50)
+--batch-size Batch size for processing (default: 10)
+--min-size Minimum file size in bytes (default: 100)
+--max-size Maximum file size in bytes (default: 100000)
+--max-files Maximum files to process, 0 for all (default: 0)
+--device-info Show device information and exit
+--debug Enable debug mode with full tracebacks
+```
+
+## How It Works
+
+DetectCodeGPT C++ Edition uses a multi-step detection process (sketched in code after this list):
+
+1. **File Scanning**: Recursively scans directories for C++ files (.cpp, .hpp, .h, .cc, etc.)
+2. **Project Organization**: Groups files by project based on build system files (CMakeLists.txt, Makefile, etc.)
+3. **Code Preprocessing**: Normalizes and prepares C++ code for analysis
+4. **Perturbation Generation**: Creates multiple perturbed versions of each code sample
+5. **Likelihood Scoring**: Computes log-likelihood and rank scores using language models
+6. **Detection**: Applies DetectCodeGPT algorithm to classify code as AI-generated or human-written
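+
+A minimal end-to-end sketch of these steps, using the classes shipped in this directory (paths are illustrative; run from within `cpp-detector/`):
+
+```python
+from batch_processor import BatchProcessor
+
+# Steps 1-2 (scanning, project grouping) happen inside process_directory;
+# steps 3-6 run per file via CppCodeDetector.detect_single.
+processor = BatchProcessor(n_perturbations=50)
+results = processor.process_directory(
+    "/path/to/cpp/projects",
+    output_path="results.json",
+)
+print(results.get("summary"))
+```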
+
+### Detection Scores
+
+The tool reports three main detection scores (formulas sketched after this list):
+
+- **DetectCodeGPT Score (NPR)**: Normalized Perturbation Rank - primary metric
+ - Values > 1.0: Likely AI-generated
+ - Values 0.95-1.0: Possibly AI-generated
+ - Values < 0.95: Likely human-written
+
+- **DetectGPT Score**: Curvature-based detection from original DetectGPT paper
+- **LRR Score**: Log-likelihood to Rank Ratio
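+
+In code terms, mirroring the computation in `cpp_detector.py` (a sketch over precomputed per-file quantities; the thresholds above are heuristic defaults and should be calibrated for your codebase):
+
+```python
+import numpy as np
+
+def detection_scores(original_ll, original_logrank,
+                     perturbed_lls, perturbed_logranks):
+    """The three scores from precomputed (log-)likelihoods and log-ranks."""
+    npr = np.mean(perturbed_logranks) / original_logrank   # DetectCodeGPT (NPR)
+    detectgpt = (original_ll - np.mean(perturbed_lls)) / np.std(perturbed_lls)
+    lrr = -original_ll / original_logrank                  # LRR
+    return {"detectcodegpt_score": npr,
+            "detectgpt_score": detectgpt,
+            "lrr_score": lrr}
+```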
+
+## Output Format
+
+### JSON Output
+
+The tool generates a comprehensive JSON file with:
+
+```json
+{
+ "metadata": {
+ "root_path": "/path/to/projects",
+ "timestamp": "2025-11-08T12:00:00",
+ "processing_time_seconds": 123.45,
+ "files_processed": 50,
+ "device_info": {...}
+ },
+ "summary": {
+ "total_analyzed": 50,
+ "likely_ai_generated": 5,
+ "possibly_ai_generated": 3,
+ "likely_human_written": 42,
+ "ai_percentage": 10.0
+ },
+ "projects": {
+ "project_name": [
+ {
+ "file_path": "/full/path/to/file.cpp",
+ "relative_path": "src/file.cpp",
+ "detection": {
+ "detectcodegpt_score": 1.05,
+ "prediction": "likely_ai_generated",
+ "confidence": "high",
+ ...
+ }
+ }
+ ]
+ }
+}
+```
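+
+The JSON can be reloaded and filtered programmatically, e.g. to list flagged files per project (a sketch using the `BatchProcessor.load_json` helper):
+
+```python
+from batch_processor import BatchProcessor
+
+results = BatchProcessor.load_json("results.json")
+for project, entries in results["projects"].items():
+    flagged = [e["relative_path"] for e in entries
+               if e.get("detection", {}).get("prediction") == "likely_ai_generated"]
+    print(f"{project}: {len(flagged)} flagged", flagged)
+```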
+
+### Text Report
+
+Generate a human-readable summary:
+
+```
+================================================================================
+C++ AI-Generated Code Detection Report
+================================================================================
+
+Root Path: /path/to/projects
+Timestamp: 2025-11-08T12:00:00
+Processing Time: 123.45 seconds
+Device: cuda
+
+Summary:
+ Total Files Analyzed: 50
+ Likely AI-Generated: 5 (10.0%)
+ Possibly AI-Generated: 3
+ Likely Human-Written: 42
+ Errors: 0
+
+Projects:
+ project1:
+ Files: 25
+ Likely AI-Generated: 2
+ Suspicious files:
+ - src/algorithm.cpp (score: 1.0234)
+ - src/utils.cpp (score: 1.0156)
+```
+
+## Supported C++ File Types
+
+- `.cpp` - C++ source files
+- `.cc` - C++ source files (alternative extension)
+- `.cxx` - C++ source files (alternative extension)
+- `.c++` - C++ source files (alternative extension)
+- `.hpp` - C++ header files
+- `.h` - C/C++ header files
+- `.hh` - C++ header files (alternative extension)
+- `.hxx` - C++ header files (alternative extension)
+- `.h++` - C++ header files (alternative extension)
+
+## GPU Support
+
+The tool automatically detects and uses available GPUs:
+
+- **NVIDIA GPUs**: CUDA support (recommended)
+- **Apple Silicon**: Metal Performance Shaders (MPS)
+- **CPU Fallback**: Works without GPU but slower
+
+Check GPU availability:
+
+```bash
+python main_app.py --device-info
+```
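+
+The same information is available programmatically via `GPUManager` (run from within `cpp-detector/`):
+
+```python
+from cpp_detector import GPUManager
+
+print(GPUManager.get_device())  # 'cuda', 'mps', or 'cpu'
+for gpu in GPUManager.get_device_info()["devices"]:
+    print(gpu["name"], f"{gpu['total_memory'] / 1024**3:.1f} GB")
+```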
+
+## Model Options
+
+### Base Models (for scoring)
+
+- `codellama/CodeLlama-7b-hf` (default, recommended for C++)
+- `codellama/CodeLlama-13b-hf` (larger, more accurate, slower)
+- `Salesforce/codegen-2B-mono`
+- `microsoft/CodeGPT-small-py`
+
+### Mask Filling Models
+
+- `Salesforce/codet5p-770m` (default, recommended)
+- `Salesforce/codet5-base`
+- `Salesforce/CodeT5-large`
+
+## Performance Tips
+
+1. **Use GPU**: Significant speedup with CUDA-capable GPU
+2. **Adjust Batch Size**: Increase for faster processing (if memory allows)
+3. **Reduce Perturbations**: Lower `n-perturbations` for faster but less accurate results
+4. **File Size Limits**: Adjust `min-size` and `max-size` to focus on relevant files
+5. **File Limit**: Use `max-files` for quick testing on large codebases
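+
+The same knobs are exposed programmatically on `BatchProcessor` (a sketch with illustrative values):
+
+```python
+from batch_processor import BatchProcessor
+
+processor = BatchProcessor(batch_size=20, n_perturbations=25)  # tips 2-3
+results = processor.process_directory(
+    "/path/to/large_codebase",
+    min_size=200,    # tip 4: skip trivial files
+    max_size=50000,
+    max_files=10,    # tip 5: quick sampling run
+)
+```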
+
+## Troubleshooting
+
+### Out of Memory Errors
+
+- Reduce `batch-size`
+- Reduce `n-perturbations`
+- Use a smaller base model
+- Process fewer files at once with `max-files`
+
+### Slow Processing
+
+- Ensure GPU is being used (check with `--device-info`)
+- Increase `batch-size` if memory allows
+- Reduce `n-perturbations`
+
+### Import Errors
+
+Make sure all dependencies are installed:
+
+```bash
+pip install -r ../requirements.txt
+```
+
+## Examples
+
+### Scan a single project
+
+```bash
+python main_app.py --cli --directory ~/my_cpp_project --output results.json
+```
+
+### Scan with detailed reporting
+
+```bash
+python main_app.py --cli \
+ --directory ~/cpp_projects \
+ --output results.json \
+ --report detailed_report.txt \
+ --n-perturbations 100
+```
+
+### Quick scan for testing
+
+```bash
+python main_app.py --cli \
+ --directory ~/large_codebase \
+ --max-files 10 \
+ --n-perturbations 25
+```
+
+## License
+
+This project extends DetectCodeGPT and is licensed under the MIT License. See the LICENSE file in the repository root for details.
+
+## Citation
+
+If you use this tool in your research, please cite the original DetectCodeGPT paper:
+
+```bibtex
+@inproceedings{shi2025detectcodegpt,
+ title={Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers},
+ author={Shi, Yuling and Zhang, Hongyu and Wan, Chengcheng and Gu, Xiaodong},
+ booktitle={Proceedings of the 47th International Conference on Software Engineering (ICSE 2025)},
+ year={2025},
+ organization={IEEE}
+}
+```
+
+## Acknowledgements
+
+Based on DetectCodeGPT by Shi et al., with enhancements for C++ code analysis, GPU acceleration, and user interface improvements.
diff --git a/cpp-detector/__init__.py b/cpp-detector/__init__.py
new file mode 100644
index 0000000..7d8008f
--- /dev/null
+++ b/cpp-detector/__init__.py
@@ -0,0 +1,19 @@
+"""
+DetectCodeGPT C++ Edition
+AI-generated C++ code detection with GPU acceleration and GUI.
+"""
+
+__version__ = "1.0.0"
+__author__ = "DetectCodeGPT Contributors"
+
+# Note: "cpp-detector" is not a valid Python package name (hyphen), so these
+# modules are imported as top-level modules; scripts in this directory are
+# expected to run with it on sys.path (e.g., from within cpp-detector/).
+from cpp_detector import CppCodeDetector, GPUManager
+from file_scanner import CppFileScanner, scan_cpp_files
+from batch_processor import BatchProcessor
+
+__all__ = [
+ 'CppCodeDetector',
+ 'GPUManager',
+ 'CppFileScanner',
+ 'scan_cpp_files',
+ 'BatchProcessor'
+]
diff --git a/cpp-detector/batch_processor.py b/cpp-detector/batch_processor.py
new file mode 100644
index 0000000..1668606
--- /dev/null
+++ b/cpp-detector/batch_processor.py
@@ -0,0 +1,323 @@
+"""
+Batch Processor and JSON Output Module
+Processes multiple C++ files and exports results to JSON.
+"""
+
+import json
+import time
+from datetime import datetime
+from pathlib import Path
+from typing import Dict, List, Optional
+from loguru import logger
+from tqdm import tqdm
+
+from file_scanner import scan_cpp_files
+from cpp_detector import CppCodeDetector, GPUManager
+
+
+class BatchProcessor:
+ """Processes multiple C++ files and generates reports."""
+
+ def __init__(self,
+ base_model_name: str = "codellama/CodeLlama-7b-hf",
+ mask_filling_model_name: str = "Salesforce/codet5p-770m",
+ device: Optional[str] = None,
+ batch_size: int = 10,
+ n_perturbations: int = 50):
+ """
+ Initialize the batch processor.
+
+ Args:
+ base_model_name: Base model for scoring
+ mask_filling_model_name: Mask filling model
+ device: Device to use (None for auto-detect)
+ batch_size: Batch size for processing
+ n_perturbations: Number of perturbations per sample
+ """
+ self.detector = CppCodeDetector(
+ base_model_name=base_model_name,
+ mask_filling_model_name=mask_filling_model_name,
+ device=device,
+ batch_size=batch_size,
+ n_perturbations=n_perturbations
+ )
+
+ def process_directory(self,
+ root_path: str,
+ output_path: Optional[str] = None,
+ min_size: int = 100,
+ max_size: int = 100000,
+ max_files: Optional[int] = None) -> Dict:
+ """
+ Process all C++ files in a directory.
+
+ Args:
+ root_path: Root directory to scan
+ output_path: Path to save JSON output (optional)
+ min_size: Minimum file size in bytes
+ max_size: Maximum file size in bytes
+ max_files: Maximum number of files to process (None for all)
+
+ Returns:
+ Dictionary with all results
+ """
+ logger.info(f"Scanning directory: {root_path}")
+
+ # Scan for C++ files
+ scan_result = scan_cpp_files(
+ root_path,
+ organize=True,
+ min_size=min_size,
+ max_size=max_size
+ )
+
+ if scan_result['total_files'] == 0:
+ logger.warning("No C++ files found in directory")
+ return {
+ 'error': 'No C++ files found',
+ 'root_path': root_path,
+ 'timestamp': datetime.now().isoformat()
+ }
+
+ logger.info(f"Found {scan_result['total_files']} C++ files in {scan_result['project_count']} projects")
+
+ # Load models once
+ logger.info("Loading detection models...")
+ self.detector.load_models()
+
+ # Process files
+ start_time = time.time()
+ results_by_project = {}
+
+ files_processed = 0
+ for project_name, files in scan_result['projects'].items():
+ logger.info(f"\nProcessing project: {project_name} ({len(files)} files)")
+
+ project_results = []
+
+ for file_info in tqdm(files, desc=f"Processing {project_name}"):
+ if max_files and files_processed >= max_files:
+ logger.info(f"Reached maximum file limit: {max_files}")
+ break
+
+ try:
+ # Detect AI-generated code
+ detection_result = self.detector.detect_single(file_info['content'])
+
+ # Combine file info and detection result
+ combined_result = {
+ 'file_path': file_info['file_path'],
+ 'relative_path': file_info['relative_path'],
+ 'file_name': file_info['file_name'],
+ 'project': project_name,
+ 'detection': detection_result,
+ 'timestamp': datetime.now().isoformat()
+ }
+
+ project_results.append(combined_result)
+ files_processed += 1
+
+ except Exception as e:
+ logger.error(f"Error processing {file_info['file_path']}: {e}")
+ project_results.append({
+ 'file_path': file_info['file_path'],
+ 'relative_path': file_info['relative_path'],
+ 'file_name': file_info['file_name'],
+ 'project': project_name,
+ 'error': str(e),
+ 'timestamp': datetime.now().isoformat()
+ })
+
+ results_by_project[project_name] = project_results
+
+ if max_files and files_processed >= max_files:
+ break
+
+ processing_time = time.time() - start_time
+
+ # Compile final results
+ final_results = self._compile_results(
+ root_path=root_path,
+ scan_result=scan_result,
+ results_by_project=results_by_project,
+ processing_time=processing_time,
+ files_processed=files_processed
+ )
+
+ # Save to JSON if output path specified
+ if output_path:
+ self.save_json(final_results, output_path)
+
+ # Unload models to free memory
+ self.detector.unload_models()
+
+ return final_results
+
+ def _compile_results(self,
+ root_path: str,
+ scan_result: Dict,
+ results_by_project: Dict,
+ processing_time: float,
+ files_processed: int) -> Dict:
+ """
+ Compile all results into a structured format.
+
+ Args:
+ root_path: Root directory path
+ scan_result: Results from file scanning
+ results_by_project: Detection results organized by project
+ processing_time: Total processing time
+ files_processed: Number of files processed
+
+ Returns:
+ Compiled results dictionary
+ """
+ # Calculate statistics
+ all_detections = []
+ for project_results in results_by_project.values():
+ for result in project_results:
+ if 'detection' in result:
+ all_detections.append(result['detection'])
+
+ ai_generated_count = sum(1 for d in all_detections
+ if d.get('prediction') == 'likely_ai_generated')
+ possibly_ai_count = sum(1 for d in all_detections
+ if d.get('prediction') == 'possibly_ai_generated')
+ human_written_count = sum(1 for d in all_detections
+ if d.get('prediction') == 'likely_human_written')
+
+ # Get device info
+ device_info = GPUManager.get_device_info()
+
+ return {
+ 'metadata': {
+ 'root_path': root_path,
+ 'timestamp': datetime.now().isoformat(),
+ 'processing_time_seconds': round(processing_time, 2),
+ 'files_scanned': scan_result['total_files'],
+ 'files_processed': files_processed,
+ 'projects_count': len(results_by_project),
+ 'device_info': device_info,
+ 'detector_config': {
+ 'base_model': self.detector.base_model_name,
+ 'mask_filling_model': self.detector.mask_filling_model_name,
+ 'device': self.detector.device,
+ 'n_perturbations': self.detector.n_perturbations,
+ 'batch_size': self.detector.batch_size
+ }
+ },
+ 'summary': {
+ 'total_analyzed': len(all_detections),
+ 'likely_ai_generated': ai_generated_count,
+ 'possibly_ai_generated': possibly_ai_count,
+ 'likely_human_written': human_written_count,
+ 'errors': files_processed - len(all_detections),
+ 'ai_percentage': round(100 * ai_generated_count / len(all_detections), 2)
+ if all_detections else 0,
+ },
+ 'projects': results_by_project,
+ 'scan_info': {
+ 'total_files_found': scan_result['total_files'],
+ 'project_count': scan_result['project_count']
+ }
+ }
+
+ @staticmethod
+ def save_json(data: Dict, output_path: str, indent: int = 2):
+ """
+ Save results to JSON file.
+
+ Args:
+ data: Data to save
+ output_path: Output file path
+ indent: JSON indentation level
+ """
+ output_file = Path(output_path)
+ output_file.parent.mkdir(parents=True, exist_ok=True)
+
+ logger.info(f"Saving results to: {output_path}")
+
+ with open(output_file, 'w', encoding='utf-8') as f:
+ json.dump(data, f, indent=indent, ensure_ascii=False)
+
+ logger.info(f"Results saved successfully ({output_file.stat().st_size} bytes)")
+
+ @staticmethod
+ def load_json(input_path: str) -> Dict:
+ """
+ Load results from JSON file.
+
+ Args:
+ input_path: Input file path
+
+ Returns:
+ Loaded data dictionary
+ """
+ logger.info(f"Loading results from: {input_path}")
+
+ with open(input_path, 'r', encoding='utf-8') as f:
+ data = json.load(f)
+
+ return data
+
+ def generate_report(self, results: Dict, output_path: Optional[str] = None) -> str:
+ """
+ Generate a human-readable text report.
+
+ Args:
+ results: Results dictionary
+ output_path: Optional path to save report
+
+ Returns:
+ Report as string
+ """
+ report_lines = []
+ report_lines.append("=" * 80)
+ report_lines.append("C++ AI-Generated Code Detection Report")
+ report_lines.append("=" * 80)
+ report_lines.append("")
+
+ # Metadata
+ metadata = results['metadata']
+ report_lines.append(f"Root Path: {metadata['root_path']}")
+ report_lines.append(f"Timestamp: {metadata['timestamp']}")
+ report_lines.append(f"Processing Time: {metadata['processing_time_seconds']} seconds")
+ report_lines.append(f"Device: {metadata['detector_config']['device']}")
+ report_lines.append("")
+
+ # Summary
+ summary = results['summary']
+ report_lines.append("Summary:")
+ report_lines.append(f" Total Files Analyzed: {summary['total_analyzed']}")
+ report_lines.append(f" Likely AI-Generated: {summary['likely_ai_generated']} ({summary['ai_percentage']}%)")
+ report_lines.append(f" Possibly AI-Generated: {summary['possibly_ai_generated']}")
+ report_lines.append(f" Likely Human-Written: {summary['likely_human_written']}")
+ report_lines.append(f" Errors: {summary['errors']}")
+ report_lines.append("")
+
+ # Projects
+ report_lines.append("Projects:")
+ for project_name, project_results in results['projects'].items():
+ report_lines.append(f"\n {project_name}:")
+
+ ai_count = sum(1 for r in project_results
+ if r.get('detection', {}).get('prediction') == 'likely_ai_generated')
+ report_lines.append(f" Files: {len(project_results)}")
+ report_lines.append(f" Likely AI-Generated: {ai_count}")
+
+ # List suspicious files
+ if ai_count > 0:
+ report_lines.append(" Suspicious files:")
+ for result in project_results:
+ if result.get('detection', {}).get('prediction') == 'likely_ai_generated':
+ score = result['detection'].get('detectcodegpt_score', 0)
+ report_lines.append(f" - {result['relative_path']} (score: {score:.4f})")
+
+ report_text = "\n".join(report_lines)
+
+ if output_path:
+ with open(output_path, 'w', encoding='utf-8') as f:
+ f.write(report_text)
+ logger.info(f"Report saved to: {output_path}")
+
+ return report_text
diff --git a/cpp-detector/cpp_detector.py b/cpp-detector/cpp_detector.py
new file mode 100644
index 0000000..df8e912
--- /dev/null
+++ b/cpp-detector/cpp_detector.py
@@ -0,0 +1,315 @@
+"""
+C++ Code Detector Module
+Adapts DetectCodeGPT to detect AI-generated C++ code.
+"""
+
+import sys
+import os
+import torch
+import numpy as np
+import math
+from pathlib import Path
+from typing import Dict, List, Optional
+from loguru import logger
+from tqdm import tqdm
+import functools
+
+# Add parent directory to path to import from code-detection
+sys.path.insert(0, str(Path(__file__).parent.parent / "code-detection"))
+
+from baselines.utils.loadmodel import load_base_model_and_tokenizer, load_mask_filling_model
+from baselines.utils.run_baseline import get_roc_metrics
+from baselines.loss import get_ll, get_lls
+from baselines.rank import get_rank, get_ranks
+
+
+class GPUManager:
+ """Manages GPU availability and device selection."""
+
+ @staticmethod
+ def get_device() -> str:
+ """
+ Detect and return the best available device.
+
+ Returns:
+ Device string: 'cuda', 'mps' (for Mac), or 'cpu'
+ """
+ if torch.cuda.is_available():
+ gpu_count = torch.cuda.device_count()
+ logger.info(f"CUDA is available with {gpu_count} GPU(s)")
+ logger.info(f"GPU Name: {torch.cuda.get_device_name(0)}")
+ return 'cuda'
+ elif torch.backends.mps.is_available():
+ logger.info("MPS (Metal Performance Shaders) is available")
+ return 'mps'
+ else:
+ logger.warning("No GPU available, using CPU")
+ return 'cpu'
+
+ @staticmethod
+ def get_device_info() -> Dict:
+ """
+ Get detailed information about available devices.
+
+ Returns:
+ Dictionary with device information
+ """
+ info = {
+ 'cuda_available': torch.cuda.is_available(),
+ 'mps_available': torch.backends.mps.is_available(),
+ 'device_count': 0,
+ 'devices': []
+ }
+
+ if torch.cuda.is_available():
+ info['device_count'] = torch.cuda.device_count()
+ for i in range(info['device_count']):
+ device_info = {
+ 'id': i,
+ 'name': torch.cuda.get_device_name(i),
+ 'total_memory': torch.cuda.get_device_properties(i).total_memory,
+ 'capability': torch.cuda.get_device_capability(i)
+ }
+ info['devices'].append(device_info)
+
+ return info
+
+
+class CppCodeDetector:
+ """Detector for AI-generated C++ code using DetectCodeGPT."""
+
+ def __init__(self,
+ base_model_name: str = "codellama/CodeLlama-7b-hf",
+ mask_filling_model_name: str = "Salesforce/codet5p-770m",
+ device: Optional[str] = None,
+ batch_size: int = 10,
+ n_perturbations: int = 50,
+ pct_words_masked: float = 0.5):
+ """
+ Initialize the C++ code detector.
+
+ Args:
+ base_model_name: Name of the base model for scoring
+ mask_filling_model_name: Name of the mask filling model
+ device: Device to use ('cuda', 'mps', 'cpu', or None for auto-detect)
+ batch_size: Batch size for processing
+ n_perturbations: Number of perturbations to generate
+ pct_words_masked: Percentage of words to mask during perturbation
+ """
+ self.base_model_name = base_model_name
+ self.mask_filling_model_name = mask_filling_model_name
+ self.device = device if device else GPUManager.get_device()
+ self.batch_size = batch_size
+ self.n_perturbations = n_perturbations
+ self.pct_words_masked = pct_words_masked
+
+ # Model config will be loaded lazily
+ self.model_config = None
+ self.models_loaded = False
+
+ logger.info(f"Initialized CppCodeDetector with device: {self.device}")
+
+ def load_models(self):
+ """Load the models for detection."""
+ if self.models_loaded:
+ return
+
+ logger.info("Loading models...")
+
+ # Create args object similar to main.py
+ class Args:
+ pass
+
+ args = Args()
+ args.mask_filling_model_name = self.mask_filling_model_name
+ args.base_model_name = self.base_model_name
+ args.DEVICE = self.device
+        args.cache_dir = os.path.expanduser("~/.cache/huggingface/hub")
+ args.int8 = False
+ args.half = False
+ args.base_half = False
+ args.mask_top_p = 1.0
+ args.mask_temperature = 1.0
+ args.batch_size = self.batch_size
+ args.chunk_size = 10
+ args.span_length = 2
+ args.buffer_size = 1
+ args.pct_words_masked = self.pct_words_masked
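+        # DetectCodeGPT's stylized perturbation: randomly inserting spaces and
+        # newlines, which preserves code semantics while disturbing style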
+ args.perturb_type = "random-insert-space+newline"
+ args.n_perturbation_rounds = 1
+
+ self.args = args
+ self.model_config = {}
+ self.model_config['cache_dir'] = args.cache_dir
+
+ # Load mask filling model
+ logger.info(f"Loading mask filling model: {self.mask_filling_model_name}")
+ self.model_config = load_mask_filling_model(args, self.mask_filling_model_name, self.model_config)
+
+ # Load base scoring model
+ logger.info(f"Loading base scoring model: {self.base_model_name}")
+ # Move mask model to CPU to free GPU memory
+ self.model_config['mask_model'] = self.model_config['mask_model'].cpu()
+ torch.cuda.empty_cache()
+
+ self.model_config = load_base_model_and_tokenizer(args, self.model_config)
+
+ self.models_loaded = True
+ logger.info("Models loaded successfully")
+
+ def _preprocess_cpp_code(self, code: str, max_tokens: int = 128) -> str:
+ """
+ Preprocess C++ code for detection.
+
+ Args:
+ code: C++ source code
+ max_tokens: Maximum number of tokens to keep
+
+ Returns:
+ Preprocessed code
+ """
+ # Remove excessive whitespace but preserve structure
+ lines = code.split('\n')
+ # Remove empty lines at start and end
+ while lines and not lines[0].strip():
+ lines.pop(0)
+ while lines and not lines[-1].strip():
+ lines.pop()
+
+ code = '\n'.join(lines)
+
+ # Truncate to max_tokens words (rough approximation)
+ words = code.split()
+ if len(words) > max_tokens:
+ code = ' '.join(words[:max_tokens])
+
+ return code
+
+ def detect_single(self, code: str) -> Dict:
+ """
+ Detect if a single code snippet is AI-generated.
+
+ Args:
+ code: C++ code to analyze
+
+ Returns:
+ Dictionary with detection results including scores and probabilities
+ """
+ if not self.models_loaded:
+ self.load_models()
+
+ # Preprocess code
+ code = self._preprocess_cpp_code(code)
+
+ # Import perturbation functions
+ from main import perturb_texts
+
+ # Generate perturbations
+ perturb_fn = functools.partial(perturb_texts, args=self.args, model_config=self.model_config)
+
+ logger.info(f"Generating {self.n_perturbations} perturbations...")
+ perturbed_texts = perturb_fn([code for _ in range(self.n_perturbations)])
+
+ # Calculate log likelihood
+ logger.info("Computing log likelihood...")
+ original_ll = get_ll(code, self.args, self.model_config)
+
+ # Calculate log rank
+ logger.info("Computing log rank...")
+ original_logrank = get_rank(code, self.args, self.model_config, log=True)
+
+ # Calculate perturbed log likelihoods
+ logger.info("Computing perturbed log likelihoods...")
+ perturbed_lls = get_lls(perturbed_texts, self.args, self.model_config)
+ perturbed_lls_clean = [ll for ll in perturbed_lls if not math.isnan(ll)]
+
+ perturbed_ll_mean = np.mean(perturbed_lls_clean) if perturbed_lls_clean else 0
+ perturbed_ll_std = np.std(perturbed_lls_clean) if len(perturbed_lls_clean) > 1 else 1
+
+ # Calculate perturbed log ranks
+ logger.info("Computing perturbed log ranks...")
+ perturbed_logranks = get_ranks(perturbed_texts, self.args, self.model_config, log=True)
+ perturbed_logranks_clean = [lr for lr in perturbed_logranks if not math.isnan(lr)]
+
+ perturbed_logrank_mean = np.mean(perturbed_logranks_clean) if perturbed_logranks_clean else 0
+
+ # Calculate detection scores
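+        # Intuition (from the DetectGPT/DetectCodeGPT line of work): machine-
+        # generated code tends to sit near a local optimum of model likelihood
+        # and rank, so perturbations degrade its scores more sharply than they
+        # degrade human-written code; hence NPR > 1 suggests machine origin.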
+ # DetectGPT score
+ detectgpt_score = (original_ll - perturbed_ll_mean) / perturbed_ll_std if perturbed_ll_std > 0 else 0
+
+ # Log Rank Ratio (LRR)
+ lrr_score = -original_ll / original_logrank if original_logrank != 0 else 0
+
+ # DetectCodeGPT score (NPR - Normalized Perturbation Rank)
+ detectcodegpt_score = perturbed_logrank_mean / original_logrank if original_logrank != 0 else 0
+
+ result = {
+ 'code_length': len(code),
+ 'word_count': len(code.split()),
+ 'line_count': len(code.splitlines()),
+ 'original_ll': float(original_ll),
+ 'original_logrank': float(original_logrank),
+ 'perturbed_ll_mean': float(perturbed_ll_mean),
+ 'perturbed_ll_std': float(perturbed_ll_std),
+ 'perturbed_logrank_mean': float(perturbed_logrank_mean),
+ 'detectgpt_score': float(detectgpt_score),
+ 'lrr_score': float(lrr_score),
+ 'detectcodegpt_score': float(detectcodegpt_score),
+ 'n_perturbations': self.n_perturbations,
+ 'n_valid_perturbations': len(perturbed_lls_clean),
+ }
+
+ # Add interpretation (higher scores suggest AI-generated)
+ # These thresholds are rough estimates and should be calibrated
+ if detectcodegpt_score > 1.0:
+ result['prediction'] = 'likely_ai_generated'
+ result['confidence'] = 'high'
+ elif detectcodegpt_score > 0.95:
+ result['prediction'] = 'possibly_ai_generated'
+ result['confidence'] = 'medium'
+ else:
+ result['prediction'] = 'likely_human_written'
+ result['confidence'] = 'medium'
+
+ return result
+
+ def detect_batch(self, codes: List[str], show_progress: bool = True) -> List[Dict]:
+ """
+ Detect AI-generated code for a batch of code snippets.
+
+ Args:
+ codes: List of C++ code strings
+ show_progress: Whether to show progress bar
+
+ Returns:
+ List of detection result dictionaries
+ """
+ results = []
+
+ iterator = tqdm(codes, desc="Detecting AI-generated code") if show_progress else codes
+
+ for code in iterator:
+ try:
+ result = self.detect_single(code)
+ results.append(result)
+ except Exception as e:
+ logger.error(f"Error detecting code: {e}")
+ results.append({
+ 'error': str(e),
+ 'prediction': 'error',
+ 'confidence': 'none'
+ })
+
+ return results
+
+ def unload_models(self):
+ """Unload models to free memory."""
+ if self.model_config:
+ if 'mask_model' in self.model_config:
+ del self.model_config['mask_model']
+ if 'base_model' in self.model_config:
+ del self.model_config['base_model']
+
+ torch.cuda.empty_cache()
+ self.models_loaded = False
+ logger.info("Models unloaded")
diff --git a/cpp-detector/file_scanner.py b/cpp-detector/file_scanner.py
new file mode 100644
index 0000000..bc418fe
--- /dev/null
+++ b/cpp-detector/file_scanner.py
@@ -0,0 +1,242 @@
+"""
+C++ File Scanner Module
+Recursively scans directories for C++ source files and organizes them by project.
+"""
+
+import os
+from pathlib import Path
+from typing import List, Dict, Set
+from loguru import logger
+
+
+class CppFileScanner:
+ """Scanner for finding C++ files in directory structures."""
+
+ # Common C++ file extensions
+ CPP_EXTENSIONS = {'.cpp', '.cc', '.cxx', '.c++', '.hpp', '.h', '.hh', '.hxx', '.h++'}
+
+ # Directories to skip during scanning
+ SKIP_DIRS = {
+ '__pycache__', '.git', '.svn', '.hg', 'node_modules',
+ 'build', 'dist', '.venv', 'venv', 'env', '.tox',
+ 'CMakeFiles', '.cmake', 'Debug', 'Release'
+ }
+
+ def __init__(self, root_path: str):
+ """
+ Initialize the scanner with a root directory.
+
+ Args:
+ root_path: Root directory to scan for C++ files
+ """
+ self.root_path = Path(root_path).resolve()
+ if not self.root_path.exists():
+ raise ValueError(f"Root path does not exist: {root_path}")
+ if not self.root_path.is_dir():
+ raise ValueError(f"Root path is not a directory: {root_path}")
+
+ def is_cpp_file(self, file_path: Path) -> bool:
+ """
+ Check if a file is a C++ source file.
+
+ Args:
+ file_path: Path to check
+
+ Returns:
+ True if file is a C++ source file
+ """
+ return file_path.suffix.lower() in self.CPP_EXTENSIONS
+
+ def should_skip_dir(self, dir_path: Path) -> bool:
+ """
+ Check if a directory should be skipped during scanning.
+
+ Args:
+ dir_path: Directory to check
+
+ Returns:
+ True if directory should be skipped
+ """
+ return dir_path.name in self.SKIP_DIRS or dir_path.name.startswith('.')
+
+ def scan_directory(self, min_size: int = 100, max_size: int = 1000000) -> List[Dict]:
+ """
+ Recursively scan directory for C++ files.
+
+ Args:
+ min_size: Minimum file size in bytes (default 100)
+ max_size: Maximum file size in bytes (default 1MB)
+
+ Returns:
+ List of dictionaries containing file information
+ """
+ cpp_files = []
+
+ logger.info(f"Scanning directory: {self.root_path}")
+
+ for root, dirs, files in os.walk(self.root_path):
+ root_path = Path(root)
+
+ # Remove directories to skip from the list (modifies in-place)
+ dirs[:] = [d for d in dirs if not self.should_skip_dir(root_path / d)]
+
+ for file in files:
+ file_path = root_path / file
+
+ if not self.is_cpp_file(file_path):
+ continue
+
+ try:
+ file_size = file_path.stat().st_size
+
+ # Skip files that are too small or too large
+ if file_size < min_size or file_size > max_size:
+ logger.debug(f"Skipping {file_path}: size {file_size} bytes")
+ continue
+
+ # Read file content
+ with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
+ content = f.read()
+
+ # Calculate relative path from root
+ rel_path = file_path.relative_to(self.root_path)
+
+ cpp_files.append({
+ 'file_path': str(file_path),
+ 'relative_path': str(rel_path),
+ 'file_name': file_path.name,
+ 'extension': file_path.suffix,
+ 'size': file_size,
+ 'content': content,
+ 'line_count': len(content.splitlines()),
+ })
+
+ except (IOError, OSError) as e:
+ logger.warning(f"Error reading file {file_path}: {e}")
+ continue
+
+ logger.info(f"Found {len(cpp_files)} C++ files")
+ return cpp_files
+
+ def organize_by_project(self, files: List[Dict]) -> Dict[str, List[Dict]]:
+ """
+ Organize files by their project directory.
+
+ A project is identified by:
+ 1. Having a CMakeLists.txt or Makefile
+ 2. Being a direct subdirectory of the root
+ 3. Or fallback to parent directory grouping
+
+ Args:
+ files: List of file dictionaries from scan_directory
+
+ Returns:
+ Dictionary mapping project names to lists of files
+ """
+ projects = {}
+
+ # First, try to identify projects by build files
+ project_roots = self._find_project_roots()
+
+ for file_info in files:
+ file_path = Path(file_info['file_path'])
+
+ # Find which project this file belongs to
+ project_name = self._determine_project(file_path, project_roots)
+
+ if project_name not in projects:
+ projects[project_name] = []
+
+ file_info['project'] = project_name
+ projects[project_name].append(file_info)
+
+ logger.info(f"Organized files into {len(projects)} projects")
+ for project, files_list in projects.items():
+ logger.info(f" {project}: {len(files_list)} files")
+
+ return projects
+
+ def _find_project_roots(self) -> Dict[Path, str]:
+ """
+ Find project root directories by looking for build files.
+
+ Returns:
+ Dictionary mapping project root paths to project names
+ """
+ project_roots = {}
+ build_files = {'CMakeLists.txt', 'Makefile', 'makefile', 'build.gradle',
+ 'pom.xml', 'meson.build', 'configure.ac'}
+
+ for root, dirs, files in os.walk(self.root_path):
+ root_path = Path(root)
+
+ # Skip directories we should ignore
+ dirs[:] = [d for d in dirs if not self.should_skip_dir(root_path / d)]
+
+ # Check if this directory has any build files
+ if any(bf in files for bf in build_files):
+ project_name = root_path.name
+ if root_path == self.root_path:
+ project_name = "root"
+ project_roots[root_path] = project_name
+ logger.debug(f"Found project: {project_name} at {root_path}")
+
+ return project_roots
+
+ def _determine_project(self, file_path: Path, project_roots: Dict[Path, str]) -> str:
+ """
+ Determine which project a file belongs to.
+
+ Args:
+ file_path: Path to the file
+ project_roots: Dictionary of known project roots
+
+ Returns:
+ Project name
+ """
+ # Check if file is in any project root
+ for root, name in project_roots.items():
+ if file_path.is_relative_to(root):
+ return name
+
+ # Fallback: use the immediate subdirectory of root
+ try:
+ rel_path = file_path.relative_to(self.root_path)
+ parts = rel_path.parts
+ if len(parts) > 1:
+ return parts[0]
+ else:
+ return "root"
+ except ValueError:
+ return "unknown"
+
+
+def scan_cpp_files(root_path: str, organize: bool = True,
+ min_size: int = 100, max_size: int = 1000000) -> Dict:
+ """
+ Convenience function to scan for C++ files.
+
+ Args:
+ root_path: Root directory to scan
+ organize: Whether to organize by project (default True)
+ min_size: Minimum file size in bytes
+ max_size: Maximum file size in bytes
+
+ Returns:
+ Dictionary with scan results
+ """
+ scanner = CppFileScanner(root_path)
+ files = scanner.scan_directory(min_size=min_size, max_size=max_size)
+
+ result = {
+ 'root_path': str(scanner.root_path),
+ 'total_files': len(files),
+ 'files': files
+ }
+
+ if organize:
+ projects = scanner.organize_by_project(files)
+ result['projects'] = projects
+ result['project_count'] = len(projects)
+
+ return result
diff --git a/cpp-detector/gui.py b/cpp-detector/gui.py
new file mode 100644
index 0000000..c0b9d41
--- /dev/null
+++ b/cpp-detector/gui.py
@@ -0,0 +1,459 @@
+"""
+GUI Application for C++ AI-Generated Code Detection
+Provides a user-friendly interface for scanning and analyzing C++ projects.
+"""
+
+import tkinter as tk
+from tkinter import ttk, filedialog, scrolledtext, messagebox
+import threading
+import json
+from pathlib import Path
+from datetime import datetime
+from loguru import logger
+import sys
+
+from batch_processor import BatchProcessor
+from cpp_detector import GPUManager
+
+
+class DetectorGUI:
+ """Main GUI application for C++ code detection."""
+
+ def __init__(self, root):
+ """
+ Initialize the GUI.
+
+ Args:
+ root: Tkinter root window
+ """
+ self.root = root
+ self.root.title("DetectCodeGPT - C++ AI Code Detector")
+ self.root.geometry("1000x800")
+
+ # Variables
+ self.directory_var = tk.StringVar()
+ self.output_path_var = tk.StringVar(value="results.json")
+ self.base_model_var = tk.StringVar(value="codellama/CodeLlama-7b-hf")
+ self.mask_model_var = tk.StringVar(value="Salesforce/codet5p-770m")
+ self.device_var = tk.StringVar(value="auto")
+ self.n_perturbations_var = tk.IntVar(value=50)
+ self.batch_size_var = tk.IntVar(value=10)
+ self.min_size_var = tk.IntVar(value=100)
+ self.max_size_var = tk.IntVar(value=100000)
+ self.max_files_var = tk.IntVar(value=0) # 0 means no limit
+
+ self.processing = False
+ self.processor = None
+ self.results = None
+
+ # Setup GUI
+ self._create_widgets()
+ self._setup_logging()
+
+ # Display device info
+ self._display_device_info()
+
+ def _create_widgets(self):
+ """Create all GUI widgets."""
+ # Create notebook for tabs
+ self.notebook = ttk.Notebook(self.root)
+ self.notebook.pack(fill='both', expand=True, padx=5, pady=5)
+
+ # Tab 1: Configuration
+ self.config_frame = ttk.Frame(self.notebook)
+ self.notebook.add(self.config_frame, text='Configuration')
+ self._create_config_tab()
+
+ # Tab 2: Processing
+ self.process_frame = ttk.Frame(self.notebook)
+ self.notebook.add(self.process_frame, text='Processing')
+ self._create_process_tab()
+
+ # Tab 3: Results
+ self.results_frame = ttk.Frame(self.notebook)
+ self.notebook.add(self.results_frame, text='Results')
+ self._create_results_tab()
+
+ # Status bar
+ self.status_bar = ttk.Label(self.root, text="Ready", relief=tk.SUNKEN)
+ self.status_bar.pack(side=tk.BOTTOM, fill=tk.X)
+
+ def _create_config_tab(self):
+ """Create configuration tab widgets."""
+ # Directory selection
+ dir_frame = ttk.LabelFrame(self.config_frame, text="Directory Selection", padding=10)
+ dir_frame.pack(fill='x', padx=10, pady=5)
+
+ ttk.Label(dir_frame, text="C++ Project Directory:").grid(row=0, column=0, sticky='w', pady=5)
+ ttk.Entry(dir_frame, textvariable=self.directory_var, width=60).grid(row=0, column=1, padx=5, pady=5)
+ ttk.Button(dir_frame, text="Browse...", command=self._browse_directory).grid(row=0, column=2, padx=5, pady=5)
+
+ ttk.Label(dir_frame, text="Output JSON Path:").grid(row=1, column=0, sticky='w', pady=5)
+ ttk.Entry(dir_frame, textvariable=self.output_path_var, width=60).grid(row=1, column=1, padx=5, pady=5)
+ ttk.Button(dir_frame, text="Browse...", command=self._browse_output).grid(row=1, column=2, padx=5, pady=5)
+
+ # Model configuration
+ model_frame = ttk.LabelFrame(self.config_frame, text="Model Configuration", padding=10)
+ model_frame.pack(fill='x', padx=10, pady=5)
+
+ ttk.Label(model_frame, text="Base Model:").grid(row=0, column=0, sticky='w', pady=5)
+ base_model_combo = ttk.Combobox(model_frame, textvariable=self.base_model_var, width=57)
+ base_model_combo['values'] = [
+ 'codellama/CodeLlama-7b-hf',
+ 'codellama/CodeLlama-13b-hf',
+ 'Salesforce/codegen-2B-mono',
+ 'microsoft/CodeGPT-small-py'
+ ]
+ base_model_combo.grid(row=0, column=1, padx=5, pady=5, columnspan=2)
+
+ ttk.Label(model_frame, text="Mask Filling Model:").grid(row=1, column=0, sticky='w', pady=5)
+ mask_model_combo = ttk.Combobox(model_frame, textvariable=self.mask_model_var, width=57)
+ mask_model_combo['values'] = [
+ 'Salesforce/codet5p-770m',
+ 'Salesforce/codet5-base',
+ 'Salesforce/CodeT5-large'
+ ]
+ mask_model_combo.grid(row=1, column=1, padx=5, pady=5, columnspan=2)
+
+ ttk.Label(model_frame, text="Device:").grid(row=2, column=0, sticky='w', pady=5)
+ device_combo = ttk.Combobox(model_frame, textvariable=self.device_var, width=20)
+ device_combo['values'] = ['auto', 'cuda', 'cpu', 'mps']
+ device_combo.grid(row=2, column=1, sticky='w', padx=5, pady=5)
+
+ # Detection parameters
+ param_frame = ttk.LabelFrame(self.config_frame, text="Detection Parameters", padding=10)
+ param_frame.pack(fill='x', padx=10, pady=5)
+
+ ttk.Label(param_frame, text="Number of Perturbations:").grid(row=0, column=0, sticky='w', pady=5)
+ ttk.Spinbox(param_frame, from_=10, to=200, textvariable=self.n_perturbations_var, width=20).grid(
+ row=0, column=1, sticky='w', padx=5, pady=5)
+
+ ttk.Label(param_frame, text="Batch Size:").grid(row=1, column=0, sticky='w', pady=5)
+ ttk.Spinbox(param_frame, from_=1, to=50, textvariable=self.batch_size_var, width=20).grid(
+ row=1, column=1, sticky='w', padx=5, pady=5)
+
+ ttk.Label(param_frame, text="Min File Size (bytes):").grid(row=2, column=0, sticky='w', pady=5)
+ ttk.Spinbox(param_frame, from_=0, to=10000, textvariable=self.min_size_var, width=20).grid(
+ row=2, column=1, sticky='w', padx=5, pady=5)
+
+ ttk.Label(param_frame, text="Max File Size (bytes):").grid(row=3, column=0, sticky='w', pady=5)
+ ttk.Spinbox(param_frame, from_=1000, to=10000000, textvariable=self.max_size_var, width=20).grid(
+ row=3, column=1, sticky='w', padx=5, pady=5)
+
+ ttk.Label(param_frame, text="Max Files to Process (0=all):").grid(row=4, column=0, sticky='w', pady=5)
+ ttk.Spinbox(param_frame, from_=0, to=10000, textvariable=self.max_files_var, width=20).grid(
+ row=4, column=1, sticky='w', padx=5, pady=5)
+
+ # Device info
+ self.device_info_frame = ttk.LabelFrame(self.config_frame, text="Device Information", padding=10)
+ self.device_info_frame.pack(fill='both', expand=True, padx=10, pady=5)
+
+ self.device_info_text = scrolledtext.ScrolledText(self.device_info_frame, height=6, wrap=tk.WORD)
+ self.device_info_text.pack(fill='both', expand=True)
+
+ def _create_process_tab(self):
+ """Create processing tab widgets."""
+ # Control buttons
+ button_frame = ttk.Frame(self.process_frame)
+ button_frame.pack(fill='x', padx=10, pady=10)
+
+ self.start_button = ttk.Button(button_frame, text="Start Detection", command=self._start_detection,
+ style='Accent.TButton')
+ self.start_button.pack(side=tk.LEFT, padx=5)
+
+ self.stop_button = ttk.Button(button_frame, text="Stop", command=self._stop_detection, state='disabled')
+ self.stop_button.pack(side=tk.LEFT, padx=5)
+
+ ttk.Button(button_frame, text="Clear Log", command=self._clear_log).pack(side=tk.LEFT, padx=5)
+
+ # Progress bar
+ progress_frame = ttk.Frame(self.process_frame)
+ progress_frame.pack(fill='x', padx=10, pady=5)
+
+ ttk.Label(progress_frame, text="Progress:").pack(side=tk.LEFT, padx=5)
+ self.progress_bar = ttk.Progressbar(progress_frame, mode='indeterminate')
+ self.progress_bar.pack(side=tk.LEFT, fill='x', expand=True, padx=5)
+
+ # Log output
+ log_frame = ttk.LabelFrame(self.process_frame, text="Processing Log", padding=5)
+ log_frame.pack(fill='both', expand=True, padx=10, pady=5)
+
+ self.log_text = scrolledtext.ScrolledText(log_frame, wrap=tk.WORD, state='disabled')
+ self.log_text.pack(fill='both', expand=True)
+
+ def _create_results_tab(self):
+ """Create results tab widgets."""
+ # Results control buttons
+ button_frame = ttk.Frame(self.results_frame)
+ button_frame.pack(fill='x', padx=10, pady=10)
+
+ ttk.Button(button_frame, text="Load Results", command=self._load_results).pack(side=tk.LEFT, padx=5)
+ ttk.Button(button_frame, text="Export Report", command=self._export_report).pack(side=tk.LEFT, padx=5)
+ ttk.Button(button_frame, text="Clear", command=self._clear_results).pack(side=tk.LEFT, padx=5)
+
+ # Results display
+ results_display_frame = ttk.LabelFrame(self.results_frame, text="Detection Results", padding=5)
+ results_display_frame.pack(fill='both', expand=True, padx=10, pady=5)
+
+ self.results_text = scrolledtext.ScrolledText(results_display_frame, wrap=tk.WORD, state='disabled')
+ self.results_text.pack(fill='both', expand=True)
+
+ def _browse_directory(self):
+ """Browse for directory to scan."""
+ directory = filedialog.askdirectory(title="Select C++ Project Directory")
+ if directory:
+ self.directory_var.set(directory)
+
+ def _browse_output(self):
+ """Browse for output file path."""
+ filepath = filedialog.asksaveasfilename(
+ title="Save Results As",
+ defaultextension=".json",
+ filetypes=[("JSON files", "*.json"), ("All files", "*.*")]
+ )
+ if filepath:
+ self.output_path_var.set(filepath)
+
+ def _display_device_info(self):
+ """Display device information."""
+ device_info = GPUManager.get_device_info()
+ info_text = f"CUDA Available: {device_info['cuda_available']}\n"
+ info_text += f"MPS Available: {device_info['mps_available']}\n"
+ info_text += f"Device Count: {device_info['device_count']}\n\n"
+
+ if device_info['devices']:
+ info_text += "Available GPUs:\n"
+ for device in device_info['devices']:
+ info_text += f" GPU {device['id']}: {device['name']}\n"
+ info_text += f" Memory: {device['total_memory'] / (1024**3):.2f} GB\n"
+ info_text += f" Capability: {device['capability']}\n"
+ else:
+ info_text += "No GPUs available. Will use CPU.\n"
+
+ self.device_info_text.delete(1.0, tk.END)
+ self.device_info_text.insert(1.0, info_text)
+
+ def _setup_logging(self):
+ """Setup logging to GUI."""
+ # Create a custom handler to redirect logs to GUI
+ class GUILogHandler:
+ def __init__(self, text_widget, root):
+ self.text_widget = text_widget
+ self.root = root
+
+ def write(self, message):
+ if message.strip(): # Avoid empty lines
+ self.text_widget.configure(state='normal')
+                    # loguru messages already end with a newline; avoid doubling it
+                    self.text_widget.insert(tk.END, message.rstrip('\n') + '\n')
+ self.text_widget.see(tk.END)
+ self.text_widget.configure(state='disabled')
+ self.root.update_idletasks()
+
+        self.gui_log_handler = GUILogHandler(self.log_text, self.root)
+        # Register the handler as a loguru sink so library logs reach the GUI
+        logger.add(self.gui_log_handler.write, format="{time:HH:mm:ss} | {level: <8} | {message}")
+
+ def _log_message(self, message):
+ """Log a message to the GUI."""
+ self.log_text.configure(state='normal')
+ self.log_text.insert(tk.END, f"[{datetime.now().strftime('%H:%M:%S')}] {message}\n")
+ self.log_text.see(tk.END)
+ self.log_text.configure(state='disabled')
+ self.root.update_idletasks()
+
+ def _clear_log(self):
+ """Clear the log display."""
+ self.log_text.configure(state='normal')
+ self.log_text.delete(1.0, tk.END)
+ self.log_text.configure(state='disabled')
+
+ def _start_detection(self):
+ """Start the detection process."""
+ # Validate inputs
+ directory = self.directory_var.get()
+ if not directory or not Path(directory).exists():
+ messagebox.showerror("Error", "Please select a valid directory")
+ return
+
+ output_path = self.output_path_var.get()
+ if not output_path:
+ messagebox.showerror("Error", "Please specify an output path")
+ return
+
+ # Update UI state
+ self.processing = True
+ self.start_button.config(state='disabled')
+ self.stop_button.config(state='normal')
+ self.progress_bar.start()
+ self.status_bar.config(text="Processing...")
+
+ # Start processing in a separate thread
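+        # Note: the worker thread appends to the log widget via _log_message;
+        # Tk is not formally thread-safe, so heavy logging may warrant a
+        # queue polled with root.after instead.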
+ thread = threading.Thread(target=self._run_detection, daemon=True)
+ thread.start()
+
+ def _run_detection(self):
+ """Run detection process (called in separate thread)."""
+ try:
+ self._log_message("Starting detection process...")
+
+ # Get device
+ device = None if self.device_var.get() == 'auto' else self.device_var.get()
+
+ # Create processor
+ self.processor = BatchProcessor(
+ base_model_name=self.base_model_var.get(),
+ mask_filling_model_name=self.mask_model_var.get(),
+ device=device,
+ batch_size=self.batch_size_var.get(),
+ n_perturbations=self.n_perturbations_var.get()
+ )
+
+ # Process directory
+ max_files = self.max_files_var.get() if self.max_files_var.get() > 0 else None
+
+ self.results = self.processor.process_directory(
+ root_path=self.directory_var.get(),
+ output_path=self.output_path_var.get(),
+ min_size=self.min_size_var.get(),
+ max_size=self.max_size_var.get(),
+ max_files=max_files
+ )
+
+ self._log_message("Detection complete!")
+ self._log_message(f"Results saved to: {self.output_path_var.get()}")
+
+ # Display results
+ self.root.after(0, self._display_results)
+
+ # Show completion message
+ self.root.after(0, lambda: messagebox.showinfo("Complete", "Detection completed successfully!"))
+
+ except Exception as e:
+ logger.error(f"Error during detection: {e}")
+ self._log_message(f"ERROR: {str(e)}")
+            # Bind the message as a default argument: 'e' is unbound once the
+            # except block exits, so a plain closure would raise NameError here
+            self.root.after(0, lambda msg=str(e): messagebox.showerror("Error", f"Detection failed: {msg}"))
+
+ finally:
+ # Update UI state
+ self.root.after(0, self._finish_detection)
+
+ def _stop_detection(self):
+ """Stop the detection process."""
+ # Note: This is a simple implementation. For proper cancellation,
+ # you'd need to implement cooperative cancellation in the processor
+ self.processing = False
+ self._log_message("Stopping detection...")
+ self._finish_detection()
+
+ def _finish_detection(self):
+ """Clean up after detection finishes."""
+ self.processing = False
+ self.start_button.config(state='normal')
+ self.stop_button.config(state='disabled')
+ self.progress_bar.stop()
+ self.status_bar.config(text="Ready")
+
+ def _display_results(self):
+ """Display results in the results tab."""
+ if not self.results:
+ return
+
+ self.results_text.configure(state='normal')
+ self.results_text.delete(1.0, tk.END)
+
+ # Format and display results
+ summary = self.results.get('summary', {})
+ metadata = self.results.get('metadata', {})
+
+ output = "=" * 80 + "\n"
+ output += "DETECTION RESULTS SUMMARY\n"
+ output += "=" * 80 + "\n\n"
+
+ output += f"Root Path: {metadata.get('root_path', 'N/A')}\n"
+ output += f"Timestamp: {metadata.get('timestamp', 'N/A')}\n"
+ output += f"Processing Time: {metadata.get('processing_time_seconds', 0)} seconds\n"
+ output += f"Device: {metadata.get('detector_config', {}).get('device', 'N/A')}\n\n"
+
+ output += "Summary:\n"
+ output += f" Total Files Analyzed: {summary.get('total_analyzed', 0)}\n"
+ output += f" Likely AI-Generated: {summary.get('likely_ai_generated', 0)}\n"
+ output += f" Possibly AI-Generated: {summary.get('possibly_ai_generated', 0)}\n"
+ output += f" Likely Human-Written: {summary.get('likely_human_written', 0)}\n"
+ output += f" AI Percentage: {summary.get('ai_percentage', 0)}%\n\n"
+
+ # Projects
+ output += "Projects:\n"
+ for project_name, project_results in self.results.get('projects', {}).items():
+ ai_count = sum(1 for r in project_results
+ if r.get('detection', {}).get('prediction') == 'likely_ai_generated')
+ output += f"\n {project_name}: {len(project_results)} files, {ai_count} likely AI-generated\n"
+
+ # List suspicious files
+ for result in project_results:
+ if result.get('detection', {}).get('prediction') == 'likely_ai_generated':
+ score = result.get('detection', {}).get('detectcodegpt_score', 0)
+ output += f" - {result['relative_path']} (score: {score:.4f})\n"
+
+ self.results_text.insert(1.0, output)
+ self.results_text.configure(state='disabled')
+
+ # Switch to results tab
+ self.notebook.select(self.results_frame)
+
+ def _load_results(self):
+ """Load results from a JSON file."""
+ filepath = filedialog.askopenfilename(
+ title="Load Results",
+ filetypes=[("JSON files", "*.json"), ("All files", "*.*")]
+ )
+
+ if filepath:
+ try:
+ with open(filepath, 'r', encoding='utf-8') as f:
+ self.results = json.load(f)
+ self._display_results()
+ messagebox.showinfo("Success", "Results loaded successfully")
+ except Exception as e:
+ messagebox.showerror("Error", f"Failed to load results: {str(e)}")
+
+ def _export_report(self):
+ """Export a text report."""
+ if not self.results:
+ messagebox.showwarning("Warning", "No results to export")
+ return
+
+ filepath = filedialog.asksaveasfilename(
+ title="Export Report",
+ defaultextension=".txt",
+ filetypes=[("Text files", "*.txt"), ("All files", "*.*")]
+ )
+
+ if filepath:
+ try:
+ processor = BatchProcessor()
+ report = processor.generate_report(self.results, output_path=filepath)
+ messagebox.showinfo("Success", f"Report exported to {filepath}")
+ except Exception as e:
+ messagebox.showerror("Error", f"Failed to export report: {str(e)}")
+
+ def _clear_results(self):
+ """Clear the results display."""
+ self.results_text.configure(state='normal')
+ self.results_text.delete(1.0, tk.END)
+ self.results_text.configure(state='disabled')
+ self.results = None
+
+
+def main():
+ """Main entry point for GUI application."""
+ root = tk.Tk()
+
+ # Set style
+ style = ttk.Style()
+ style.theme_use('clam')
+
+ # Create and run GUI
+ app = DetectorGUI(root)
+ root.mainloop()
+
+
+if __name__ == "__main__":
+ main()
diff --git a/cpp-detector/main_app.py b/cpp-detector/main_app.py
new file mode 100755
index 0000000..ac306d8
--- /dev/null
+++ b/cpp-detector/main_app.py
@@ -0,0 +1,220 @@
+#!/usr/bin/env python3
+"""
+DetectCodeGPT C++ Edition - Main Application Entry Point
+Supports both GUI and CLI modes for detecting AI-generated C++ code.
+"""
+
+import sys
+import argparse
+from pathlib import Path
+from loguru import logger
+
+# Configure logger
+logger.remove() # Remove default handler
+logger.add(sys.stderr, format="{time:HH:mm:ss} | {level: <8} | {message}")
+
+
+def run_gui():
+ """Run the GUI application."""
+ try:
+ from gui import main as gui_main
+ logger.info("Starting GUI application...")
+ gui_main()
+ except ImportError as e:
+ logger.error(f"Failed to import GUI module: {e}")
+ logger.error("Make sure tkinter is installed: pip install tk")
+ sys.exit(1)
+
+
+def run_cli(args):
+ """
+ Run the CLI application.
+
+ Args:
+ args: Parsed command-line arguments
+ """
+ from batch_processor import BatchProcessor
+
+ if not args.directory:
+ logger.error("Directory path is required for CLI mode")
+ sys.exit(1)
+
+ directory = Path(args.directory)
+ if not directory.exists():
+ logger.error(f"Directory does not exist: {args.directory}")
+ sys.exit(1)
+
+    # Derive a default output path from the directory name if not specified
+    if not args.output:
+        args.output = f"results_{directory.name}.json"
+
+ logger.info("=" * 80)
+ logger.info("DetectCodeGPT C++ Edition - CLI Mode")
+ logger.info("=" * 80)
+ logger.info(f"Directory: {args.directory}")
+ logger.info(f"Output: {args.output}")
+ logger.info(f"Base Model: {args.base_model}")
+ logger.info(f"Mask Model: {args.mask_model}")
+ logger.info(f"Device: {args.device if args.device else 'auto'}")
+ logger.info(f"Perturbations: {args.n_perturbations}")
+ logger.info(f"Batch Size: {args.batch_size}")
+ logger.info("=" * 80)
+
+ # Create processor
+ processor = BatchProcessor(
+ base_model_name=args.base_model,
+ mask_filling_model_name=args.mask_model,
+ device=args.device,
+ batch_size=args.batch_size,
+ n_perturbations=args.n_perturbations
+ )
+
+ # Process directory
+ try:
+ max_files = args.max_files if args.max_files > 0 else None
+
+ results = processor.process_directory(
+ root_path=args.directory,
+ output_path=args.output,
+ min_size=args.min_size,
+ max_size=args.max_size,
+ max_files=max_files
+ )
+
+ # Display summary
+ logger.info("\n" + "=" * 80)
+ logger.info("DETECTION COMPLETE")
+ logger.info("=" * 80)
+
+ summary = results.get('summary', {})
+ logger.info(f"Total Files Analyzed: {summary.get('total_analyzed', 0)}")
+ logger.info(f"Likely AI-Generated: {summary.get('likely_ai_generated', 0)}")
+ logger.info(f"Possibly AI-Generated: {summary.get('possibly_ai_generated', 0)}")
+ logger.info(f"Likely Human-Written: {summary.get('likely_human_written', 0)}")
+ logger.info(f"AI Percentage: {summary.get('ai_percentage', 0)}%")
+ logger.info(f"\nResults saved to: {args.output}")
+
+ # Generate text report if requested
+ if args.report:
+            processor.generate_report(results, output_path=args.report)
+ logger.info(f"Report saved to: {args.report}")
+
+ logger.info("=" * 80)
+
+ except Exception as e:
+ logger.error(f"Detection failed: {e}")
+ if args.debug:
+ raise
+ sys.exit(1)
+
+
+def show_device_info():
+ """Display information about available devices."""
+ from cpp_detector import GPUManager
+
+ logger.info("=" * 80)
+ logger.info("DEVICE INFORMATION")
+ logger.info("=" * 80)
+
+ device_info = GPUManager.get_device_info()
+
+ logger.info(f"CUDA Available: {device_info['cuda_available']}")
+ logger.info(f"MPS Available: {device_info['mps_available']}")
+ logger.info(f"Device Count: {device_info['device_count']}")
+
+ if device_info['devices']:
+ logger.info("\nAvailable GPUs:")
+ for device in device_info['devices']:
+ logger.info(f" GPU {device['id']}: {device['name']}")
+ logger.info(f" Total Memory: {device['total_memory'] / (1024**3):.2f} GB")
+ logger.info(f" Compute Capability: {device['capability']}")
+ else:
+ logger.info("\nNo GPUs available. Will use CPU for computation.")
+
+ logger.info(f"\nRecommended Device: {GPUManager.get_device()}")
+ logger.info("=" * 80)
+
+
+def main():
+ """Main entry point."""
+ parser = argparse.ArgumentParser(
+ description="DetectCodeGPT C++ Edition - Detect AI-generated C++ code",
+ formatter_class=argparse.RawDescriptionHelpFormatter,
+ epilog="""
+Examples:
+ # Launch GUI
+ python main_app.py
+
+ # CLI mode - scan directory
+ python main_app.py --cli --directory /path/to/cpp/project
+
+ # CLI mode with custom settings
+ python main_app.py --cli --directory /path/to/cpp/project \\
+ --output results.json --n-perturbations 100 --device cuda
+
+ # Show device information
+ python main_app.py --device-info
+ """
+ )
+
+ parser.add_argument('--cli', action='store_true',
+ help='Run in CLI mode instead of GUI')
+
+ parser.add_argument('--directory', '-d', type=str,
+ help='Directory to scan for C++ files (required for CLI mode)')
+
+ parser.add_argument('--output', '-o', type=str,
+                        help='Output JSON file path (default: results_<directory-name>.json)')
+
+ parser.add_argument('--report', '-r', type=str,
+ help='Generate text report at specified path')
+
+ parser.add_argument('--base-model', type=str,
+ default='codellama/CodeLlama-7b-hf',
+ help='Base model for scoring (default: codellama/CodeLlama-7b-hf)')
+
+ parser.add_argument('--mask-model', type=str,
+ default='Salesforce/codet5p-770m',
+ help='Mask filling model (default: Salesforce/codet5p-770m)')
+
+ parser.add_argument('--device', type=str, choices=['cuda', 'cpu', 'mps'],
+ help='Device to use (default: auto-detect)')
+
+ parser.add_argument('--n-perturbations', type=int, default=50,
+ help='Number of perturbations per sample (default: 50)')
+
+ parser.add_argument('--batch-size', type=int, default=10,
+ help='Batch size for processing (default: 10)')
+
+ parser.add_argument('--min-size', type=int, default=100,
+ help='Minimum file size in bytes (default: 100)')
+
+ parser.add_argument('--max-size', type=int, default=100000,
+ help='Maximum file size in bytes (default: 100000)')
+
+ parser.add_argument('--max-files', type=int, default=0,
+ help='Maximum number of files to process, 0 for all (default: 0)')
+
+ parser.add_argument('--device-info', action='store_true',
+ help='Show device information and exit')
+
+ parser.add_argument('--debug', action='store_true',
+ help='Enable debug mode with full tracebacks')
+
+ args = parser.parse_args()
+
+ # Show device info if requested
+ if args.device_info:
+ show_device_info()
+ return
+
+ # Run in appropriate mode
+ if args.cli:
+ run_cli(args)
+ else:
+ run_gui()
+
+
+if __name__ == "__main__":
+ main()
diff --git a/cpp-detector/requirements.txt b/cpp-detector/requirements.txt
new file mode 100644
index 0000000..8c343e1
--- /dev/null
+++ b/cpp-detector/requirements.txt
@@ -0,0 +1,35 @@
+# DetectCodeGPT C++ Edition - Requirements
+# Based on the main project requirements, with minimum versions pinned
+
+# Core deep learning and AI libraries
+torch>=2.0.0
+transformers>=4.30.0
+accelerate>=0.20.0
+datasets>=2.12.0
+openai
+
+# Scientific computing
+numpy>=1.24.0
+scipy>=1.10.0
+scikit-learn>=1.2.0
+
+# Data processing and analysis
+pandas>=2.0.0
+matplotlib>=3.7.0
+seaborn>=0.12.0
+
+# Progress bars and logging
+tqdm>=4.65.0
+loguru>=0.7.0
+
+# Code parsing
+tree-sitter>=0.20.0
+
+# Web requests (for API calls)
+requests>=2.31.0
+
+# GUI support
+# tkinter is usually included with Python, but if not:
+# Linux: sudo apt-get install python3-tk
+# macOS: included with Python
+# Windows: included with Python
diff --git a/requirements.txt b/requirements.txt
index 0af96ed..f95d9fd 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -24,3 +24,9 @@ tree-sitter
# Web requests (for API calls)
requests
+
+# GUI support (for cpp-detector)
+# Note: tkinter is usually included with Python, but listed here for completeness
+# On Linux: sudo apt-get install python3-tk
+# On macOS: included with Python
+# On Windows: included with Python