diff --git a/README.md b/README.md index e517ef6..47f8631 100644 --- a/README.md +++ b/README.md @@ -10,10 +10,32 @@ How can we develop zero-shot detection of machine generated codes? Welcome to the repository for the research paper: **"Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers."** Our paper has been accepted to the 47th International Conference on Software Engineering (**ICSE 2025**). +## NEW: C++ Edition with GUI + +We've created an enhanced version of DetectCodeGPT specifically for analyzing C++ projects! Features include: + +- 🚀 **GPU Acceleration**: Automatic GPU detection (CUDA/MPS) for faster processing +- 🖥️ **User-Friendly GUI**: Intuitive interface for configuration and monitoring +- 📁 **Multi-Project Support**: Scan entire directories with multiple C++ projects +- 📊 **Comprehensive Reporting**: JSON output with detailed scores and human-readable reports +- ⚡ **CLI Mode**: Command-line interface for automation + +**Quick Start:** + +```bash +cd cpp-detector +python main_app.py # Launch GUI +# OR +python main_app.py --cli --directory /path/to/cpp/projects # CLI mode +``` + +See [cpp-detector/README.md](cpp-detector/README.md) for detailed documentation. + ## Table of Contents - [DetectCodeGPT](#detectcodegpt) - [Table of Contents](#table-of-contents) + - [NEW: C++ Edition with GUI](#new-c-edition-with-gui) - [Getting Started](#getting-started) - [Prerequisites](#prerequisites) - [Data Preparation](#data-preparation) diff --git a/cpp-detector/README.md b/cpp-detector/README.md new file mode 100644 index 0000000..db2063b --- /dev/null +++ b/cpp-detector/README.md @@ -0,0 +1,302 @@ +# DetectCodeGPT C++ Edition + +An enhanced version of DetectCodeGPT specifically designed to detect AI-generated C++ code across multiple projects. Features GPU acceleration, a user-friendly GUI, and comprehensive JSON reporting. 
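+
+Beyond the GUI and CLI, the pipeline can also be driven from Python. Below is a minimal sketch using `BatchProcessor` from `batch_processor.py` (run it from inside `cpp-detector/` so the imports resolve; the directory path is a placeholder):
+
+```python
+from batch_processor import BatchProcessor
+
+# Scan every C++ file under the root, score each one, and write the
+# full results to results.json as a side effect.
+processor = BatchProcessor(n_perturbations=50)
+results = processor.process_directory("/path/to/cpp/projects",
+                                      output_path="results.json")
+print(results["summary"])  # counts of likely AI-generated files, etc.
+```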
+ +## Features + +- **Multi-Project Support**: Scan entire directories containing multiple C++ projects +- **GPU Acceleration**: Automatically detects and uses available GPUs (CUDA, MPS) for faster processing +- **User-Friendly GUI**: Intuitive interface for configuration and real-time progress monitoring +- **CLI Mode**: Command-line interface for automation and scripting +- **Comprehensive Output**: JSON results with detailed scores and human-readable reports +- **Flexible Configuration**: Customizable detection parameters and model selection + +## Installation + +### Prerequisites + +- Python 3.9.7 or higher +- CUDA-capable GPU (optional, but recommended for better performance) + +### Install Dependencies + +From the repository root: + +```bash +pip install -r requirements.txt +``` + +Or install specific requirements for the C++ detector: + +```bash +cd cpp-detector +pip install -r requirements.txt +``` + +## Usage + +### GUI Mode (Recommended) + +Launch the graphical interface: + +```bash +cd cpp-detector +python main_app.py +``` + +The GUI provides: +- **Configuration Tab**: Set up directories, models, and detection parameters +- **Processing Tab**: Monitor real-time progress and logs +- **Results Tab**: View detection results and export reports + +### CLI Mode + +For automation or headless environments: + +```bash +# Basic usage +python main_app.py --cli --directory /path/to/cpp/projects + +# With custom settings +python main_app.py --cli \ + --directory /path/to/cpp/projects \ + --output results.json \ + --report report.txt \ + --n-perturbations 100 \ + --device cuda + +# Show available devices +python main_app.py --device-info +``` + +### CLI Options + +``` +--cli Run in CLI mode instead of GUI +--directory, -d Directory to scan for C++ files (required for CLI) +--output, -o Output JSON file path +--report, -r Generate text report at specified path +--base-model Base model for scoring (default: codellama/CodeLlama-7b-hf) +--mask-model Mask filling model (default: Salesforce/codet5p-770m) +--device Device to use: cuda, cpu, or mps (default: auto-detect) +--n-perturbations Number of perturbations per sample (default: 50) +--batch-size Batch size for processing (default: 10) +--min-size Minimum file size in bytes (default: 100) +--max-size Maximum file size in bytes (default: 100000) +--max-files Maximum files to process, 0 for all (default: 0) +--device-info Show device information and exit +--debug Enable debug mode with full tracebacks +``` + +## How It Works + +DetectCodeGPT C++ Edition uses a multi-step detection process: + +1. **File Scanning**: Recursively scans directories for C++ files (.cpp, .hpp, .h, .cc, etc.) +2. **Project Organization**: Groups files by project based on build system files (CMakeLists.txt, Makefile, etc.) +3. **Code Preprocessing**: Normalizes and prepares C++ code for analysis +4. **Perturbation Generation**: Creates multiple perturbed versions of each code sample +5. **Likelihood Scoring**: Computes log-likelihood and rank scores using language models +6. 
**Detection**: Applies DetectCodeGPT algorithm to classify code as AI-generated or human-written + +### Detection Scores + +The tool provides three main detection scores: + +- **DetectCodeGPT Score (NPR)**: Normalized Perturbation Rank - primary metric + - Values > 1.0: Likely AI-generated + - Values 0.95-1.0: Possibly AI-generated + - Values < 0.95: Likely human-written + +- **DetectGPT Score**: Curvature-based detection from original DetectGPT paper +- **LRR Score**: Log-likelihood to Rank Ratio + +## Output Format + +### JSON Output + +The tool generates a comprehensive JSON file with: + +```json +{ + "metadata": { + "root_path": "/path/to/projects", + "timestamp": "2025-11-08T12:00:00", + "processing_time_seconds": 123.45, + "files_processed": 50, + "device_info": {...} + }, + "summary": { + "total_analyzed": 50, + "likely_ai_generated": 5, + "possibly_ai_generated": 3, + "likely_human_written": 42, + "ai_percentage": 10.0 + }, + "projects": { + "project_name": [ + { + "file_path": "/full/path/to/file.cpp", + "relative_path": "src/file.cpp", + "detection": { + "detectcodegpt_score": 1.05, + "prediction": "likely_ai_generated", + "confidence": "high", + ... + } + } + ] + } +} +``` + +### Text Report + +Generate a human-readable summary: + +``` +================================================================================ +C++ AI-Generated Code Detection Report +================================================================================ + +Root Path: /path/to/projects +Timestamp: 2025-11-08T12:00:00 +Processing Time: 123.45 seconds +Device: cuda + +Summary: + Total Files Analyzed: 50 + Likely AI-Generated: 5 (10.0%) + Possibly AI-Generated: 3 + Likely Human-Written: 42 + Errors: 0 + +Projects: + project1: + Files: 25 + Likely AI-Generated: 2 + Suspicious files: + - src/algorithm.cpp (score: 1.0234) + - src/utils.cpp (score: 1.0156) +``` + +## Supported C++ File Types + +- `.cpp` - C++ source files +- `.cc` - C++ source files (alternative extension) +- `.cxx` - C++ source files (alternative extension) +- `.c++` - C++ source files (alternative extension) +- `.hpp` - C++ header files +- `.h` - C/C++ header files +- `.hh` - C++ header files (alternative extension) +- `.hxx` - C++ header files (alternative extension) +- `.h++` - C++ header files (alternative extension) + +## GPU Support + +The tool automatically detects and uses available GPUs: + +- **NVIDIA GPUs**: CUDA support (recommended) +- **Apple Silicon**: Metal Performance Shaders (MPS) +- **CPU Fallback**: Works without GPU but slower + +Check GPU availability: + +```bash +python main_app.py --device-info +``` + +## Model Options + +### Base Models (for scoring) + +- `codellama/CodeLlama-7b-hf` (default, recommended for C++) +- `codellama/CodeLlama-13b-hf` (larger, more accurate, slower) +- `Salesforce/codegen-2B-mono` +- `microsoft/CodeGPT-small-py` + +### Mask Filling Models + +- `Salesforce/codet5p-770m` (default, recommended) +- `Salesforce/codet5-base` +- `Salesforce/CodeT5-large` + +## Performance Tips + +1. **Use GPU**: Significant speedup with CUDA-capable GPU +2. **Adjust Batch Size**: Increase for faster processing (if memory allows) +3. **Reduce Perturbations**: Lower `n-perturbations` for faster but less accurate results +4. **File Size Limits**: Adjust `min-size` and `max-size` to focus on relevant files +5. 
**File Limit**: Use `max-files` for quick testing on large codebases + +## Troubleshooting + +### Out of Memory Errors + +- Reduce `batch-size` +- Reduce `n-perturbations` +- Use a smaller base model +- Process fewer files at once with `max-files` + +### Slow Processing + +- Ensure GPU is being used (check with `--device-info`) +- Increase `batch-size` if memory allows +- Reduce `n-perturbations` + +### Import Errors + +Make sure all dependencies are installed: + +```bash +pip install -r ../requirements.txt +``` + +## Examples + +### Scan a single project + +```bash +python main_app.py --cli --directory ~/my_cpp_project --output results.json +``` + +### Scan with detailed reporting + +```bash +python main_app.py --cli \ + --directory ~/cpp_projects \ + --output results.json \ + --report detailed_report.txt \ + --n-perturbations 100 +``` + +### Quick scan for testing + +```bash +python main_app.py --cli \ + --directory ~/large_codebase \ + --max-files 10 \ + --n-perturbations 25 +``` + +## License + +This project extends DetectCodeGPT and is licensed under the MIT License. See the LICENSE file in the repository root for details. + +## Citation + +If you use this tool in your research, please cite the original DetectCodeGPT paper: + +```bibtex +@inproceedings{shi2025detectcodegpt, + title={Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers}, + author={Shi, Yuling and Zhang, Hongyu and Wan, Chengcheng and Gu, Xiaodong}, + booktitle={Proceedings of the 47th International Conference on Software Engineering (ICSE 2025)}, + year={2025}, + organization={IEEE} +} +``` + +## Acknowledgements + +Based on DetectCodeGPT by Shi et al., with enhancements for C++ code analysis, GPU acceleration, and user interface improvements. diff --git a/cpp-detector/__init__.py b/cpp-detector/__init__.py new file mode 100644 index 0000000..7d8008f --- /dev/null +++ b/cpp-detector/__init__.py @@ -0,0 +1,19 @@ +""" +DetectCodeGPT C++ Edition +AI-generated C++ code detection with GPU acceleration and GUI. +""" + +__version__ = "1.0.0" +__author__ = "DetectCodeGPT Contributors" + +from .cpp_detector import CppCodeDetector, GPUManager +from .file_scanner import CppFileScanner, scan_cpp_files +from .batch_processor import BatchProcessor + +__all__ = [ + 'CppCodeDetector', + 'GPUManager', + 'CppFileScanner', + 'scan_cpp_files', + 'BatchProcessor' +] diff --git a/cpp-detector/batch_processor.py b/cpp-detector/batch_processor.py new file mode 100644 index 0000000..1668606 --- /dev/null +++ b/cpp-detector/batch_processor.py @@ -0,0 +1,323 @@ +""" +Batch Processor and JSON Output Module +Processes multiple C++ files and exports results to JSON. +""" + +import json +import time +from datetime import datetime +from pathlib import Path +from typing import Dict, List, Optional +from loguru import logger +from tqdm import tqdm + +from file_scanner import scan_cpp_files +from cpp_detector import CppCodeDetector, GPUManager + + +class BatchProcessor: + """Processes multiple C++ files and generates reports.""" + + def __init__(self, + base_model_name: str = "codellama/CodeLlama-7b-hf", + mask_filling_model_name: str = "Salesforce/codet5p-770m", + device: Optional[str] = None, + batch_size: int = 10, + n_perturbations: int = 50): + """ + Initialize the batch processor. 
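+        Construction is cheap: the underlying CppCodeDetector defers
+        loading the base and mask-filling models until they are first used.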
+ + Args: + base_model_name: Base model for scoring + mask_filling_model_name: Mask filling model + device: Device to use (None for auto-detect) + batch_size: Batch size for processing + n_perturbations: Number of perturbations per sample + """ + self.detector = CppCodeDetector( + base_model_name=base_model_name, + mask_filling_model_name=mask_filling_model_name, + device=device, + batch_size=batch_size, + n_perturbations=n_perturbations + ) + + def process_directory(self, + root_path: str, + output_path: Optional[str] = None, + min_size: int = 100, + max_size: int = 100000, + max_files: Optional[int] = None) -> Dict: + """ + Process all C++ files in a directory. + + Args: + root_path: Root directory to scan + output_path: Path to save JSON output (optional) + min_size: Minimum file size in bytes + max_size: Maximum file size in bytes + max_files: Maximum number of files to process (None for all) + + Returns: + Dictionary with all results + """ + logger.info(f"Scanning directory: {root_path}") + + # Scan for C++ files + scan_result = scan_cpp_files( + root_path, + organize=True, + min_size=min_size, + max_size=max_size + ) + + if scan_result['total_files'] == 0: + logger.warning("No C++ files found in directory") + return { + 'error': 'No C++ files found', + 'root_path': root_path, + 'timestamp': datetime.now().isoformat() + } + + logger.info(f"Found {scan_result['total_files']} C++ files in {scan_result['project_count']} projects") + + # Load models once + logger.info("Loading detection models...") + self.detector.load_models() + + # Process files + start_time = time.time() + results_by_project = {} + + files_processed = 0 + for project_name, files in scan_result['projects'].items(): + logger.info(f"\nProcessing project: {project_name} ({len(files)} files)") + + project_results = [] + + for file_info in tqdm(files, desc=f"Processing {project_name}"): + if max_files and files_processed >= max_files: + logger.info(f"Reached maximum file limit: {max_files}") + break + + try: + # Detect AI-generated code + detection_result = self.detector.detect_single(file_info['content']) + + # Combine file info and detection result + combined_result = { + 'file_path': file_info['file_path'], + 'relative_path': file_info['relative_path'], + 'file_name': file_info['file_name'], + 'project': project_name, + 'detection': detection_result, + 'timestamp': datetime.now().isoformat() + } + + project_results.append(combined_result) + files_processed += 1 + + except Exception as e: + logger.error(f"Error processing {file_info['file_path']}: {e}") + project_results.append({ + 'file_path': file_info['file_path'], + 'relative_path': file_info['relative_path'], + 'file_name': file_info['file_name'], + 'project': project_name, + 'error': str(e), + 'timestamp': datetime.now().isoformat() + }) + + results_by_project[project_name] = project_results + + if max_files and files_processed >= max_files: + break + + processing_time = time.time() - start_time + + # Compile final results + final_results = self._compile_results( + root_path=root_path, + scan_result=scan_result, + results_by_project=results_by_project, + processing_time=processing_time, + files_processed=files_processed + ) + + # Save to JSON if output path specified + if output_path: + self.save_json(final_results, output_path) + + # Unload models to free memory + self.detector.unload_models() + + return final_results + + def _compile_results(self, + root_path: str, + scan_result: Dict, + results_by_project: Dict, + processing_time: float, + files_processed: int) -> 
Dict:
+        """
+        Compile all results into a structured format.
+
+        Args:
+            root_path: Root directory path
+            scan_result: Results from file scanning
+            results_by_project: Detection results organized by project
+            processing_time: Total processing time
+            files_processed: Number of files processed
+
+        Returns:
+            Compiled results dictionary
+        """
+        # Calculate statistics
+        all_detections = []
+        error_count = 0
+        for project_results in results_by_project.values():
+            for result in project_results:
+                if 'detection' in result:
+                    all_detections.append(result['detection'])
+                elif 'error' in result:
+                    error_count += 1
+
+        ai_generated_count = sum(1 for d in all_detections
+                                 if d.get('prediction') == 'likely_ai_generated')
+        possibly_ai_count = sum(1 for d in all_detections
+                                if d.get('prediction') == 'possibly_ai_generated')
+        human_written_count = sum(1 for d in all_detections
+                                  if d.get('prediction') == 'likely_human_written')
+
+        # Get device info
+        device_info = GPUManager.get_device_info()
+
+        return {
+            'metadata': {
+                'root_path': root_path,
+                'timestamp': datetime.now().isoformat(),
+                'processing_time_seconds': round(processing_time, 2),
+                'files_scanned': scan_result['total_files'],
+                'files_processed': files_processed,
+                'projects_count': len(results_by_project),
+                'device_info': device_info,
+                'detector_config': {
+                    'base_model': self.detector.base_model_name,
+                    'mask_filling_model': self.detector.mask_filling_model_name,
+                    'device': self.detector.device,
+                    'n_perturbations': self.detector.n_perturbations,
+                    'batch_size': self.detector.batch_size
+                }
+            },
+            'summary': {
+                'total_analyzed': len(all_detections),
+                'likely_ai_generated': ai_generated_count,
+                'possibly_ai_generated': possibly_ai_count,
+                'likely_human_written': human_written_count,
+                # Count failed files directly; files_processed only counts
+                # successes, so subtracting would always yield zero.
+                'errors': error_count,
+                'ai_percentage': round(100 * ai_generated_count / len(all_detections), 2)
+                if all_detections else 0,
+            },
+            'projects': results_by_project,
+            'scan_info': {
+                'total_files_found': scan_result['total_files'],
+                'project_count': scan_result['project_count']
+            }
+        }
+
+    @staticmethod
+    def save_json(data: Dict, output_path: str, indent: int = 2):
+        """
+        Save results to JSON file.
+
+        Args:
+            data: Data to save
+            output_path: Output file path
+            indent: JSON indentation level
+        """
+        output_file = Path(output_path)
+        output_file.parent.mkdir(parents=True, exist_ok=True)
+
+        logger.info(f"Saving results to: {output_path}")
+
+        with open(output_file, 'w', encoding='utf-8') as f:
+            json.dump(data, f, indent=indent, ensure_ascii=False)
+
+        logger.info(f"Results saved successfully ({output_file.stat().st_size} bytes)")
+
+    @staticmethod
+    def load_json(input_path: str) -> Dict:
+        """
+        Load results from JSON file.
+
+        Args:
+            input_path: Input file path
+
+        Returns:
+            Loaded data dictionary
+        """
+        logger.info(f"Loading results from: {input_path}")
+
+        with open(input_path, 'r', encoding='utf-8') as f:
+            data = json.load(f)
+
+        return data
+
+    def generate_report(self, results: Dict, output_path: Optional[str] = None) -> str:
+        """
+        Generate a human-readable text report.
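+        The report mirrors the JSON results: run metadata, the overall
+        summary counts, and a per-project breakdown listing files flagged
+        as likely AI-generated with their DetectCodeGPT scores.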
+ + Args: + results: Results dictionary + output_path: Optional path to save report + + Returns: + Report as string + """ + report_lines = [] + report_lines.append("=" * 80) + report_lines.append("C++ AI-Generated Code Detection Report") + report_lines.append("=" * 80) + report_lines.append("") + + # Metadata + metadata = results['metadata'] + report_lines.append(f"Root Path: {metadata['root_path']}") + report_lines.append(f"Timestamp: {metadata['timestamp']}") + report_lines.append(f"Processing Time: {metadata['processing_time_seconds']} seconds") + report_lines.append(f"Device: {metadata['detector_config']['device']}") + report_lines.append("") + + # Summary + summary = results['summary'] + report_lines.append("Summary:") + report_lines.append(f" Total Files Analyzed: {summary['total_analyzed']}") + report_lines.append(f" Likely AI-Generated: {summary['likely_ai_generated']} ({summary['ai_percentage']}%)") + report_lines.append(f" Possibly AI-Generated: {summary['possibly_ai_generated']}") + report_lines.append(f" Likely Human-Written: {summary['likely_human_written']}") + report_lines.append(f" Errors: {summary['errors']}") + report_lines.append("") + + # Projects + report_lines.append("Projects:") + for project_name, project_results in results['projects'].items(): + report_lines.append(f"\n {project_name}:") + + ai_count = sum(1 for r in project_results + if r.get('detection', {}).get('prediction') == 'likely_ai_generated') + report_lines.append(f" Files: {len(project_results)}") + report_lines.append(f" Likely AI-Generated: {ai_count}") + + # List suspicious files + if ai_count > 0: + report_lines.append(" Suspicious files:") + for result in project_results: + if result.get('detection', {}).get('prediction') == 'likely_ai_generated': + score = result['detection'].get('detectcodegpt_score', 0) + report_lines.append(f" - {result['relative_path']} (score: {score:.4f})") + + report_text = "\n".join(report_lines) + + if output_path: + with open(output_path, 'w', encoding='utf-8') as f: + f.write(report_text) + logger.info(f"Report saved to: {output_path}") + + return report_text diff --git a/cpp-detector/cpp_detector.py b/cpp-detector/cpp_detector.py new file mode 100644 index 0000000..df8e912 --- /dev/null +++ b/cpp-detector/cpp_detector.py @@ -0,0 +1,315 @@ +""" +C++ Code Detector Module +Adapts DetectCodeGPT to detect AI-generated C++ code. +""" + +import sys +import os +import torch +import numpy as np +import math +from pathlib import Path +from typing import Dict, List, Optional +from loguru import logger +from tqdm import tqdm +import functools + +# Add parent directory to path to import from code-detection +sys.path.insert(0, str(Path(__file__).parent.parent / "code-detection")) + +from baselines.utils.loadmodel import load_base_model_and_tokenizer, load_mask_filling_model +from baselines.utils.run_baseline import get_roc_metrics +from baselines.loss import get_ll, get_lls +from baselines.rank import get_rank, get_ranks + + +class GPUManager: + """Manages GPU availability and device selection.""" + + @staticmethod + def get_device() -> str: + """ + Detect and return the best available device. 
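+        Preference order is CUDA, then Apple MPS, then CPU fallback.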
+ + Returns: + Device string: 'cuda', 'mps' (for Mac), or 'cpu' + """ + if torch.cuda.is_available(): + gpu_count = torch.cuda.device_count() + logger.info(f"CUDA is available with {gpu_count} GPU(s)") + logger.info(f"GPU Name: {torch.cuda.get_device_name(0)}") + return 'cuda' + elif torch.backends.mps.is_available(): + logger.info("MPS (Metal Performance Shaders) is available") + return 'mps' + else: + logger.warning("No GPU available, using CPU") + return 'cpu' + + @staticmethod + def get_device_info() -> Dict: + """ + Get detailed information about available devices. + + Returns: + Dictionary with device information + """ + info = { + 'cuda_available': torch.cuda.is_available(), + 'mps_available': torch.backends.mps.is_available(), + 'device_count': 0, + 'devices': [] + } + + if torch.cuda.is_available(): + info['device_count'] = torch.cuda.device_count() + for i in range(info['device_count']): + device_info = { + 'id': i, + 'name': torch.cuda.get_device_name(i), + 'total_memory': torch.cuda.get_device_properties(i).total_memory, + 'capability': torch.cuda.get_device_capability(i) + } + info['devices'].append(device_info) + + return info + + +class CppCodeDetector: + """Detector for AI-generated C++ code using DetectCodeGPT.""" + + def __init__(self, + base_model_name: str = "codellama/CodeLlama-7b-hf", + mask_filling_model_name: str = "Salesforce/codet5p-770m", + device: Optional[str] = None, + batch_size: int = 10, + n_perturbations: int = 50, + pct_words_masked: float = 0.5): + """ + Initialize the C++ code detector. + + Args: + base_model_name: Name of the base model for scoring + mask_filling_model_name: Name of the mask filling model + device: Device to use ('cuda', 'mps', 'cpu', or None for auto-detect) + batch_size: Batch size for processing + n_perturbations: Number of perturbations to generate + pct_words_masked: Percentage of words to mask during perturbation + """ + self.base_model_name = base_model_name + self.mask_filling_model_name = mask_filling_model_name + self.device = device if device else GPUManager.get_device() + self.batch_size = batch_size + self.n_perturbations = n_perturbations + self.pct_words_masked = pct_words_masked + + # Model config will be loaded lazily + self.model_config = None + self.models_loaded = False + + logger.info(f"Initialized CppCodeDetector with device: {self.device}") + + def load_models(self): + """Load the models for detection.""" + if self.models_loaded: + return + + logger.info("Loading models...") + + # Create args object similar to main.py + class Args: + pass + + args = Args() + args.mask_filling_model_name = self.mask_filling_model_name + args.base_model_name = self.base_model_name + args.DEVICE = self.device + args.cache_dir = "~/.cache/huggingface/hub" + args.int8 = False + args.half = False + args.base_half = False + args.mask_top_p = 1.0 + args.mask_temperature = 1.0 + args.batch_size = self.batch_size + args.chunk_size = 10 + args.span_length = 2 + args.buffer_size = 1 + args.pct_words_masked = self.pct_words_masked + args.perturb_type = "random-insert-space+newline" + args.n_perturbation_rounds = 1 + + self.args = args + self.model_config = {} + self.model_config['cache_dir'] = args.cache_dir + + # Load mask filling model + logger.info(f"Loading mask filling model: {self.mask_filling_model_name}") + self.model_config = load_mask_filling_model(args, self.mask_filling_model_name, self.model_config) + + # Load base scoring model + logger.info(f"Loading base scoring model: {self.base_model_name}") + # Move mask model to CPU to 
free GPU memory + self.model_config['mask_model'] = self.model_config['mask_model'].cpu() + torch.cuda.empty_cache() + + self.model_config = load_base_model_and_tokenizer(args, self.model_config) + + self.models_loaded = True + logger.info("Models loaded successfully") + + def _preprocess_cpp_code(self, code: str, max_tokens: int = 128) -> str: + """ + Preprocess C++ code for detection. + + Args: + code: C++ source code + max_tokens: Maximum number of tokens to keep + + Returns: + Preprocessed code + """ + # Remove excessive whitespace but preserve structure + lines = code.split('\n') + # Remove empty lines at start and end + while lines and not lines[0].strip(): + lines.pop(0) + while lines and not lines[-1].strip(): + lines.pop() + + code = '\n'.join(lines) + + # Truncate to max_tokens words (rough approximation) + words = code.split() + if len(words) > max_tokens: + code = ' '.join(words[:max_tokens]) + + return code + + def detect_single(self, code: str) -> Dict: + """ + Detect if a single code snippet is AI-generated. + + Args: + code: C++ code to analyze + + Returns: + Dictionary with detection results including scores and probabilities + """ + if not self.models_loaded: + self.load_models() + + # Preprocess code + code = self._preprocess_cpp_code(code) + + # Import perturbation functions + from main import perturb_texts + + # Generate perturbations + perturb_fn = functools.partial(perturb_texts, args=self.args, model_config=self.model_config) + + logger.info(f"Generating {self.n_perturbations} perturbations...") + perturbed_texts = perturb_fn([code for _ in range(self.n_perturbations)]) + + # Calculate log likelihood + logger.info("Computing log likelihood...") + original_ll = get_ll(code, self.args, self.model_config) + + # Calculate log rank + logger.info("Computing log rank...") + original_logrank = get_rank(code, self.args, self.model_config, log=True) + + # Calculate perturbed log likelihoods + logger.info("Computing perturbed log likelihoods...") + perturbed_lls = get_lls(perturbed_texts, self.args, self.model_config) + perturbed_lls_clean = [ll for ll in perturbed_lls if not math.isnan(ll)] + + perturbed_ll_mean = np.mean(perturbed_lls_clean) if perturbed_lls_clean else 0 + perturbed_ll_std = np.std(perturbed_lls_clean) if len(perturbed_lls_clean) > 1 else 1 + + # Calculate perturbed log ranks + logger.info("Computing perturbed log ranks...") + perturbed_logranks = get_ranks(perturbed_texts, self.args, self.model_config, log=True) + perturbed_logranks_clean = [lr for lr in perturbed_logranks if not math.isnan(lr)] + + perturbed_logrank_mean = np.mean(perturbed_logranks_clean) if perturbed_logranks_clean else 0 + + # Calculate detection scores + # DetectGPT score + detectgpt_score = (original_ll - perturbed_ll_mean) / perturbed_ll_std if perturbed_ll_std > 0 else 0 + + # Log Rank Ratio (LRR) + lrr_score = -original_ll / original_logrank if original_logrank != 0 else 0 + + # DetectCodeGPT score (NPR - Normalized Perturbation Rank) + detectcodegpt_score = perturbed_logrank_mean / original_logrank if original_logrank != 0 else 0 + + result = { + 'code_length': len(code), + 'word_count': len(code.split()), + 'line_count': len(code.splitlines()), + 'original_ll': float(original_ll), + 'original_logrank': float(original_logrank), + 'perturbed_ll_mean': float(perturbed_ll_mean), + 'perturbed_ll_std': float(perturbed_ll_std), + 'perturbed_logrank_mean': float(perturbed_logrank_mean), + 'detectgpt_score': float(detectgpt_score), + 'lrr_score': float(lrr_score), + 'detectcodegpt_score': 
float(detectcodegpt_score), + 'n_perturbations': self.n_perturbations, + 'n_valid_perturbations': len(perturbed_lls_clean), + } + + # Add interpretation (higher scores suggest AI-generated) + # These thresholds are rough estimates and should be calibrated + if detectcodegpt_score > 1.0: + result['prediction'] = 'likely_ai_generated' + result['confidence'] = 'high' + elif detectcodegpt_score > 0.95: + result['prediction'] = 'possibly_ai_generated' + result['confidence'] = 'medium' + else: + result['prediction'] = 'likely_human_written' + result['confidence'] = 'medium' + + return result + + def detect_batch(self, codes: List[str], show_progress: bool = True) -> List[Dict]: + """ + Detect AI-generated code for a batch of code snippets. + + Args: + codes: List of C++ code strings + show_progress: Whether to show progress bar + + Returns: + List of detection result dictionaries + """ + results = [] + + iterator = tqdm(codes, desc="Detecting AI-generated code") if show_progress else codes + + for code in iterator: + try: + result = self.detect_single(code) + results.append(result) + except Exception as e: + logger.error(f"Error detecting code: {e}") + results.append({ + 'error': str(e), + 'prediction': 'error', + 'confidence': 'none' + }) + + return results + + def unload_models(self): + """Unload models to free memory.""" + if self.model_config: + if 'mask_model' in self.model_config: + del self.model_config['mask_model'] + if 'base_model' in self.model_config: + del self.model_config['base_model'] + + torch.cuda.empty_cache() + self.models_loaded = False + logger.info("Models unloaded") diff --git a/cpp-detector/file_scanner.py b/cpp-detector/file_scanner.py new file mode 100644 index 0000000..bc418fe --- /dev/null +++ b/cpp-detector/file_scanner.py @@ -0,0 +1,242 @@ +""" +C++ File Scanner Module +Recursively scans directories for C++ source files and organizes them by project. +""" + +import os +from pathlib import Path +from typing import List, Dict, Set +from loguru import logger + + +class CppFileScanner: + """Scanner for finding C++ files in directory structures.""" + + # Common C++ file extensions + CPP_EXTENSIONS = {'.cpp', '.cc', '.cxx', '.c++', '.hpp', '.h', '.hh', '.hxx', '.h++'} + + # Directories to skip during scanning + SKIP_DIRS = { + '__pycache__', '.git', '.svn', '.hg', 'node_modules', + 'build', 'dist', '.venv', 'venv', 'env', '.tox', + 'CMakeFiles', '.cmake', 'Debug', 'Release' + } + + def __init__(self, root_path: str): + """ + Initialize the scanner with a root directory. + + Args: + root_path: Root directory to scan for C++ files + """ + self.root_path = Path(root_path).resolve() + if not self.root_path.exists(): + raise ValueError(f"Root path does not exist: {root_path}") + if not self.root_path.is_dir(): + raise ValueError(f"Root path is not a directory: {root_path}") + + def is_cpp_file(self, file_path: Path) -> bool: + """ + Check if a file is a C++ source file. + + Args: + file_path: Path to check + + Returns: + True if file is a C++ source file + """ + return file_path.suffix.lower() in self.CPP_EXTENSIONS + + def should_skip_dir(self, dir_path: Path) -> bool: + """ + Check if a directory should be skipped during scanning. + + Args: + dir_path: Directory to check + + Returns: + True if directory should be skipped + """ + return dir_path.name in self.SKIP_DIRS or dir_path.name.startswith('.') + + def scan_directory(self, min_size: int = 100, max_size: int = 1000000) -> List[Dict]: + """ + Recursively scan directory for C++ files. 
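+        Directories listed in SKIP_DIRS (build output, VCS metadata,
+        virtual environments) and hidden directories are pruned from the
+        walk; files that cannot be read are skipped with a warning.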
+ + Args: + min_size: Minimum file size in bytes (default 100) + max_size: Maximum file size in bytes (default 1MB) + + Returns: + List of dictionaries containing file information + """ + cpp_files = [] + + logger.info(f"Scanning directory: {self.root_path}") + + for root, dirs, files in os.walk(self.root_path): + root_path = Path(root) + + # Remove directories to skip from the list (modifies in-place) + dirs[:] = [d for d in dirs if not self.should_skip_dir(root_path / d)] + + for file in files: + file_path = root_path / file + + if not self.is_cpp_file(file_path): + continue + + try: + file_size = file_path.stat().st_size + + # Skip files that are too small or too large + if file_size < min_size or file_size > max_size: + logger.debug(f"Skipping {file_path}: size {file_size} bytes") + continue + + # Read file content + with open(file_path, 'r', encoding='utf-8', errors='ignore') as f: + content = f.read() + + # Calculate relative path from root + rel_path = file_path.relative_to(self.root_path) + + cpp_files.append({ + 'file_path': str(file_path), + 'relative_path': str(rel_path), + 'file_name': file_path.name, + 'extension': file_path.suffix, + 'size': file_size, + 'content': content, + 'line_count': len(content.splitlines()), + }) + + except (IOError, OSError) as e: + logger.warning(f"Error reading file {file_path}: {e}") + continue + + logger.info(f"Found {len(cpp_files)} C++ files") + return cpp_files + + def organize_by_project(self, files: List[Dict]) -> Dict[str, List[Dict]]: + """ + Organize files by their project directory. + + A project is identified by: + 1. Having a CMakeLists.txt or Makefile + 2. Being a direct subdirectory of the root + 3. Or fallback to parent directory grouping + + Args: + files: List of file dictionaries from scan_directory + + Returns: + Dictionary mapping project names to lists of files + """ + projects = {} + + # First, try to identify projects by build files + project_roots = self._find_project_roots() + + for file_info in files: + file_path = Path(file_info['file_path']) + + # Find which project this file belongs to + project_name = self._determine_project(file_path, project_roots) + + if project_name not in projects: + projects[project_name] = [] + + file_info['project'] = project_name + projects[project_name].append(file_info) + + logger.info(f"Organized files into {len(projects)} projects") + for project, files_list in projects.items(): + logger.info(f" {project}: {len(files_list)} files") + + return projects + + def _find_project_roots(self) -> Dict[Path, str]: + """ + Find project root directories by looking for build files. + + Returns: + Dictionary mapping project root paths to project names + """ + project_roots = {} + build_files = {'CMakeLists.txt', 'Makefile', 'makefile', 'build.gradle', + 'pom.xml', 'meson.build', 'configure.ac'} + + for root, dirs, files in os.walk(self.root_path): + root_path = Path(root) + + # Skip directories we should ignore + dirs[:] = [d for d in dirs if not self.should_skip_dir(root_path / d)] + + # Check if this directory has any build files + if any(bf in files for bf in build_files): + project_name = root_path.name + if root_path == self.root_path: + project_name = "root" + project_roots[root_path] = project_name + logger.debug(f"Found project: {project_name} at {root_path}") + + return project_roots + + def _determine_project(self, file_path: Path, project_roots: Dict[Path, str]) -> str: + """ + Determine which project a file belongs to. 
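+        The first matching root wins; since roots are collected by a
+        top-down walk, that is the shallowest enclosing project.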
+ + Args: + file_path: Path to the file + project_roots: Dictionary of known project roots + + Returns: + Project name + """ + # Check if file is in any project root + for root, name in project_roots.items(): + if file_path.is_relative_to(root): + return name + + # Fallback: use the immediate subdirectory of root + try: + rel_path = file_path.relative_to(self.root_path) + parts = rel_path.parts + if len(parts) > 1: + return parts[0] + else: + return "root" + except ValueError: + return "unknown" + + +def scan_cpp_files(root_path: str, organize: bool = True, + min_size: int = 100, max_size: int = 1000000) -> Dict: + """ + Convenience function to scan for C++ files. + + Args: + root_path: Root directory to scan + organize: Whether to organize by project (default True) + min_size: Minimum file size in bytes + max_size: Maximum file size in bytes + + Returns: + Dictionary with scan results + """ + scanner = CppFileScanner(root_path) + files = scanner.scan_directory(min_size=min_size, max_size=max_size) + + result = { + 'root_path': str(scanner.root_path), + 'total_files': len(files), + 'files': files + } + + if organize: + projects = scanner.organize_by_project(files) + result['projects'] = projects + result['project_count'] = len(projects) + + return result diff --git a/cpp-detector/gui.py b/cpp-detector/gui.py new file mode 100644 index 0000000..c0b9d41 --- /dev/null +++ b/cpp-detector/gui.py @@ -0,0 +1,459 @@ +""" +GUI Application for C++ AI-Generated Code Detection +Provides a user-friendly interface for scanning and analyzing C++ projects. +""" + +import tkinter as tk +from tkinter import ttk, filedialog, scrolledtext, messagebox +import threading +import json +from pathlib import Path +from datetime import datetime +from loguru import logger +import sys + +from batch_processor import BatchProcessor +from cpp_detector import GPUManager + + +class DetectorGUI: + """Main GUI application for C++ code detection.""" + + def __init__(self, root): + """ + Initialize the GUI. 
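+        Builds the three-tab layout (Configuration, Processing, Results)
+        and displays the detected GPU/CPU information on startup.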
+ + Args: + root: Tkinter root window + """ + self.root = root + self.root.title("DetectCodeGPT - C++ AI Code Detector") + self.root.geometry("1000x800") + + # Variables + self.directory_var = tk.StringVar() + self.output_path_var = tk.StringVar(value="results.json") + self.base_model_var = tk.StringVar(value="codellama/CodeLlama-7b-hf") + self.mask_model_var = tk.StringVar(value="Salesforce/codet5p-770m") + self.device_var = tk.StringVar(value="auto") + self.n_perturbations_var = tk.IntVar(value=50) + self.batch_size_var = tk.IntVar(value=10) + self.min_size_var = tk.IntVar(value=100) + self.max_size_var = tk.IntVar(value=100000) + self.max_files_var = tk.IntVar(value=0) # 0 means no limit + + self.processing = False + self.processor = None + self.results = None + + # Setup GUI + self._create_widgets() + self._setup_logging() + + # Display device info + self._display_device_info() + + def _create_widgets(self): + """Create all GUI widgets.""" + # Create notebook for tabs + self.notebook = ttk.Notebook(self.root) + self.notebook.pack(fill='both', expand=True, padx=5, pady=5) + + # Tab 1: Configuration + self.config_frame = ttk.Frame(self.notebook) + self.notebook.add(self.config_frame, text='Configuration') + self._create_config_tab() + + # Tab 2: Processing + self.process_frame = ttk.Frame(self.notebook) + self.notebook.add(self.process_frame, text='Processing') + self._create_process_tab() + + # Tab 3: Results + self.results_frame = ttk.Frame(self.notebook) + self.notebook.add(self.results_frame, text='Results') + self._create_results_tab() + + # Status bar + self.status_bar = ttk.Label(self.root, text="Ready", relief=tk.SUNKEN) + self.status_bar.pack(side=tk.BOTTOM, fill=tk.X) + + def _create_config_tab(self): + """Create configuration tab widgets.""" + # Directory selection + dir_frame = ttk.LabelFrame(self.config_frame, text="Directory Selection", padding=10) + dir_frame.pack(fill='x', padx=10, pady=5) + + ttk.Label(dir_frame, text="C++ Project Directory:").grid(row=0, column=0, sticky='w', pady=5) + ttk.Entry(dir_frame, textvariable=self.directory_var, width=60).grid(row=0, column=1, padx=5, pady=5) + ttk.Button(dir_frame, text="Browse...", command=self._browse_directory).grid(row=0, column=2, padx=5, pady=5) + + ttk.Label(dir_frame, text="Output JSON Path:").grid(row=1, column=0, sticky='w', pady=5) + ttk.Entry(dir_frame, textvariable=self.output_path_var, width=60).grid(row=1, column=1, padx=5, pady=5) + ttk.Button(dir_frame, text="Browse...", command=self._browse_output).grid(row=1, column=2, padx=5, pady=5) + + # Model configuration + model_frame = ttk.LabelFrame(self.config_frame, text="Model Configuration", padding=10) + model_frame.pack(fill='x', padx=10, pady=5) + + ttk.Label(model_frame, text="Base Model:").grid(row=0, column=0, sticky='w', pady=5) + base_model_combo = ttk.Combobox(model_frame, textvariable=self.base_model_var, width=57) + base_model_combo['values'] = [ + 'codellama/CodeLlama-7b-hf', + 'codellama/CodeLlama-13b-hf', + 'Salesforce/codegen-2B-mono', + 'microsoft/CodeGPT-small-py' + ] + base_model_combo.grid(row=0, column=1, padx=5, pady=5, columnspan=2) + + ttk.Label(model_frame, text="Mask Filling Model:").grid(row=1, column=0, sticky='w', pady=5) + mask_model_combo = ttk.Combobox(model_frame, textvariable=self.mask_model_var, width=57) + mask_model_combo['values'] = [ + 'Salesforce/codet5p-770m', + 'Salesforce/codet5-base', + 'Salesforce/CodeT5-large' + ] + mask_model_combo.grid(row=1, column=1, padx=5, pady=5, columnspan=2) + + ttk.Label(model_frame, 
text="Device:").grid(row=2, column=0, sticky='w', pady=5) + device_combo = ttk.Combobox(model_frame, textvariable=self.device_var, width=20) + device_combo['values'] = ['auto', 'cuda', 'cpu', 'mps'] + device_combo.grid(row=2, column=1, sticky='w', padx=5, pady=5) + + # Detection parameters + param_frame = ttk.LabelFrame(self.config_frame, text="Detection Parameters", padding=10) + param_frame.pack(fill='x', padx=10, pady=5) + + ttk.Label(param_frame, text="Number of Perturbations:").grid(row=0, column=0, sticky='w', pady=5) + ttk.Spinbox(param_frame, from_=10, to=200, textvariable=self.n_perturbations_var, width=20).grid( + row=0, column=1, sticky='w', padx=5, pady=5) + + ttk.Label(param_frame, text="Batch Size:").grid(row=1, column=0, sticky='w', pady=5) + ttk.Spinbox(param_frame, from_=1, to=50, textvariable=self.batch_size_var, width=20).grid( + row=1, column=1, sticky='w', padx=5, pady=5) + + ttk.Label(param_frame, text="Min File Size (bytes):").grid(row=2, column=0, sticky='w', pady=5) + ttk.Spinbox(param_frame, from_=0, to=10000, textvariable=self.min_size_var, width=20).grid( + row=2, column=1, sticky='w', padx=5, pady=5) + + ttk.Label(param_frame, text="Max File Size (bytes):").grid(row=3, column=0, sticky='w', pady=5) + ttk.Spinbox(param_frame, from_=1000, to=10000000, textvariable=self.max_size_var, width=20).grid( + row=3, column=1, sticky='w', padx=5, pady=5) + + ttk.Label(param_frame, text="Max Files to Process (0=all):").grid(row=4, column=0, sticky='w', pady=5) + ttk.Spinbox(param_frame, from_=0, to=10000, textvariable=self.max_files_var, width=20).grid( + row=4, column=1, sticky='w', padx=5, pady=5) + + # Device info + self.device_info_frame = ttk.LabelFrame(self.config_frame, text="Device Information", padding=10) + self.device_info_frame.pack(fill='both', expand=True, padx=10, pady=5) + + self.device_info_text = scrolledtext.ScrolledText(self.device_info_frame, height=6, wrap=tk.WORD) + self.device_info_text.pack(fill='both', expand=True) + + def _create_process_tab(self): + """Create processing tab widgets.""" + # Control buttons + button_frame = ttk.Frame(self.process_frame) + button_frame.pack(fill='x', padx=10, pady=10) + + self.start_button = ttk.Button(button_frame, text="Start Detection", command=self._start_detection, + style='Accent.TButton') + self.start_button.pack(side=tk.LEFT, padx=5) + + self.stop_button = ttk.Button(button_frame, text="Stop", command=self._stop_detection, state='disabled') + self.stop_button.pack(side=tk.LEFT, padx=5) + + ttk.Button(button_frame, text="Clear Log", command=self._clear_log).pack(side=tk.LEFT, padx=5) + + # Progress bar + progress_frame = ttk.Frame(self.process_frame) + progress_frame.pack(fill='x', padx=10, pady=5) + + ttk.Label(progress_frame, text="Progress:").pack(side=tk.LEFT, padx=5) + self.progress_bar = ttk.Progressbar(progress_frame, mode='indeterminate') + self.progress_bar.pack(side=tk.LEFT, fill='x', expand=True, padx=5) + + # Log output + log_frame = ttk.LabelFrame(self.process_frame, text="Processing Log", padding=5) + log_frame.pack(fill='both', expand=True, padx=10, pady=5) + + self.log_text = scrolledtext.ScrolledText(log_frame, wrap=tk.WORD, state='disabled') + self.log_text.pack(fill='both', expand=True) + + def _create_results_tab(self): + """Create results tab widgets.""" + # Results control buttons + button_frame = ttk.Frame(self.results_frame) + button_frame.pack(fill='x', padx=10, pady=10) + + ttk.Button(button_frame, text="Load Results", command=self._load_results).pack(side=tk.LEFT, padx=5) + 
ttk.Button(button_frame, text="Export Report", command=self._export_report).pack(side=tk.LEFT, padx=5) + ttk.Button(button_frame, text="Clear", command=self._clear_results).pack(side=tk.LEFT, padx=5) + + # Results display + results_display_frame = ttk.LabelFrame(self.results_frame, text="Detection Results", padding=5) + results_display_frame.pack(fill='both', expand=True, padx=10, pady=5) + + self.results_text = scrolledtext.ScrolledText(results_display_frame, wrap=tk.WORD, state='disabled') + self.results_text.pack(fill='both', expand=True) + + def _browse_directory(self): + """Browse for directory to scan.""" + directory = filedialog.askdirectory(title="Select C++ Project Directory") + if directory: + self.directory_var.set(directory) + + def _browse_output(self): + """Browse for output file path.""" + filepath = filedialog.asksaveasfilename( + title="Save Results As", + defaultextension=".json", + filetypes=[("JSON files", "*.json"), ("All files", "*.*")] + ) + if filepath: + self.output_path_var.set(filepath) + + def _display_device_info(self): + """Display device information.""" + device_info = GPUManager.get_device_info() + info_text = f"CUDA Available: {device_info['cuda_available']}\n" + info_text += f"MPS Available: {device_info['mps_available']}\n" + info_text += f"Device Count: {device_info['device_count']}\n\n" + + if device_info['devices']: + info_text += "Available GPUs:\n" + for device in device_info['devices']: + info_text += f" GPU {device['id']}: {device['name']}\n" + info_text += f" Memory: {device['total_memory'] / (1024**3):.2f} GB\n" + info_text += f" Capability: {device['capability']}\n" + else: + info_text += "No GPUs available. Will use CPU.\n" + + self.device_info_text.delete(1.0, tk.END) + self.device_info_text.insert(1.0, info_text) + + def _setup_logging(self): + """Setup logging to GUI.""" + # Create a custom handler to redirect logs to GUI + class GUILogHandler: + def __init__(self, text_widget, root): + self.text_widget = text_widget + self.root = root + + def write(self, message): + if message.strip(): # Avoid empty lines + self.text_widget.configure(state='normal') + self.text_widget.insert(tk.END, message + '\n') + self.text_widget.see(tk.END) + self.text_widget.configure(state='disabled') + self.root.update_idletasks() + + self.gui_log_handler = GUILogHandler(self.log_text, self.root) + + def _log_message(self, message): + """Log a message to the GUI.""" + self.log_text.configure(state='normal') + self.log_text.insert(tk.END, f"[{datetime.now().strftime('%H:%M:%S')}] {message}\n") + self.log_text.see(tk.END) + self.log_text.configure(state='disabled') + self.root.update_idletasks() + + def _clear_log(self): + """Clear the log display.""" + self.log_text.configure(state='normal') + self.log_text.delete(1.0, tk.END) + self.log_text.configure(state='disabled') + + def _start_detection(self): + """Start the detection process.""" + # Validate inputs + directory = self.directory_var.get() + if not directory or not Path(directory).exists(): + messagebox.showerror("Error", "Please select a valid directory") + return + + output_path = self.output_path_var.get() + if not output_path: + messagebox.showerror("Error", "Please specify an output path") + return + + # Update UI state + self.processing = True + self.start_button.config(state='disabled') + self.stop_button.config(state='normal') + self.progress_bar.start() + self.status_bar.config(text="Processing...") + + # Start processing in a separate thread + thread = threading.Thread(target=self._run_detection, 
daemon=True)
+        thread.start()
+
+    def _run_detection(self):
+        """Run detection process (called in separate thread)."""
+        try:
+            self._log_message("Starting detection process...")
+
+            # Get device
+            device = None if self.device_var.get() == 'auto' else self.device_var.get()
+
+            # Create processor
+            self.processor = BatchProcessor(
+                base_model_name=self.base_model_var.get(),
+                mask_filling_model_name=self.mask_model_var.get(),
+                device=device,
+                batch_size=self.batch_size_var.get(),
+                n_perturbations=self.n_perturbations_var.get()
+            )
+
+            # Process directory
+            max_files = self.max_files_var.get() if self.max_files_var.get() > 0 else None
+
+            self.results = self.processor.process_directory(
+                root_path=self.directory_var.get(),
+                output_path=self.output_path_var.get(),
+                min_size=self.min_size_var.get(),
+                max_size=self.max_size_var.get(),
+                max_files=max_files
+            )
+
+            self._log_message("Detection complete!")
+            self._log_message(f"Results saved to: {self.output_path_var.get()}")
+
+            # Display results
+            self.root.after(0, self._display_results)
+
+            # Show completion message
+            self.root.after(0, lambda: messagebox.showinfo("Complete", "Detection completed successfully!"))
+
+        except Exception as e:
+            # Capture the message now: Python clears the exception variable
+            # when the except block exits, so a lambda closing over `e` would
+            # raise NameError by the time the Tk event loop invokes it.
+            error_message = str(e)
+            logger.error(f"Error during detection: {error_message}")
+            self._log_message(f"ERROR: {error_message}")
+            self.root.after(0, lambda: messagebox.showerror("Error", f"Detection failed: {error_message}"))
+
+        finally:
+            # Update UI state
+            self.root.after(0, self._finish_detection)
+
+    def _stop_detection(self):
+        """Stop the detection process."""
+        # Note: This is a simple implementation. For proper cancellation,
+        # you'd need to implement cooperative cancellation in the processor
+        self.processing = False
+        self._log_message("Stopping detection...")
+        self._finish_detection()
+
+    def _finish_detection(self):
+        """Clean up after detection finishes."""
+        self.processing = False
+        self.start_button.config(state='normal')
+        self.stop_button.config(state='disabled')
+        self.progress_bar.stop()
+        self.status_bar.config(text="Ready")
+
+    def _display_results(self):
+        """Display results in the results tab."""
+        if not self.results:
+            return
+
+        self.results_text.configure(state='normal')
+        self.results_text.delete(1.0, tk.END)
+
+        # Format and display results
+        summary = self.results.get('summary', {})
+        metadata = self.results.get('metadata', {})
+
+        output = "=" * 80 + "\n"
+        output += "DETECTION RESULTS SUMMARY\n"
+        output += "=" * 80 + "\n\n"
+
+        output += f"Root Path: {metadata.get('root_path', 'N/A')}\n"
+        output += f"Timestamp: {metadata.get('timestamp', 'N/A')}\n"
+        output += f"Processing Time: {metadata.get('processing_time_seconds', 0)} seconds\n"
+        output += f"Device: {metadata.get('detector_config', {}).get('device', 'N/A')}\n\n"
+
+        output += "Summary:\n"
+        output += f"  Total Files Analyzed: {summary.get('total_analyzed', 0)}\n"
+        output += f"  Likely AI-Generated: {summary.get('likely_ai_generated', 0)}\n"
+        output += f"  Possibly AI-Generated: {summary.get('possibly_ai_generated', 0)}\n"
+        output += f"  Likely Human-Written: {summary.get('likely_human_written', 0)}\n"
+        output += f"  AI Percentage: {summary.get('ai_percentage', 0)}%\n\n"
+
+        # Projects
+        output += "Projects:\n"
+        for project_name, project_results in self.results.get('projects', {}).items():
+            ai_count = sum(1 for r in project_results
+                           if r.get('detection', {}).get('prediction') == 'likely_ai_generated')
+            output += f"\n  {project_name}: {len(project_results)} files, {ai_count} likely AI-generated\n"
+
+            # List suspicious files
+            for result in project_results:
+                if result.get('detection', {}).get('prediction') == 'likely_ai_generated':
+                    score = result.get('detection', {}).get('detectcodegpt_score', 0)
+                    output += f"    - {result['relative_path']} (score: {score:.4f})\n"
+
+        self.results_text.insert(1.0, output)
+        self.results_text.configure(state='disabled')
+
+        # Switch to results tab
+        self.notebook.select(self.results_frame)
+
+    def _load_results(self):
+        """Load results from a JSON file."""
+        filepath = filedialog.askopenfilename(
+            title="Load Results",
+            filetypes=[("JSON files", "*.json"), ("All files", "*.*")]
+        )
+
+        if filepath:
+            try:
+                with open(filepath, 'r', encoding='utf-8') as f:
+                    self.results = json.load(f)
+                self._display_results()
+                messagebox.showinfo("Success", "Results loaded successfully")
+            except Exception as e:
+                messagebox.showerror("Error", f"Failed to load results: {str(e)}")
+
+    def _export_report(self):
+        """Export a text report."""
+        if not self.results:
+            messagebox.showwarning("Warning", "No results to export")
+            return
+
+        filepath = filedialog.asksaveasfilename(
+            title="Export Report",
+            defaultextension=".txt",
+            filetypes=[("Text files", "*.txt"), ("All files", "*.*")]
+        )
+
+        if filepath:
+            try:
+                # A default processor suffices here: generate_report only
+                # formats the already-loaded results dictionary.
+                processor = BatchProcessor()
+                processor.generate_report(self.results, output_path=filepath)
+                messagebox.showinfo("Success", f"Report exported to {filepath}")
+            except Exception as e:
+                messagebox.showerror("Error", f"Failed to export report: {str(e)}")
+
+    def _clear_results(self):
+        """Clear the results display."""
+        self.results_text.configure(state='normal')
+        self.results_text.delete(1.0, tk.END)
+        self.results_text.configure(state='disabled')
+        self.results = None
+
+
+def main():
+    """Main entry point for GUI application."""
+    root = tk.Tk()
+
+    # Set style
+    style = ttk.Style()
+    style.theme_use('clam')
+
+    # Create and run GUI
+    app = DetectorGUI(root)
+    root.mainloop()
+
+
+if __name__ == "__main__":
+    main()
diff --git a/cpp-detector/main_app.py b/cpp-detector/main_app.py
new file mode 100755
index 0000000..ac306d8
--- /dev/null
+++ b/cpp-detector/main_app.py
@@ -0,0 +1,220 @@
+#!/usr/bin/env python3
+"""
+DetectCodeGPT C++ Edition - Main Application Entry Point
+Supports both GUI and CLI modes for detecting AI-generated C++ code.
+"""
+
+import sys
+import argparse
+from pathlib import Path
+from loguru import logger
+
+# Configure logger
+logger.remove()  # Remove default handler
+logger.add(sys.stderr, format="{time:HH:mm:ss} | {level: <8} | {message}")
+
+
+def run_gui():
+    """Run the GUI application."""
+    try:
+        from gui import main as gui_main
+        logger.info("Starting GUI application...")
+        gui_main()
+    except ImportError as e:
+        logger.error(f"Failed to import GUI module: {e}")
+        logger.error("Make sure tkinter is available (Linux: sudo apt-get install python3-tk; "
+                     "it ships with Python on macOS and Windows)")
+        sys.exit(1)
+
+
+def run_cli(args):
+    """
+    Run the CLI application.
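+    Validates the target directory, runs a BatchProcessor over it, logs a
+    summary of the detections, and optionally writes a text report.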
+ + Args: + args: Parsed command-line arguments + """ + from batch_processor import BatchProcessor + + if not args.directory: + logger.error("Directory path is required for CLI mode") + sys.exit(1) + + directory = Path(args.directory) + if not directory.exists(): + logger.error(f"Directory does not exist: {args.directory}") + sys.exit(1) + + # Create output path if not specified + if not args.output: + timestamp = Path(directory).name + args.output = f"results_{timestamp}.json" + + logger.info("=" * 80) + logger.info("DetectCodeGPT C++ Edition - CLI Mode") + logger.info("=" * 80) + logger.info(f"Directory: {args.directory}") + logger.info(f"Output: {args.output}") + logger.info(f"Base Model: {args.base_model}") + logger.info(f"Mask Model: {args.mask_model}") + logger.info(f"Device: {args.device if args.device else 'auto'}") + logger.info(f"Perturbations: {args.n_perturbations}") + logger.info(f"Batch Size: {args.batch_size}") + logger.info("=" * 80) + + # Create processor + processor = BatchProcessor( + base_model_name=args.base_model, + mask_filling_model_name=args.mask_model, + device=args.device, + batch_size=args.batch_size, + n_perturbations=args.n_perturbations + ) + + # Process directory + try: + max_files = args.max_files if args.max_files > 0 else None + + results = processor.process_directory( + root_path=args.directory, + output_path=args.output, + min_size=args.min_size, + max_size=args.max_size, + max_files=max_files + ) + + # Display summary + logger.info("\n" + "=" * 80) + logger.info("DETECTION COMPLETE") + logger.info("=" * 80) + + summary = results.get('summary', {}) + logger.info(f"Total Files Analyzed: {summary.get('total_analyzed', 0)}") + logger.info(f"Likely AI-Generated: {summary.get('likely_ai_generated', 0)}") + logger.info(f"Possibly AI-Generated: {summary.get('possibly_ai_generated', 0)}") + logger.info(f"Likely Human-Written: {summary.get('likely_human_written', 0)}") + logger.info(f"AI Percentage: {summary.get('ai_percentage', 0)}%") + logger.info(f"\nResults saved to: {args.output}") + + # Generate text report if requested + if args.report: + report = processor.generate_report(results, output_path=args.report) + logger.info(f"Report saved to: {args.report}") + + logger.info("=" * 80) + + except Exception as e: + logger.error(f"Detection failed: {e}") + if args.debug: + raise + sys.exit(1) + + +def show_device_info(): + """Display information about available devices.""" + from cpp_detector import GPUManager + + logger.info("=" * 80) + logger.info("DEVICE INFORMATION") + logger.info("=" * 80) + + device_info = GPUManager.get_device_info() + + logger.info(f"CUDA Available: {device_info['cuda_available']}") + logger.info(f"MPS Available: {device_info['mps_available']}") + logger.info(f"Device Count: {device_info['device_count']}") + + if device_info['devices']: + logger.info("\nAvailable GPUs:") + for device in device_info['devices']: + logger.info(f" GPU {device['id']}: {device['name']}") + logger.info(f" Total Memory: {device['total_memory'] / (1024**3):.2f} GB") + logger.info(f" Compute Capability: {device['capability']}") + else: + logger.info("\nNo GPUs available. 
Will use CPU for computation.")
+
+    logger.info(f"\nRecommended Device: {GPUManager.get_device()}")
+    logger.info("=" * 80)
+
+
+def main():
+    """Main entry point."""
+    parser = argparse.ArgumentParser(
+        description="DetectCodeGPT C++ Edition - Detect AI-generated C++ code",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Examples:
+  # Launch GUI
+  python main_app.py
+
+  # CLI mode - scan directory
+  python main_app.py --cli --directory /path/to/cpp/project
+
+  # CLI mode with custom settings
+  python main_app.py --cli --directory /path/to/cpp/project \\
+      --output results.json --n-perturbations 100 --device cuda
+
+  # Show device information
+  python main_app.py --device-info
+    """
+    )
+
+    parser.add_argument('--cli', action='store_true',
+                        help='Run in CLI mode instead of GUI')
+
+    parser.add_argument('--directory', '-d', type=str,
+                        help='Directory to scan for C++ files (required for CLI mode)')
+
+    parser.add_argument('--output', '-o', type=str,
+                        help='Output JSON file path (default: results_<directory-name>.json)')
+
+    parser.add_argument('--report', '-r', type=str,
+                        help='Generate text report at specified path')
+
+    parser.add_argument('--base-model', type=str,
+                        default='codellama/CodeLlama-7b-hf',
+                        help='Base model for scoring (default: codellama/CodeLlama-7b-hf)')
+
+    parser.add_argument('--mask-model', type=str,
+                        default='Salesforce/codet5p-770m',
+                        help='Mask filling model (default: Salesforce/codet5p-770m)')
+
+    parser.add_argument('--device', type=str, choices=['cuda', 'cpu', 'mps'],
+                        help='Device to use (default: auto-detect)')
+
+    parser.add_argument('--n-perturbations', type=int, default=50,
+                        help='Number of perturbations per sample (default: 50)')
+
+    parser.add_argument('--batch-size', type=int, default=10,
+                        help='Batch size for processing (default: 10)')
+
+    parser.add_argument('--min-size', type=int, default=100,
+                        help='Minimum file size in bytes (default: 100)')
+
+    parser.add_argument('--max-size', type=int, default=100000,
+                        help='Maximum file size in bytes (default: 100000)')
+
+    parser.add_argument('--max-files', type=int, default=0,
+                        help='Maximum number of files to process, 0 for all (default: 0)')
+
+    parser.add_argument('--device-info', action='store_true',
+                        help='Show device information and exit')
+
+    parser.add_argument('--debug', action='store_true',
+                        help='Enable debug mode with full tracebacks')
+
+    args = parser.parse_args()
+
+    # Show device info if requested
+    if args.device_info:
+        show_device_info()
+        return
+
+    # Run in appropriate mode
+    if args.cli:
+        run_cli(args)
+    else:
+        run_gui()
+
+
+if __name__ == "__main__":
+    main()
diff --git
a/requirements.txt b/requirements.txt index 0af96ed..f95d9fd 100644 --- a/requirements.txt +++ b/requirements.txt @@ -24,3 +24,9 @@ tree-sitter # Web requests (for API calls) requests + +# GUI support (for cpp-detector) +# Note: tkinter is usually included with Python, but listed here for completeness +# On Linux: sudo apt-get install python3-tk +# On macOS: included with Python +# On Windows: included with Python