22 changes: 22 additions & 0 deletions README.md
@@ -10,10 +10,32 @@

How can we develop zero-shot detection of machine-generated code? Welcome to the repository for the research paper: **"Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers."** Our paper has been accepted to the 47th International Conference on Software Engineering (**ICSE 2025**).

## NEW: C++ Edition with GUI

We've created an enhanced version of DetectCodeGPT specifically for analyzing C++ projects! Features include:

- 🚀 **GPU Acceleration**: Automatic GPU detection (CUDA/MPS) for faster processing
- 🖥️ **User-Friendly GUI**: Intuitive interface for configuration and monitoring
- 📁 **Multi-Project Support**: Scan entire directories with multiple C++ projects
- 📊 **Comprehensive Reporting**: JSON output with detailed scores and human-readable reports
- ⚡ **CLI Mode**: Command-line interface for automation

**Quick Start:**

```bash
cd cpp-detector
python main_app.py # Launch GUI
# OR
python main_app.py --cli --directory /path/to/cpp/projects # CLI mode
```

See [cpp-detector/README.md](cpp-detector/README.md) for detailed documentation.

## Table of Contents

- [DetectCodeGPT](#detectcodegpt)
- [Table of Contents](#table-of-contents)
- [NEW: C++ Edition with GUI](#new-c-edition-with-gui)
- [Getting Started](#getting-started)
- [Prerequisites](#prerequisites)
- [Data Preparation](#data-preparation)
302 changes: 302 additions & 0 deletions cpp-detector/README.md
@@ -0,0 +1,302 @@
# DetectCodeGPT C++ Edition

An enhanced version of DetectCodeGPT designed to detect AI-generated C++ code across multiple projects. It features GPU acceleration, a user-friendly GUI, and comprehensive JSON reporting.

## Features

- **Multi-Project Support**: Scan entire directories containing multiple C++ projects
- **GPU Acceleration**: Automatically detects and uses available GPUs (CUDA, MPS) for faster processing
- **User-Friendly GUI**: Intuitive interface for configuration and real-time progress monitoring
- **CLI Mode**: Command-line interface for automation and scripting
- **Comprehensive Output**: JSON results with detailed scores and human-readable reports
- **Flexible Configuration**: Customizable detection parameters and model selection

## Installation

### Prerequisites

- Python 3.9.7 or higher
- CUDA-capable GPU (optional, but recommended for better performance)

### Install Dependencies

From the repository root:

```bash
pip install -r requirements.txt
```

Or install the requirements specific to the C++ detector:

```bash
cd cpp-detector
pip install -r requirements.txt
```

## Usage

### GUI Mode (Recommended)

Launch the graphical interface:

```bash
cd cpp-detector
python main_app.py
```

The GUI provides:
- **Configuration Tab**: Set up directories, models, and detection parameters
- **Processing Tab**: Monitor real-time progress and logs
- **Results Tab**: View detection results and export reports

### CLI Mode

For automation or headless environments:

```bash
# Basic usage
python main_app.py --cli --directory /path/to/cpp/projects

# With custom settings
python main_app.py --cli \
  --directory /path/to/cpp/projects \
  --output results.json \
  --report report.txt \
  --n-perturbations 100 \
  --device cuda

# Show available devices
python main_app.py --device-info
```

### CLI Options

```
--cli               Run in CLI mode instead of GUI
--directory, -d     Directory to scan for C++ files (required for CLI)
--output, -o        Output JSON file path
--report, -r        Generate text report at specified path
--base-model        Base model for scoring (default: codellama/CodeLlama-7b-hf)
--mask-model        Mask filling model (default: Salesforce/codet5p-770m)
--device            Device to use: cuda, cpu, or mps (default: auto-detect)
--n-perturbations   Number of perturbations per sample (default: 50)
--batch-size        Batch size for processing (default: 10)
--min-size          Minimum file size in bytes (default: 100)
--max-size          Maximum file size in bytes (default: 100000)
--max-files         Maximum files to process, 0 for all (default: 0)
--device-info       Show device information and exit
--debug             Enable debug mode with full tracebacks
```

## How It Works

DetectCodeGPT C++ Edition uses a multi-step detection process:

1. **File Scanning**: Recursively scans directories for C++ files (.cpp, .hpp, .h, .cc, etc.)
2. **Project Organization**: Groups files by project based on build system files (CMakeLists.txt, Makefile, etc.)
3. **Code Preprocessing**: Normalizes and prepares C++ code for analysis
4. **Perturbation Generation**: Creates multiple perturbed versions of each code sample
5. **Likelihood Scoring**: Computes log-likelihood and rank scores using language models
6. **Detection**: Applies the DetectCodeGPT algorithm to classify code as AI-generated or human-written
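
As a rough illustration of step 2, the sketch below groups files by the nearest ancestor directory that contains a build-system file. It is not the tool's actual implementation: the names `BUILD_MARKERS`, `find_project_root`, and `group_by_project` are hypothetical, the marker list is limited to the two files named above, and falling back to the scan root when no marker is found is an assumption.

```python
from collections import defaultdict
from pathlib import Path

# Assumed marker files; the tool's real list may be longer.
BUILD_MARKERS = ("CMakeLists.txt", "Makefile")


def find_project_root(file_path: Path, scan_root: Path) -> Path:
    """Return the nearest ancestor (inside scan_root) that holds a build marker."""
    scan_root = scan_root.resolve()
    current = file_path.resolve().parent
    while True:
        if any((current / marker).is_file() for marker in BUILD_MARKERS):
            return current
        # Stop at the scan root (or the filesystem root) and fall back to it.
        if current == scan_root or current == current.parent:
            return scan_root
        current = current.parent


def group_by_project(files, scan_root: Path) -> dict:
    """Map project directory name -> sorted paths relative to the scan root.

    Assumes every file lives somewhere under scan_root.
    """
    scan_root = scan_root.resolve()
    groups = defaultdict(list)
    for f in files:
        resolved = Path(f).resolve()
        project = find_project_root(resolved, scan_root)
        groups[project.name].append(resolved.relative_to(scan_root).as_posix())
    return {name: sorted(paths) for name, paths in groups.items()}
```

Keying groups by directory name keeps the output close to the `projects` mapping in the JSON report, at the cost of merging two projects that happen to share a directory name.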

### Detection Scores

The tool provides three main detection scores:

- **DetectCodeGPT Score (NPR)**: Normalized Perturbed Log-Rank, the primary metric
  - Values > 1.0: Likely AI-generated
  - Values 0.95-1.0: Possibly AI-generated
  - Values < 0.95: Likely human-written

- **DetectGPT Score**: Curvature-based detection from the original DetectGPT paper
- **LRR Score**: Log-Likelihood Log-Rank Ratio
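
To make the thresholds concrete, here is a minimal sketch of how an NPR-style score could be computed from precomputed log-rank values and then mapped to a label. It assumes the common NPR formulation (mean log-rank of the perturbed samples divided by the log-rank of the original); the tool's exact computation may differ. The function names are hypothetical, and treating the 0.95 and 1.0 boundaries as shown is an assumption.

```python
def npr_score(original_log_rank: float, perturbed_log_ranks: list[float]) -> float:
    """Mean log-rank of the perturbed samples divided by the original's log-rank."""
    if not perturbed_log_ranks:
        raise ValueError("need at least one perturbed sample")
    if original_log_rank <= 0:
        raise ValueError("original log-rank must be positive")
    mean_perturbed = sum(perturbed_log_ranks) / len(perturbed_log_ranks)
    return mean_perturbed / original_log_rank


def classify(score: float) -> str:
    """Map a score to a label using the thresholds listed above.

    Boundary handling (exactly 0.95 or 1.0 counts as "possibly") is an assumption.
    """
    if score > 1.0:
        return "likely_ai_generated"
    if score >= 0.95:
        return "possibly_ai_generated"
    return "likely_human_written"


# Example: perturbation raises the average log-rank by 5%.
score = npr_score(2.0, [2.2, 2.0])
print(classify(score))
```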

## Output Format

### JSON Output

The tool writes a JSON file with the following structure:

```json
{
  "metadata": {
    "root_path": "/path/to/projects",
    "timestamp": "2025-11-08T12:00:00",
    "processing_time_seconds": 123.45,
    "files_processed": 50,
    "device_info": {...}
  },
  "summary": {
    "total_analyzed": 50,
    "likely_ai_generated": 5,
    "possibly_ai_generated": 3,
    "likely_human_written": 42,
    "ai_percentage": 10.0
  },
  "projects": {
    "project_name": [
      {
        "file_path": "/full/path/to/file.cpp",
        "relative_path": "src/file.cpp",
        "detection": {
          "detectcodegpt_score": 1.05,
          "prediction": "likely_ai_generated",
          "confidence": "high",
          ...
        }
      }
    ]
  }
}
```
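
A results file with this shape can be post-processed in a few lines. The sketch below relies only on the keys shown in the example above (`projects`, `relative_path`, `detection.prediction`, `detection.detectcodegpt_score`); the helper names `load_results` and `flagged_files` are hypothetical, and real output may contain additional fields.

```python
import json


def load_results(path: str) -> dict:
    """Read a results file produced with --output."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def flagged_files(results: dict, label: str = "likely_ai_generated"):
    """Yield (project, relative_path, score) for files whose prediction equals label."""
    for project, entries in results.get("projects", {}).items():
        for entry in entries:
            detection = entry.get("detection", {})
            if detection.get("prediction") == label:
                yield (
                    project,
                    entry.get("relative_path"),
                    detection.get("detectcodegpt_score"),
                )


# In-memory sample mirroring the structure documented above.
sample = {
    "projects": {
        "project_name": [
            {
                "relative_path": "src/file.cpp",
                "detection": {
                    "detectcodegpt_score": 1.05,
                    "prediction": "likely_ai_generated",
                },
            }
        ]
    }
}

for project, path, score in flagged_files(sample):
    print(f"{project}: {path} (score: {score:.4f})")
```

For a real run, replace `sample` with `load_results("results.json")`.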

### Text Report

With `--report`, the tool also writes a human-readable summary:

```
================================================================================
C++ AI-Generated Code Detection Report
================================================================================

Root Path: /path/to/projects
Timestamp: 2025-11-08T12:00:00
Processing Time: 123.45 seconds
Device: cuda

Summary:
  Total Files Analyzed: 50
  Likely AI-Generated: 5 (10.0%)
  Possibly AI-Generated: 3
  Likely Human-Written: 42
  Errors: 0

Projects:
  project1:
    Files: 25
    Likely AI-Generated: 2
    Suspicious files:
      - src/algorithm.cpp (score: 1.0234)
      - src/utils.cpp (score: 1.0156)
```

## Supported C++ File Types

- `.cpp` - C++ source files
- `.cc` - C++ source files (alternative extension)
- `.cxx` - C++ source files (alternative extension)
- `.c++` - C++ source files (alternative extension)
- `.hpp` - C++ header files
- `.h` - C/C++ header files
- `.hh` - C++ header files (alternative extension)
- `.hxx` - C++ header files (alternative extension)
- `.h++` - C++ header files (alternative extension)
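
For readers who want to pre-filter files before running the tool, the list above translates into a simple suffix check. This is a sketch rather than the scanner's actual code: `CPP_EXTENSIONS`, `is_cpp_file`, and `within_size_limits` are hypothetical names, case-insensitive matching is an assumption, and so are the inclusive size bounds (the defaults mirror the documented `--min-size` and `--max-size` values).

```python
from pathlib import Path

# Extensions copied from the list above.
CPP_EXTENSIONS = frozenset(
    {".cpp", ".cc", ".cxx", ".c++", ".hpp", ".h", ".hh", ".hxx", ".h++"}
)


def is_cpp_file(path) -> bool:
    """True if the path's final suffix is one of the listed extensions.

    Matching is case-insensitive here, which is an assumption of this sketch.
    """
    return Path(path).suffix.lower() in CPP_EXTENSIONS


def within_size_limits(size_bytes: int, min_size: int = 100, max_size: int = 100000) -> bool:
    """Check a file size against the limits; inclusive bounds are an assumption."""
    return min_size <= size_bytes <= max_size
```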

## GPU Support

The tool automatically detects and uses available GPUs:

- **NVIDIA GPUs**: CUDA support (recommended)
- **Apple Silicon**: Metal Performance Shaders (MPS)
- **CPU Fallback**: Works without a GPU, but more slowly

Check GPU availability:

```bash
python main_app.py --device-info
```
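
The detection order described above (CUDA, then MPS, then CPU) can be sketched with standard PyTorch calls. This illustrates the idea and is not the tool's `GPUManager`; the function name `pick_device` is hypothetical, and returning `"cpu"` when PyTorch is not installed is an assumption made so the snippet runs anywhere.

```python
def pick_device(preferred=None) -> str:
    """Return 'cuda', 'mps', or 'cpu', honoring an explicit preference if given."""
    if preferred:
        return preferred
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch there is nothing to accelerate.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps only exists in newer PyTorch builds, so guard the lookup.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```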

## Model Options

### Base Models (for scoring)

- `codellama/CodeLlama-7b-hf` (default, recommended for C++)
- `codellama/CodeLlama-13b-hf` (larger, more accurate, slower)
- `Salesforce/codegen-2B-mono`
- `microsoft/CodeGPT-small-py`

### Mask Filling Models

- `Salesforce/codet5p-770m` (default, recommended)
- `Salesforce/codet5-base`
- `Salesforce/codet5-large`

## Performance Tips

1. **Use a GPU**: A CUDA-capable GPU gives a significant speedup
2. **Adjust Batch Size**: Increase `--batch-size` for faster processing (if memory allows)
3. **Reduce Perturbations**: Lower `--n-perturbations` for faster but less accurate results
4. **File Size Limits**: Adjust `--min-size` and `--max-size` to focus on relevant files
5. **File Limit**: Use `--max-files` for quick testing on large codebases

## Troubleshooting

### Out of Memory Errors

- Reduce `--batch-size`
- Reduce `--n-perturbations`
- Use a smaller base model (`--base-model`)
- Process fewer files at once with `--max-files`

### Slow Processing

- Ensure the GPU is being used (check with `--device-info`)
- Increase `--batch-size` if memory allows
- Reduce `--n-perturbations`

### Import Errors

Make sure all dependencies are installed:

```bash
pip install -r ../requirements.txt
```

## Examples

### Scan a single project

```bash
python main_app.py --cli --directory ~/my_cpp_project --output results.json
```

### Scan with detailed reporting

```bash
python main_app.py --cli \
  --directory ~/cpp_projects \
  --output results.json \
  --report detailed_report.txt \
  --n-perturbations 100
```

### Quick scan for testing

```bash
python main_app.py --cli \
  --directory ~/large_codebase \
  --max-files 10 \
  --n-perturbations 25
```
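
### Run a scan from a Python script

The same invocations can be assembled from Python when the scan is part of a larger pipeline. The sketch below only builds the argument list from flags documented in this README; `build_command` is a hypothetical helper, not part of the tool.

```python
import sys


def build_command(directory, output=None, report=None,
                  n_perturbations=None, max_files=None, device=None):
    """Build the argv list for a CLI-mode scan, skipping options left as None."""
    cmd = [sys.executable, "main_app.py", "--cli", "--directory", str(directory)]
    optional = {
        "--output": output,
        "--report": report,
        "--n-perturbations": n_perturbations,
        "--max-files": max_files,
        "--device": device,
    }
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


# Equivalent of the "quick scan" example above.
print(build_command("/path/to/large_codebase", max_files=10, n_perturbations=25))
```

To execute it, pass the list to `subprocess.run(cmd, cwd="cpp-detector", check=True)`. Because no shell is involved, `~` is not expanded; use `os.path.expanduser` on the directory first if needed.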

## License

This project extends DetectCodeGPT and is licensed under the MIT License. See the LICENSE file in the repository root for details.

## Citation

If you use this tool in your research, please cite the original DetectCodeGPT paper:

```bibtex
@inproceedings{shi2025detectcodegpt,
  title={Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers},
  author={Shi, Yuling and Zhang, Hongyu and Wan, Chengcheng and Gu, Xiaodong},
  booktitle={Proceedings of the 47th International Conference on Software Engineering (ICSE 2025)},
  year={2025},
  organization={IEEE}
}
```

## Acknowledgements

Based on DetectCodeGPT by Shi et al., with enhancements for C++ code analysis, GPU acceleration, and user interface improvements.
19 changes: 19 additions & 0 deletions cpp-detector/__init__.py
@@ -0,0 +1,19 @@
"""
DetectCodeGPT C++ Edition
AI-generated C++ code detection with GPU acceleration and GUI.
"""

__version__ = "1.0.0"
__author__ = "DetectCodeGPT Contributors"

from .cpp_detector import CppCodeDetector, GPUManager
from .file_scanner import CppFileScanner, scan_cpp_files
from .batch_processor import BatchProcessor

__all__ = [
    'CppCodeDetector',
    'GPUManager',
    'CppFileScanner',
    'scan_cpp_files',
    'BatchProcessor',
]