Codetector is a modular AI-generated content (AIGC) detection framework that allows easy integration of custom datasets, large language models (LLMs), and detection methods. With features like crash recovery and unit-tested code, it provides a flexible and robust pipeline for AIGC generation and detection. Initially designed for code detection, the framework was used to produce a large corpus of 113,776 code samples and 12,294,388 detection samples using 17 LLMs and five detection methods. You can read more about the process and results in our paper titled "Codetector: A Framework for Zero-shot Detection of AI-Generated Code".
- Modular: The framework is designed to be flexible and accommodate changes in any layer of the pipeline. Don't like the existing `run_generation_pipeline.py` or `run_detection_pipeline.py` file? Then change it to your heart's content and utilise the underlying use cases and managers as you wish.
- Expandable: Easily integrate new datasets/sources, LLMs, and detection methods into the pipeline. Simply implement/extend the necessary abstract classes and mixins defined in the framework and reference your implementation in the pipeline files.
- Crash Recovery: A crashed pipeline run can be resumed from the latest batch. Data from the in-progress batch will still be lost, as state is only saved after each completed batch.
- Reproducibility: The same starting parameters and pipeline configuration will produce the same output when run on the same machine. Reproducibility across different machines has not been tested yet but should hold in theory.
- Tested: The framework is unit tested (with more tests on the way) to increase confidence in its underlying functionality. However, it is up to the end user to ensure that the LLMs they add and use are configured and implemented properly.
- Clone the repository: `git clone https://github.com/Progyo/Codetector`
We provide several ways to use our framework. Note that some LLMs may have their own dependencies. For the framework itself, depending on your preferences, choose one of the following ways to install the dependencies:
We provide an `environment.yaml` file that automatically sets up the environment. It is configured to install all the necessary packages to support CUDA 11.8 + PyTorch + Hugging Face Transformers. This is the configuration that we used to generate our corpus.
- Run `conda env create -f environment.yaml` to install the packages.
- Activate the environment with `conda activate codetector`.
- ❗ You must patch `typing.py` to support generic typedefs in Python. You may run `patcher.py` at your own discretion to automatically patch your `typing.py` file. Confirm that you are running the script in the environment. Beware that the script modifies base Python packages and may break your Python install. If you prefer to do it manually:
  - Locate the `typing.py` file used by your conda environment. It is usually located under `lib/python3.10/typing.py` for Python 3.10.
  - Replace the class `NewType` in `typing.py` with the code found here.
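If you are unsure which `typing.py` file your environment uses, Python can report it directly. Run the following inside the activated environment:

# Print the location of the typing.py used by the active environment.
# This is the file that patcher.py (or a manual edit) needs to patch.
import typing
print(typing.__file__)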
For full control over the packages being installed, follow these step-by-step instructions to install the project using conda.
- Create the conda environment using `conda create --name codetector python=3.10` and activate it with `conda activate codetector`.
- ❗ You must patch `typing.py` to support generic typedefs in Python. You may run `patcher.py` at your own discretion to automatically patch your `typing.py` file. Beware that the script modifies base Python packages and may break your Python install. If you prefer to do it manually:
  - Locate the `typing.py` file used by your conda environment. It is usually located under `lib/python3.10/typing.py` for Python 3.10.
  - Replace the class `NewType` in `typing.py` with the code found here.
- For the general framework, install the following (or run `pip install -r requirements.txt`):
  - For progress bars, install tqdm using `pip install tqdm`
  - For general math operation support, install numpy using `pip install numpy`
  - For monad support, install oslash using `pip install oslash`
  - For DataFrame support, install pandas using `pip install pandas`
  - For Apache Parquet dataset file format support, install pyarrow using `pip install pyarrow`
  - For Hugging Face Datasets as dataset provider support, install datasets using `pip install datasets`
  - For NLTKTokenizer and TikTokenTokenizer support, install nltk and tiktoken using `pip install nltk tiktoken`
  - For plotting graphs and other figures, install matplotlib using `pip install matplotlib`
  - For calculating ROC and other metrics, install scikit-learn using `pip install scikit-learn`
- To add CUDA support, install the following:
  - Install cudatoolkit using `conda install cudatoolkit=11.8.0`
  - Install cuda-nvcc 11.8 using `conda install nvidia/label/cuda-11.8.0::cuda-nvcc`
- For PyTorch with CUDA support (see the optional sanity check after this list):
  - Install PyTorch using `pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121`
- To add Hugging Face Transformers LLM inference support (with quantization), install the following:
  - Install transformers and optimum using `pip install 'transformers>=4.32.0' 'optimum>=1.12.0'`
  - Install ctransformers using `pip install 'ctransformers[cuda]>=0.2.24'`
  - Install bitsandbytes using `pip install bitsandbytes`
  - Install accelerate using `pip install 'accelerate>=0.26.0'`
- If using the OpenAI API:
  - For API access to OpenAI (for labelling or generation), install openai using `pip install openai`
  - Set the following environment variables:
    - OPENAI_API_KEY: `conda env config vars set OPENAI_API_KEY=<YOUR API KEY>`
    - OPENAI_ORG_ID: `conda env config vars set OPENAI_ORG_ID=<YOUR ORG ID>`
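Once the optional components are installed and the environment has been re-activated, you can run a quick, optional sanity check to confirm that the CUDA build of PyTorch is active and that the OpenAI variables are visible to Python (only relevant for the components you installed):

import os
import torch

# CUDA-enabled PyTorch: the version should end in +cu121 and a GPU should be visible.
print(torch.__version__)
print(torch.cuda.is_available())

# OpenAI credentials: conda env vars only take effect after re-activating the environment.
print('OPENAI_API_KEY set:', bool(os.environ.get('OPENAI_API_KEY')))
print('OPENAI_ORG_ID set:', bool(os.environ.get('OPENAI_ORG_ID')))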
The datasets from various stages of our generation and detection process can be found below. Alongside the datasets, we supply hash lists stored in Python Pickle files that can be used to filter them to match the distributions that we used in our paper. Place all of the hash list files in the `data` folder in the root directory. It is important to note that the individual dataset implementations themselves are not part of the framework: they live outside it, in the `dataset` folder in the root directory. Feel free to add your own implementations in this folder (see Custom Dataset).
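If you want to inspect one of the hash lists before using it in a filter, it can be loaded with the standard pickle module. A minimal sketch (the exact container type of the pickled hashes is an assumption; the file name matches the APPS hash list used below):

import pickle

# Load one of the supplied hash list files and inspect its contents.
with open('data/hf_apps_hash.pkl', 'rb') as f:
    hashes = pickle.load(f)

print(type(hashes))  # e.g. a list or set of sample hashes (assumed)
print(len(hashes))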
Source datasets exclusively contain human-generated code samples. We used a mixture of existing datasets and newly collected ones. In the following we describe how to download and set up the individual datasets.

For APPS and CodeSearchNet, we utilise Hugging Face Datasets. Ensure that you have installed the dependency with `pip install datasets`. Then simply import the dataset. You can find the hash lists here.
Usage:
from dataset.apps import APPSDataset
from dataset.codesearchnet import CodeSearchNetPythonDataset
from codetector.filters import DistributionFilter
# Define the datasets
apps = APPSDataset(filters=[DistributionFilter('data/hf_apps_hash.pkl')])
codesearch = CodeSearchNetPythonDataset(filters=[DistributionFilter('data/hf_codesearchnet-python_hash.pkl')])
# Load the datasets
apps.loadDataset()
codesearch.loadDataset()
# Load a batch of 10 samples
batch = apps.loadBatch(10)
# Loop through the samples and print the content
for sample in batch.samples:
    print(sample.content)
For LeetCode we utilise two sources. For the "Pre" dataset we utilise Hugging Face Datasets, like APPS and CodeSearchNet. For the "Post" dataset we use our own web-scraped dataset and extend the `Dataset` abstract class defined in the framework to implement JSON reading capabilities for this specific dataset. You can find the hash lists here.

For LeetCode Post, you must download the dataset from here and place it under `data/leetcode_post/samples.json`.
Usage:
from dataset.leetcode_pre import LeetcodePreDataset
from dataset.leetcode_post import LeetCodePostDataset
from codetector.filters import DistributionFilter
# Define the datasets
leetcodePre = LeetcodePreDataset(filters=[DistributionFilter('data/hf_leetcode-pre_hash.pkl')])
leetcodePost = LeetCodePostDataset(filters=[DistributionFilter('data/leetcode-post_hash.pkl')])
...
For Stack Overflow, like LeetCode, we split the dataset into "Pre" and "Post". We supply the split datasets individually as two compressed Apache Parquet partition datasets. You can download the Pre dataset here and the Post dataset here. Place the `.parquet` files in `data/stackoverflow_pre` and `data/stackoverflow_post` respectively. You can find the hash lists here.
Usage:
from dataset.stackoverflow import ParquetStackOverflowPreDataset, ParquetStackOverflowPostDataset
from codetector.filters import DistributionFilter
# Define the datasets
stackoverflowPre = ParquetStackOverflowPreDataset(filters=[DistributionFilter('data/stackoverflow-pre_hash.pkl')])
stackoverflowPost = ParquetStackOverflowPostDataset(filters=[DistributionFilter('data/stackoverflow-post_hash.pkl')])
...
The generation and detection pipelines require the generated and detection samples to be saved to another dataset. For ease of use, we implemented generated and detection dataset classes (specifically for CodeSamples and CodeDetectionSamples).
Usage:
from dataset.generated_dataset import ParquetGeneratedCodeDataset
from dataset.detection_dataset import ParquetCodeDetectionDataset
# Define the datasets
generated = ParquetGeneratedCodeDataset()
detection = ParquetCodeDetectionDataset()
...
If you would prefer these datasets to be human-readable at the cost of file size, you can use the XML equivalent classes:
from dataset.generated_dataset import XMLGeneratedCodeDataset
from dataset.detection_dataset import XMLCodeDetectionDataset
# Define the datasets
generated = XMLGeneratedCodeDataset()
detection = XMLCodeDetectionDataset()
...
The aggregate dataset serves as a way to easily merge datasets together and treat them as one.
Usage:
from codetector.dataset import AggregateDataset
from dataset.stackoverflow import ParquetStackOverflowPreDataset, ParquetStackOverflowPostDataset
from codetector.filters import DistributionFilter
# Define the datasets
stackoverflowPre = ParquetStackOverflowPreDataset(filters=[DistributionFilter('data/stackoverflow-pre_hash.pkl')])
stackoverflowPost = ParquetStackOverflowPostDataset(filters=[DistributionFilter('data/stackoverflow-post_hash.pkl')])
stackoverflow = AggregateDataset([stackoverflowPre,stackoverflowPost])
...
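The aggregate can then be used like any single dataset shown above (assuming the same `loadDataset`/`loadBatch` interface from the earlier examples):

# Load the merged dataset and iterate over a batch as usual.
stackoverflow.loadDataset()
batch = stackoverflow.loadBatch(10)
for sample in batch.samples:
    print(sample.content)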
There are two levels at which one can implement their own dataset/source. The easiest way is to simply extend and implement the necessary abstract methods of an existing dataset format/source, e.g. `XMLDataset`, `ParquetDataset`, or `HuggingFaceDataset`. Note that each existing format/source may have its own additional methods that need to be implemented. The second option is to implement the abstract `Dataset` class, which gives you the most flexibility.
Option 1:
from codetector.dataset import XMLDataset
from codetector.samples import CodeSample
class CustomXML(XMLDataset):

    def getContentType(self):
        # Define the object type contained in the dataset.
        # This allows for serialisation into arbitrary file formats.
        return CodeSample

    def preProcess(self):
        # Called when loading a dataset for the first time.
        pass

    def getTag(self):
        # The tag of the dataset.
        return 'custom_xml'

xml = CustomXML('data/sample_data')
Option 2:
from codetector.dataset.abstract import Dataset, Sample
class CustomDataset(Dataset):
    # Implementation
    ...
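The exact abstract methods required by `Dataset` are defined in the framework itself; the rough, hypothetical sketch below only borrows method names that appear elsewhere in this README (`getTag`, `getContentType`, `loadDataset`, `loadBatch`), so check the actual abstract class before implementing:

from codetector.dataset.abstract import Dataset
from codetector.samples import CodeSample

class CustomJSONLDataset(Dataset):
    # Hypothetical sketch: the real abstract interface may differ.

    def getTag(self):
        # The tag of the dataset.
        return 'custom_jsonl'

    def getContentType(self):
        # The sample type stored in the dataset.
        return CodeSample

    def loadDataset(self):
        # Read the raw source (e.g. a JSON Lines file) into memory here.
        ...

    def loadBatch(self, size):
        # Return the next batch of `size` samples.
        ...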
All our figures can be downloaded here. Optionally, you can also generate them yourself using the supplied `.ipynb` notebooks in the `notebooks` folder.
This project is licensed under the MIT License - see the LICENSE file for details.