A novel text compression system that leverages Large Language Models (specifically GPT-2) to achieve high compression ratios by predicting token sequences and storing only prediction ranks instead of raw tokens.
This project implements an innovative approach to text compression that combines the predictive power of modern language models with traditional compression techniques. Instead of storing the actual tokens, the system stores the rank of each token within the language model's probability-ordered predictions, which are then further compressed using arithmetic coding.
- High Compression Ratio: Achieves ~74% size reduction on the test data (Alice in Wonderland: 41,790 bytes → 10,620 bytes)
- Lossless Compression: Perfect reconstruction of original text
- LLM-Powered: Uses GPT-2's predictive capabilities for intelligent compression
- Sliding Window Approach: Maintains context with configurable memory window size
- Arithmetic Coding: Secondary compression layer that packs the rank sequence into a compact binary form
- Tokenization: Input text is tokenized using GPT-2's tokenizer
- Sliding Window: A memory window of size M (default: 16) slides through the token sequence
- Prediction: For each position, GPT-2 predicts the probability distribution of the next token
- Rank Calculation: Instead of storing the actual token, the system stores its rank in the sorted probability distribution (a short sketch follows these steps)
- Arithmetic Coding: The sequence of ranks is further compressed using arithmetic coding
- Storage: The compressed data is stored in a binary file
- Decode: Arithmetic coding is reversed to recover the rank sequence
- Reconstruction: For each rank, GPT-2 generates predictions and selects the token at that rank
- Sliding Window: The context window is updated with each predicted token
- Detokenization: The token sequence is converted back to readable text
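Both stages hinge on the same primitive: looking up a token's rank in the model's sorted next-token distribution, and the inverse lookup of the token sitting at a given rank. A minimal sketch of that single step, assuming a Hugging Face GPT-2 model and 1-D tensors of token ids (function names are illustrative, not the repository's):

```python
import torch

def rank_of_next_token(model, context_ids, next_id):
    # Rank (0 = most likely) of the true next token in the model's prediction for this context.
    with torch.no_grad():
        logits = model(context_ids.unsqueeze(0)).logits[0, -1]
    order = torch.argsort(logits, descending=True)
    return (order == next_id).nonzero(as_tuple=True)[0].item()

def token_at_rank(model, context_ids, rank):
    # Inverse operation used during decompression: pick the token that sits at `rank`.
    with torch.no_grad():
        logits = model(context_ids.unsqueeze(0)).logits[0, -1]
    order = torch.argsort(logits, descending=True)
    return order[rank].item()
```

Because the same model and the same context produce the same distribution on both sides, the rank recorded during compression always points back to the original token during decompression.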
- Python 3.8+
- CUDA-compatible GPU (recommended for performance)
- ~2GB free disk space for GPT-2 model
Install the required packages:
pip install -r requirements.txt
The main dependencies include:
- `torch`: PyTorch framework
- `transformers`: Hugging Face Transformers for GPT-2
- `numpy`: Numerical computations
- Standard library modules for file I/O and data structures
The system is designed to work out-of-the-box with the provided sample text:
python main.py
This will:
- Load the GPT-2 model and tokenizer
- Read `alice_in_wonderland.txt`
- Compress it to `compressed.bin`
- Decompress it to `decompressed.txt`
To compress your own text file:
- Replace `alice_in_wonderland.txt` with your text file, or
- Modify the filename in `main.py` (line 17):

  with open("your_text_file.txt", "r") as file:
      text = file.read()
Adjust the context window size by modifying the `M` parameter in `main.py`:

M = 16  # Default value, can be adjusted for different compression/quality tradeoffs
- Larger M: Better context, potentially better compression, but slower processing
- Smaller M: Faster processing, but may reduce compression efficiency
Modify the precision in `arithmetic_coding.py`:
class ArithmeticCoder:
def __init__(self, precision=32): # Adjust precision as needed
LLM-Text-Compressor/
│
├── main.py # Main entry point and orchestration
├── compress.py # Core compression logic
├── decompress.py # Core decompression logic
├── arithmetic_coding.py # Arithmetic coding implementation
├── requirements.txt # Python dependencies
├── alice_in_wonderland.txt # Sample input text
├── compressed.bin # Compressed output (generated)
├── decompressed.txt # Decompressed output (generated)
└── README.md # This documentation
def compress(input_ids, model, M=4) -> List[int]:
- Uses sliding window approach with memory size M
- For each token position, generates GPT-2 predictions
- Computes rank of actual token in sorted prediction probabilities
- Returns the list of ranks instead of the original tokens (a simplified sketch follows)
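A simplified version of that loop, assuming the first M tokens are kept as the uncompressed seed context (consistent with the decompression example further below); this is an illustrative sketch, not the exact `compress.py` code:

```python
import torch

def compress_sketch(input_ids, model, M=4):
    # input_ids: (1, N) tensor from the GPT-2 tokenizer; returns one rank per token after the seed.
    ids = input_ids[0]
    ranks = []
    model.eval()
    with torch.no_grad():
        for i in range(M, len(ids)):
            context = ids[i - M:i].unsqueeze(0)            # sliding window of the last M tokens
            logits = model(context).logits[0, -1]          # next-token scores
            order = torch.argsort(logits, descending=True)
            rank = (order == ids[i]).nonzero(as_tuple=True)[0].item()
            ranks.append(rank)                             # store the rank, not the token id
    return ranks
```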
def decompress(ranks, input_ids, tokenizer, model, M=4) -> str:
- Reconstructs text by using ranks to select tokens from GPT-2 predictions
- Maintains sliding window context during reconstruction
- Returns the fully reconstructed text string (a matching sketch follows)
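The corresponding reconstruction loop, again as an illustrative sketch rather than the exact `decompress.py` code:

```python
import torch

def decompress_sketch(ranks, seed_ids, tokenizer, model, M=4):
    # seed_ids: the first M token ids (1-D tensor); ranks: output of the compression step.
    ids = seed_ids.tolist()
    model.eval()
    with torch.no_grad():
        for rank in ranks:
            context = torch.tensor(ids[-M:]).unsqueeze(0)  # sliding window over tokens decoded so far
            logits = model(context).logits[0, -1]
            order = torch.argsort(logits, descending=True)
            ids.append(order[rank].item())                 # the token the compressor saw at this rank
    return tokenizer.decode(ids)
```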
Implements standard arithmetic coding with:
- Encoding: Converts rank sequence to single compressed integer
- Decoding: Recovers original rank sequence from compressed data
- File I/O: Handles binary storage with frequency tables (an illustrative sketch follows)
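The repository's coder works with fixed-precision integer registers (the `precision=32` setting shown earlier). As a conceptual stand-in, the same interval-narrowing idea can be written with exact rationals, which sidesteps the bit-level renormalization and keeps the sketch short:

```python
from collections import Counter
from fractions import Fraction

def build_intervals(symbols):
    # Map each symbol to its cumulative probability interval [low, high).
    freq = Counter(symbols)
    total = sum(freq.values())
    intervals, cum = {}, 0
    for sym in sorted(freq):
        intervals[sym] = (Fraction(cum, total), Fraction(cum + freq[sym], total))
        cum += freq[sym]
    return intervals

def encode(symbols, intervals):
    # Narrow [low, high) once per symbol; any number inside the final interval encodes the sequence.
    low, high = Fraction(0), Fraction(1)
    for sym in symbols:
        span = high - low
        s_low, s_high = intervals[sym]
        low, high = low + span * s_low, low + span * s_high
    return (low + high) / 2

def decode(code, intervals, count):
    # Repeatedly locate which symbol interval contains the coded number, then narrow and continue.
    symbols = []
    low, high = Fraction(0), Fraction(1)
    for _ in range(count):
        target = (code - low) / (high - low)
        for sym, (s_low, s_high) in intervals.items():
            if s_low <= target < s_high:
                symbols.append(sym)
                span = high - low
                low, high = low + span * s_low, low + span * s_high
                break
    return symbols
```

To turn this into a file format, the frequency table and the symbol count must be stored alongside the encoded number, which is what the frequency-table file I/O above refers to.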
Based on the included Alice in Wonderland sample:
| Metric | Value |
|---|---|
| Original Size | 41,790 bytes |
| Compressed Size | 10,620 bytes |
| Size Reduction | ~74.6% |
| Compressed / Original | ~25.4% |
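Both percentages follow directly from the byte counts above:

```python
original, compressed = 41_790, 10_620
print(f"compressed / original: {compressed / original:.1%}")   # ~25.4%
print(f"size reduction:        {1 - compressed / original:.1%}")  # ~74.6%
```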
- GPU Memory: ~2GB for GPT-2 model
- System RAM: ~1GB for processing
- Disk Space: Original text size + ~25% for compressed output
Processing time scales with:
- Text length (linear)
- Memory window size M (linear)
- GPU performance (significant impact)
from compress import compress
from transformers import GPT2LMHeadModel, GPT2Tokenizer
# Load model
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# Compress text
M = 16
text = "Your text here..."
input_ids = tokenizer.encode(text, return_tensors="pt")
ranks = compress(input_ids, model, M=M)
from decompress import decompress
# Decompress ranks back to text
reconstructed_text = decompress(ranks, input_ids[0][:M], tokenizer, model, M)
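Since the compression is lossless, a round-trip comparison is a reasonable sanity check on the example above (assuming, as described earlier, that `decompress` returns the full original text, seed tokens included):

```python
assert reconstructed_text == text, "round trip did not reproduce the original text"
```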
- GPU Dependency: Requires CUDA-compatible GPU for practical performance
- Model Size: GPT-2 model requires significant disk space and memory
- Processing Speed: Compression/decompression is slower than traditional algorithms
- Text Domain: Performance may vary significantly across different text types
Well suited for:
- Academic Research: Novel compression algorithm research
- Long-form Text: Books, articles, documents with rich linguistic structure
- Educational Purposes: Understanding LLM applications in compression
Less suited for:
- Real-time Applications: Due to processing overhead
- Binary Data: Designed specifically for natural language text
- Short Text Snippets: Overhead may exceed benefits
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
git clone https://github.com/SatvikG7/LLM-Text-Compressor.git
cd LLM-Text-Compressor
pip install -r requirements.txt
This project is open source. Please refer to the repository license for specific terms.
This implementation is based on the concept of using language model predictions for text compression. The approach demonstrates how modern NLP models can be applied to traditional computer science problems like data compression.
- CUDA Out of Memory: Reduce batch size or use smaller memory window (M)
- Model Download Issues: Ensure stable internet connection for initial GPT-2 download
- Performance Issues: Verify CUDA installation and GPU availability
- Check that all dependencies are correctly installed
- Verify GPU drivers and CUDA installation (see the check below)
- Ensure sufficient disk space for model and output files
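To quickly confirm that PyTorch can see a GPU (a standard PyTorch check, not specific to this project):

```python
import torch

print(torch.cuda.is_available())            # should print True on a working CUDA setup
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))    # name of the GPU PyTorch will use
```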
For questions, issues, or contributions, please visit the project repository.