A novel text compression system that leverages Large Language Models (specifically GPT-2) to achieve high compression ratios by predicting token sequences and storing only prediction ranks instead of raw tokens.
This project implements an innovative approach to text compression that combines the predictive power of modern language models with traditional compression techniques. Instead of storing the actual tokens, the system stores the rank of each token within the language model's probability-ordered predictions, which are then further compressed using arithmetic coding.
- High Compression Ratio: Achieves ~74% size reduction on the test data (Alice in Wonderland: 41,790 bytes → 10,620 bytes)
- Lossless Compression: Perfect reconstruction of original text
- LLM-Powered: Uses GPT-2's predictive capabilities for intelligent compression
- Sliding Window Approach: Maintains context with configurable memory window size
- Arithmetic Coding: Secondary compression layer that packs the rank sequence into a compact binary form
- Tokenization: Input text is tokenized using GPT-2's tokenizer
- Sliding Window: A memory window of size M (default: 16) slides through the token sequence
- Prediction: For each position, GPT-2 predicts the probability distribution of the next token
- Rank Calculation: Instead of storing the actual token, the system stores its rank in the sorted probability distribution (a short sketch follows these steps)
- Arithmetic Coding: The sequence of ranks is further compressed using arithmetic coding
- Storage: The compressed data is stored in a binary file
- Decode: Arithmetic coding is reversed to recover the rank sequence
- Reconstruction: For each rank, GPT-2 generates predictions and selects the token at that rank
- Sliding Window: The context window is updated with each predicted token
- Detokenization: The token sequence is converted back to readable text
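Both stages hinge on the same primitive: looking up a token's rank in the model's sorted next-token distribution, and the inverse lookup of the token sitting at a given rank. A minimal sketch of that single step, assuming a Hugging Face GPT-2 model and 1-D tensors of token ids (function names are illustrative, not the repository's):

```python
import torch

def rank_of_next_token(model, context_ids, next_id):
    # Rank (0 = most likely) of the true next token in the model's prediction for this context.
    with torch.no_grad():
        logits = model(context_ids.unsqueeze(0)).logits[0, -1]
    order = torch.argsort(logits, descending=True)
    return (order == next_id).nonzero(as_tuple=True)[0].item()

def token_at_rank(model, context_ids, rank):
    # Inverse operation used during decompression: pick the token that sits at `rank`.
    with torch.no_grad():
        logits = model(context_ids.unsqueeze(0)).logits[0, -1]
    order = torch.argsort(logits, descending=True)
    return order[rank].item()
```

Because the same model and the same context produce the same distribution on both sides, the rank recorded during compression always points back to the original token during decompression.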
- Python 3.8+
- CUDA-compatible GPU (recommended for performance)
- ~2GB free disk space for GPT-2 model
Install the required packages:
pip install -r requirements.txt
The main dependencies include:
- `torch`: PyTorch framework
- `transformers`: Hugging Face Transformers for GPT-2
- `numpy`: Numerical computations
- Standard library modules for file I/O and data structures
The system is designed to work out-of-the-box with the provided sample text:
python main.py
This will:
- Load the GPT-2 model and tokenizer
- Read `alice_in_wonderland.txt`
- Compress it to `compressed.bin`
- Decompress it to `decompressed.txt`
To compress your own text file:
- Replace `alice_in_wonderland.txt` with your text file, or
- Modify the filename in `main.py` (line 17):

  with open("your_text_file.txt", "r") as file:
      text = file.read()
Adjust the context window size by modifying the `M` parameter in `main.py`:

M = 16  # Default value, can be adjusted for different compression/quality tradeoffs
- Larger M: Better context, potentially better compression, but slower processing
- Smaller M: Faster processing, but may reduce compression efficiency
Modify the precision in `arithmetic_coding.py`:
class ArithmeticCoder:
def __init__(self, precision=32): # Adjust precision as needed
LLM-Text-Compressor/
│
├── main.py # Main entry point and orchestration
├── compress.py # Core compression logic
├── decompress.py # Core decompression logic
├── arithmetic_coding.py # Arithmetic coding implementation
├── requirements.txt # Python dependencies
├── alice_in_wonderland.txt # Sample input text
├── compressed.bin # Compressed output (generated)
├── decompressed.txt # Decompressed output (generated)
└── README.md # This documentation
def compress(input_ids, model, M=4) -> List[int]:
- Uses sliding window approach with memory size M
- For each token position, generates GPT-2 predictions
- Computes rank of actual token in sorted prediction probabilities
- Returns the list of ranks instead of the original tokens (a simplified sketch follows)
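A simplified version of that loop, assuming the first M tokens are kept as the uncompressed seed context (consistent with the decompression example further below); this is an illustrative sketch, not the exact `compress.py` code:

```python
import torch

def compress_sketch(input_ids, model, M=4):
    # input_ids: (1, N) tensor from the GPT-2 tokenizer; returns one rank per token after the seed.
    ids = input_ids[0]
    ranks = []
    model.eval()
    with torch.no_grad():
        for i in range(M, len(ids)):
            context = ids[i - M:i].unsqueeze(0)            # sliding window of the last M tokens
            logits = model(context).logits[0, -1]          # next-token scores
            order = torch.argsort(logits, descending=True)
            rank = (order == ids[i]).nonzero(as_tuple=True)[0].item()
            ranks.append(rank)                             # store the rank, not the token id
    return ranks
```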
def decompress(ranks, input_ids, tokenizer, model, M=4) -> str:
- Reconstructs text by using ranks to select tokens from GPT-2 predictions
- Maintains sliding window context during reconstruction
- Returns the fully reconstructed text string (a matching sketch follows)
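The corresponding reconstruction loop, again as an illustrative sketch rather than the exact `decompress.py` code:

```python
import torch

def decompress_sketch(ranks, seed_ids, tokenizer, model, M=4):
    # seed_ids: the first M token ids (1-D tensor); ranks: output of the compression step.
    ids = seed_ids.tolist()
    model.eval()
    with torch.no_grad():
        for rank in ranks:
            context = torch.tensor(ids[-M:]).unsqueeze(0)  # sliding window over tokens decoded so far
            logits = model(context).logits[0, -1]
            order = torch.argsort(logits, descending=True)
            ids.append(order[rank].item())                 # the token the compressor saw at this rank
    return tokenizer.decode(ids)
```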
Implements standard arithmetic coding with:
- Encoding: Converts rank sequence to single compressed integer
- Decoding: Recovers original rank sequence from compressed data
- File I/O: Handles binary storage with frequency tables (an illustrative sketch follows)
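The repository's coder works with fixed-precision integer registers (the `precision=32` setting shown earlier). As a conceptual stand-in, the same interval-narrowing idea can be written with exact rationals, which sidesteps the bit-level renormalization and keeps the sketch short:

```python
from collections import Counter
from fractions import Fraction

def build_intervals(symbols):
    # Map each symbol to its cumulative probability interval [low, high).
    freq = Counter(symbols)
    total = sum(freq.values())
    intervals, cum = {}, 0
    for sym in sorted(freq):
        intervals[sym] = (Fraction(cum, total), Fraction(cum + freq[sym], total))
        cum += freq[sym]
    return intervals

def encode(symbols, intervals):
    # Narrow [low, high) once per symbol; any number inside the final interval encodes the sequence.
    low, high = Fraction(0), Fraction(1)
    for sym in symbols:
        span = high - low
        s_low, s_high = intervals[sym]
        low, high = low + span * s_low, low + span * s_high
    return (low + high) / 2

def decode(code, intervals, count):
    # Repeatedly locate which symbol interval contains the coded number, then narrow and continue.
    symbols = []
    low, high = Fraction(0), Fraction(1)
    for _ in range(count):
        target = (code - low) / (high - low)
        for sym, (s_low, s_high) in intervals.items():
            if s_low <= target < s_high:
                symbols.append(sym)
                span = high - low
                low, high = low + span * s_low, low + span * s_high
                break
    return symbols
```

To turn this into a file format, the frequency table and the symbol count must be stored alongside the encoded number, which is what the frequency-table file I/O above refers to.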
Based on the included Alice in Wonderland sample:
| Metric | Value |
|---|---|
| Original Size | 41,790 bytes |
| Compressed Size | 10,620 bytes |
| Size Reduction | ~74.6% |
| Compressed / Original | ~25.4% |
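Both percentages follow directly from the byte counts above:

```python
original, compressed = 41_790, 10_620
print(f"compressed / original: {compressed / original:.1%}")   # ~25.4%
print(f"size reduction:        {1 - compressed / original:.1%}")  # ~74.6%
```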
- GPU Memory: ~2GB for GPT-2 model
- System RAM: ~1GB for processing
- Disk Space: Original text size + ~25% for compressed output
Processing time scales with:
- Text length (linear)
- Memory window size M (linear)
- GPU performance (significant impact)
from compress import compress
from transformers import GPT2LMHeadModel, GPT2Tokenizer
# Load model
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# Compress text
M = 16
text = "Your text here..."
input_ids = tokenizer.encode(text, return_tensors="pt")
ranks = compress(input_ids, model, M=M)
from decompress import decompress
# Decompress ranks back to text
reconstructed_text = decompress(ranks, input_ids[0][:M], tokenizer, model, M)
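Since the compression is lossless, a round-trip comparison is a reasonable sanity check on the example above (assuming, as described earlier, that `decompress` returns the full original text, seed tokens included):

```python
assert reconstructed_text == text, "round trip did not reproduce the original text"
```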
- GPU Dependency: Requires CUDA-compatible GPU for practical performance
- Model Size: GPT-2 model requires significant disk space and memory
- Processing Speed: Compression/decompression is slower than traditional algorithms
- Text Domain: Performance may vary significantly across different text types
Well suited for:
- Academic Research: Novel compression algorithm research
- Long-form Text: Books, articles, documents with rich linguistic structure
- Educational Purposes: Understanding LLM applications in compression
Less suited for:
- Real-time Applications: Due to processing overhead
- Binary Data: Designed specifically for natural language text
- Short Text Snippets: Overhead may exceed benefits
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
git clone https://github.com/SatvikG7/LLM-Text-Compressor.git
cd LLM-Text-Compressor
pip install -r requirements.txt
This project is open source. Please refer to the repository license for specific terms.
This implementation is based on the concept of using language model predictions for text compression. The approach demonstrates how modern NLP models can be applied to traditional computer science problems like data compression.
- CUDA Out of Memory: Reduce batch size or use smaller memory window (M)
- Model Download Issues: Ensure stable internet connection for initial GPT-2 download
- Performance Issues: Verify CUDA installation and GPU availability
- Check that all dependencies are correctly installed
- Verify GPU drivers and CUDA installation (see the check below)
- Ensure sufficient disk space for model and output files
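To quickly confirm that PyTorch can see a GPU (a standard PyTorch check, not specific to this project):

```python
import torch

print(torch.cuda.is_available())            # should print True on a working CUDA setup
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))    # name of the GPU PyTorch will use
```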
For questions, issues, or contributions, please visit the project repository.