A sleek implementation of a GPT-2-style Large Language Model (LLM) built from scratch in Python and PyTorch, featuring transformer blocks, multi-head self-attention, and configurable text generation (greedy, temperature-scaled, and top-k sampling).
- GPT-2 Architecture: Transformer blocks with multi-head attention, GELU-activated feed-forward networks, and layer normalization.
- Efficient Tokenization: Byte-pair encoding via TikToken (5,104-token training dataset).
- Custom Data Pipeline: Sliding window dataset and dataloader for seamless training.
- Text Generation: Deterministic and probabilistic outputs with temperature scaling and top-k sampling.
- Pretraining: 10-epoch training loop with loss and perplexity tracking.
- Clone the repo: `git clone https://github.com/Rohitw3code/LLM-from-scratch.git`, then `cd LLM-from-scratch`.
- Install dependencies: `pip install -r requirements.txt` (`requirements.txt` lists `torch>=2.0.0` and `tiktoken>=0.7.0`).
- Add your text dataset to the project directory.
- Prepare Data: Tokenize text using TikToken's GPT-2 encoder (see `previous_chapters.py`).
- Train: Configure `GPT_CONFIG_124M` and run the training loop in `4_Pretraining_on_unlabeled_Data.ipynb`.
- Generate Text: Use `generate_text_simple` for text generation with customizable sampling (a rough end-to-end sketch follows this list).
- Evaluate: Monitor loss and perplexity during training.
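A minimal end-to-end sketch of these steps is shown below. It assumes `previous_chapters.py` exposes a `GPTModel` class and that `generate_text_simple` accepts the model, a batch of token IDs, a `max_new_tokens` count, and a context size; the config fields not listed elsewhere in this README (`emb_dim`, `drop_rate`, `qkv_bias`) are assumptions, so adjust names and values to match the actual code:

```python
import tiktoken
import torch

# GPTModel and the exact generate_text_simple signature are assumptions
# about previous_chapters.py -- adjust them to the actual module.
from previous_chapters import GPTModel, generate_text_simple

GPT_CONFIG_124M = {
    "vocab_size": 50257,     # TikToken GPT-2 BPE vocabulary
    "context_length": 1024,  # maximum sequence length
    "emb_dim": 768,          # standard GPT-2 "small" width (~124M parameters)
    "n_heads": 12,           # attention heads per block
    "n_layers": 12,          # transformer blocks
    "drop_rate": 0.1,        # assumed dropout rate
    "qkv_bias": False,       # assumed; bias on the QKV projections
}

tokenizer = tiktoken.get_encoding("gpt2")
model = GPTModel(GPT_CONFIG_124M)
model.eval()

prompt = "Every effort moves you"  # placeholder prompt
idx = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0)  # shape: (1, num_tokens)

with torch.no_grad():
    out = generate_text_simple(
        model=model,
        idx=idx,
        max_new_tokens=25,
        context_size=GPT_CONFIG_124M["context_length"],
    )

print(tokenizer.decode(out.squeeze(0).tolist()))
```

Swapping `generate_text_simple` for a sampling-based variant enables the temperature-scaled and top-k behavior described under Text Generation below.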
LLM-from-scratch/
├── previous_chapters.py # Model, dataset, and dataloader
├── 1_Data-Tokenization.ipynb # Data loading and tokenization with TikToken
├── 2_Self_Attention_mechanism.ipynb # Multi-head attention details
├── 3_LLM_Architecture.ipynb # Model architecture and generation demo
├── 4_Pretraining_on_unlabeled_Data.ipynb # Training and evaluation
├── requirements.txt # Dependencies
└── README.md # Documentation
- Embeddings: Token and positional embeddings for input processing.
- Multi-Head Attention: Captures complex dependencies with 12 heads.
- Feed-Forward Networks: GELU activation for non-linearity.
- Transformer Blocks: 12-layer stack with 124M parameters (sketched after this list).
- Config: 50,257 vocab size, 1,024 context length.
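For illustration, a block with the shape listed above could look like the following sketch. It uses PyTorch's built-in `nn.MultiheadAttention` and a pre-LayerNorm layout for brevity; the repository's own block in `previous_chapters.py` likely implements the attention heads, causal masking, and dropout by hand, so treat this as a rough reference rather than the actual code:

```python
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    """Pre-LayerNorm block: multi-head attention + GELU feed-forward, each with a residual."""

    def __init__(self, emb_dim=768, n_heads=12, drop_rate=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(emb_dim)
        self.attn = nn.MultiheadAttention(
            emb_dim, n_heads, dropout=drop_rate, batch_first=True
        )
        self.norm2 = nn.LayerNorm(emb_dim)
        self.ff = nn.Sequential(
            nn.Linear(emb_dim, 4 * emb_dim),  # expand
            nn.GELU(),                        # GELU non-linearity
            nn.Linear(4 * emb_dim, emb_dim),  # project back
        )
        self.drop = nn.Dropout(drop_rate)

    def forward(self, x, attn_mask=None):
        # Multi-head self-attention sub-layer with residual connection.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + self.drop(attn_out)
        # Feed-forward sub-layer with residual connection.
        return x + self.drop(self.ff(self.norm2(x)))


# Causal mask for autoregressive decoding (True = position is masked).
seq_len = 8
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
block = TransformerBlock()
out = block(torch.randn(2, seq_len, 768), attn_mask=mask)  # (batch, seq, emb_dim)
```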
- Deterministic: Uses `torch.argmax` to pick the highest-probability token at each step.
- Probabilistic: Temperature scaling and top-k sampling for diverse outputs (see the sketch after this list).
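A minimal sketch of that decoding logic for a single step over a logits vector; the `sample_next_token` helper is hypothetical and not part of the repository:

```python
import torch


def sample_next_token(logits, temperature=1.0, top_k=50):
    """Pick the next token ID from a logits vector of shape (vocab_size,)."""
    if top_k is not None:
        top_k = min(top_k, logits.size(-1))
        top_logits, _ = torch.topk(logits, top_k)
        # Mask everything below the k-th largest logit.
        logits = torch.where(
            logits < top_logits[..., -1:],
            torch.full_like(logits, float("-inf")),
            logits,
        )

    if temperature > 0:
        # Probabilistic: temperature-scaled softmax + multinomial sampling.
        probs = torch.softmax(logits / temperature, dim=-1)
        return torch.multinomial(probs, num_samples=1)

    # Deterministic: greedy argmax decoding.
    return torch.argmax(logits, dim=-1, keepdim=True)
```

Setting `temperature=0` reproduces the deterministic `torch.argmax` behavior; higher temperatures and larger `top_k` values increase output diversity.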
- Dataset: 5,104 tokens via TikToken.
- Training: 10 epochs, batch size 4, sequence length 256, stride 128 (dataloader sketched after this list).
- Evaluation: Loss and perplexity.
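A self-contained sketch of the sliding-window dataset with these hyperparameters; the class name and the `my_dataset.txt` filename are placeholders, and the repository's actual dataset and dataloader live in `previous_chapters.py`:

```python
import tiktoken
import torch
from torch.utils.data import Dataset, DataLoader


class SlidingWindowDataset(Dataset):
    """Chunks a token stream into overlapping (input, target) pairs."""

    def __init__(self, text, tokenizer, max_length, stride):
        token_ids = tokenizer.encode(text)
        self.inputs, self.targets = [], []
        # Slide a max_length window over the tokens, shifting by `stride`;
        # targets are the inputs shifted one position to the right.
        for i in range(0, len(token_ids) - max_length, stride):
            self.inputs.append(torch.tensor(token_ids[i : i + max_length]))
            self.targets.append(torch.tensor(token_ids[i + 1 : i + max_length + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]


tokenizer = tiktoken.get_encoding("gpt2")
with open("my_dataset.txt", "r", encoding="utf-8") as f:  # placeholder filename
    raw_text = f.read()

dataset = SlidingWindowDataset(raw_text, tokenizer, max_length=256, stride=128)
loader = DataLoader(dataset, batch_size=4, shuffle=True, drop_last=True)
```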
- Add top-p sampling (see the sketch after this list).
- Support fine-tuning for specific tasks.
- Scale to larger model configurations.
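As a starting point for the first item, one common way to implement top-p (nucleus) sampling is to keep the smallest set of tokens whose cumulative probability exceeds `p` and mask the rest; the sketch below is hypothetical and not part of the repository:

```python
import torch


def top_p_filter(logits, top_p=0.9):
    """Mask logits outside the nucleus whose cumulative probability exceeds top_p."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    cumulative = torch.cumsum(torch.softmax(sorted_logits, dim=-1), dim=-1)

    # Remove tokens once the cumulative probability passes top_p,
    # but always keep at least the single most likely token.
    remove = cumulative > top_p
    remove[..., 1:] = remove[..., :-1].clone()
    remove[..., 0] = False

    sorted_logits[remove] = float("-inf")
    # Scatter the filtered logits back into the original vocabulary order.
    return torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)
```

The filtered logits can then go through the same temperature/softmax/multinomial step used for top-k sampling.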
- Fork the repo.
- Create a feature branch (`git checkout -b feature`).
- Commit changes (`git commit -m "Add feature"`).
- Push (`git push origin feature`).
- Open a pull request.