
MoE-SVD-Distill

A multi-GPU Python script for distilling knowledge from a large "teacher" language model into a smaller "student" LLM.

Note: The scripts contain a few "gemini-isms", so you may come across the occasional odd line such as "BARE METAL FP32".

System Requirements

UPDATE: Use the new moe_distill_gpu_exp_v2-CORRECT_NAMING.py distillation script. It contains a critical bugfix for a bug present in the first two distillation scripts and significantly improves the quality of the distilled model. Note: This is a resource-intensive script that requires a lot of RAM and VRAM; please review the requirements below before proceeding. Make sure you change the input and output file locations in the script so it knows where your files are.

  • GPU Distillation: The primary multi-GPU script requires a significant amount of VRAM. It has been tested and confirmed to work on systems with 48GB of VRAM.
  • CPU-based Distillation: If adapting this workflow for a CPU-only environment, a minimum of 64GB of system RAM is required. To prevent crashes during memory peaks, a swap file of 200GB or larger is also strongly recommended.

Features

  • Multi-GPU Support: Utilizes PyTorch's multiprocessing to significantly speed up the distillation process by distributing the workload across multiple GPUs.
  • SVD-based Distillation: Employs Singular Value Decomposition (SVD) to project the teacher model's weights into the student's smaller parameter space while preserving as much information as possible.
  • Mixture-of-Experts (MoE) Distillation: Implements a "sophisticated" method for distilling MoE layers. This involves creating "fingerprints" of teacher experts, clustering them using KMeans, and then generating synthetic student experts based on these clusters.
  • LoRA Adapter Generation: The script calculates the difference between the student model's original weights and the newly generated synthetic weights, then extracts LoRA A and B matrices from this difference.
  • Non-Destructive Workflow: For safety and verification, the script is designed to never delete the temporary files generated by each GPU worker. This gives you full control to inspect the intermediate results before consolidating them.
  • Flexible Configuration: Easily configure model paths, layer and expert counts, LoRA rank and alpha, and the number of GPUs to use, all from a centralized configuration block.

How It Works

The core of the script is a multi-stage distillation pipeline that maps the parameter space of a large teacher model onto a smaller student model, with the final output being a LoRA adapter that captures the distilled knowledge.

  1. Initialization: The script begins by reading the configuration parameters and loading the student model's weight map to divide the distillation tasks among the available GPUs.
  2. Multi-GPU Worker Spawn: It spawns a separate process for each GPU. Each worker is assigned a subset of the student model's tensors to process.
  3. Layer-to-Layer Mapping: For each layer in the student model, a corresponding layer or a weighted average of two adjacent layers from the teacher model is identified. This is based on the ratio of teacher layers to student layers.
  4. Distillation of Standard Layers: For standard (non-MoE) layers, the script performs the following steps (a sketch of this projection and LoRA extraction appears after this list):
    • It interpolates the corresponding teacher layer tensors to create a "blended" teacher tensor.
    • This blended tensor is then projected into the student tensor's shape using SVD. A Fourier-based projection is used as a fallback in case of SVD errors.
    • The difference between the projected teacher tensor and the original student tensor is calculated.
    • Finally, LoRA A and B matrices are extracted from this difference tensor using SVD.
  5. Distillation of MoE Layers: These layers undergo a multi-step distillation involving expert fingerprinting, clustering, and weighted projection (see the second sketch after this list):
    • Fingerprinting: For each expert in the relevant teacher layers, a "fingerprint" is created by concatenating its weights. These fingerprints are then interpolated.
    • Clustering: The interpolated fingerprints of the teacher experts are clustered using KMeans to group similar experts. The number of clusters is equal to the number of experts in the student model's MoE layers.
    • Synthetic Expert Generation: For each student expert, the script identifies the teacher experts assigned to its cluster. The weights of these teacher experts are blended and projected to create a synthetic student expert.
    • LoRA Extraction: As with standard layers, the difference between the synthetic expert and the original student expert is used to generate LoRA weights.
  6. Temporary File Storage: Each GPU worker saves its generated LoRA weights to a separate temporary .safetensors file.
  7. Consolidation: After all workers have completed their tasks, the main process loads the LoRA weights from all the temporary files and consolidates them into a single, final LoRA adapter file.
  8. Adapter Configuration: A corresponding adapter_config.json file is generated, which is necessary for loading the LoRA adapter with libraries like PEFT (Parameter-Efficient Fine-Tuning).
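
The snippet below is a minimal sketch of the standard-layer path (step 4): project a blended teacher tensor into the student's shape with SVD, then factor the difference into LoRA A and B matrices. The function names and the interpolation used to resize the SVD factors are illustrative assumptions, not the script's actual implementation.

import torch
import torch.nn.functional as F

def svd_project(teacher: torch.Tensor, target_shape: tuple[int, int]) -> torch.Tensor:
    """Project a (larger) teacher weight matrix into the student's shape via SVD.

    Illustrative approach: truncate the SVD of the teacher weight, then linearly
    resample the singular vectors down to the student's dimensions before recombining.
    """
    out_dim, in_dim = target_shape
    U, S, Vh = torch.linalg.svd(teacher.float(), full_matrices=False)
    k = min(out_dim, in_dim, S.shape[0])
    U_k, S_k, Vh_k = U[:, :k], S[:k], Vh[:k, :]
    # Resample the factor rows/columns to the student's output/input dimensions
    U_s = F.interpolate(U_k.T.unsqueeze(0), size=out_dim, mode="linear").squeeze(0).T   # (out_dim, k)
    Vh_s = F.interpolate(Vh_k.unsqueeze(0), size=in_dim, mode="linear").squeeze(0)      # (k, in_dim)
    return (U_s * S_k) @ Vh_s                                                           # (out_dim, in_dim)

def extract_lora(delta: torch.Tensor, rank: int) -> tuple[torch.Tensor, torch.Tensor]:
    """Factor the (projected_teacher - student) difference into LoRA A and B matrices."""
    U, S, Vh = torch.linalg.svd(delta.float(), full_matrices=False)
    r = min(rank, S.shape[0])
    sqrt_S = torch.sqrt(S[:r])
    lora_B = U[:, :r] * sqrt_S                 # (out_dim, r)
    lora_A = sqrt_S.unsqueeze(1) * Vh[:r, :]   # (r, in_dim)
    return lora_A, lora_B                      # PEFT rescales B @ A by alpha / r at load time

# Hypothetical per-tensor usage:
# blended = 0.7 * teacher_layer_10 + 0.3 * teacher_layer_11      # step 3's weighted average
# delta = svd_project(blended, student_weight.shape) - student_weight
# lora_A, lora_B = extract_lora(delta, rank=256)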
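The second sketch covers the MoE path (step 5): fingerprint each teacher expert, cluster the fingerprints with KMeans into one group per student expert, then blend each group into a synthetic expert. The names, the per-expert dictionary layout, and the "down_proj" key are assumptions made for illustration; the real script also has to manage the memory cost of these fingerprints.

import torch
from sklearn.cluster import KMeans

def fingerprint_expert(expert_weights: dict) -> torch.Tensor:
    """Concatenate an expert's flattened weight matrices into a single 1-D fingerprint."""
    return torch.cat([w.flatten().float() for w in expert_weights.values()])

def cluster_experts(teacher_experts: list, num_student_experts: int) -> dict:
    """Group teacher experts by fingerprint similarity: one cluster per student expert."""
    fingerprints = torch.stack([fingerprint_expert(e) for e in teacher_experts])
    km = KMeans(n_clusters=num_student_experts, n_init=10, random_state=0)
    labels = km.fit_predict(fingerprints.numpy())
    return {c: [i for i, lab in enumerate(labels) if lab == c]
            for c in range(num_student_experts)}

def synthesize_expert(teacher_experts: list, member_ids: list, target_shape) -> torch.Tensor:
    """Blend the clustered teacher experts and project the blend into the student shape."""
    blended = torch.stack(
        [teacher_experts[i]["down_proj"].float() for i in member_ids]   # hypothetical key
    ).mean(dim=0)
    return svd_project(blended, target_shape)   # reuses svd_project from the sketch above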

Installing

Clone the repository:

git clone https://github.com/Basedbase-ai/LLM-SVD-distillation-scripts.git
cd LLM-SVD-distillation-scripts

Create a Conda environment (recommended):

conda create -n moe-distill python=3.12
conda activate moe-distill

Install PyTorch first. The specific version will depend on your hardware. For example:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Then run:

pip install -r requirements.txt

Configuration

All the necessary settings are located in the CONFIGURATION section of the script; a sketch of what this block can look like follows the list below.

  • TEACHER_MODEL_FOLDER: The file path to the directory containing the teacher model.
  • STUDENT_BASE_FOLDER: The file path to the directory containing the student model.
  • OUTPUT_LORA_PATH: The desired file path for the final, consolidated LoRA weights.
  • OUTPUT_LORA_CONFIG_PATH: The desired file path for the LoRA adapter configuration JSON file.
  • MODEL_ARCHITECTURE_CONFIG: A dictionary specifying the number of layers and experts per layer for both the teacher and student models.
  • RANK_MAP: A dictionary to define the rank of the LoRA matrices for different types of layers. A higher rank can capture more information but results in a larger file size.
  • LORA_ALPHA: The LoRA scaling factor.
  • NUM_GPUS: The number of GPUs you wish to utilize for the distillation process.
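
For orientation, a CONFIGURATION block along these lines is what you should expect to edit. The variable names come from the list above; the paths, dictionary layout, and numeric values are placeholders rather than defaults.

# ----------------------------- CONFIGURATION -----------------------------
TEACHER_MODEL_FOLDER    = "/models/teacher"                      # teacher .safetensors directory
STUDENT_BASE_FOLDER     = "/models/student"                      # student .safetensors directory
OUTPUT_LORA_PATH        = "/output/distilled_lora.safetensors"   # final consolidated adapter
OUTPUT_LORA_CONFIG_PATH = "/output/adapter_config.json"          # PEFT adapter config

MODEL_ARCHITECTURE_CONFIG = {          # layout and counts are illustrative
    "teacher": {"num_layers": 60, "experts_per_layer": 128},
    "student": {"num_layers": 36, "experts_per_layer": 64},
}

RANK_MAP = {                           # higher rank = more information, larger file
    "attention": 128,
    "expert": 256,
}

LORA_ALPHA = 256                       # LoRA scaling factor
NUM_GPUS   = 2                         # worker processes to spawn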

Usage

  1. Configure the script: Open the distillation script you are using (LLM_distill_multi_gpu.py, or the newer moe_distill_gpu_exp_v2-CORRECT_NAMING.py recommended above) and modify the variables in the CONFIGURATION section to match your models and desired output paths.

  2. Run the script: Execute the script from your terminal:

    python LLM_distill_multi_gpu.py

  3. Monitor the process: The script will print progress updates for each GPU worker.

  4. Manual Cleanup: After the script completes, the temporary worker files (temp_lora_weights_*.safetensors) remain in the working directory. Once you have verified that the final LoRA file was created successfully, you can delete them manually; a sketch for inspecting or re-consolidating these files yourself follows below.
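
The consolidation itself happens automatically (step 7 above), but if you want to inspect the per-worker files, or rebuild the merged adapter yourself before deleting them, a sketch along these lines works. The output file name is a placeholder; only the temp_lora_weights_* pattern comes from the script.

import glob
from safetensors.torch import load_file, save_file

merged = {}
for path in sorted(glob.glob("temp_lora_weights_*.safetensors")):
    shard = load_file(path)                     # one subset of LoRA tensors per GPU worker
    print(f"{path}: {len(shard)} tensors")
    merged.update(shard)                        # keys are disjoint across workers

save_file(merged, "consolidated_lora.safetensors")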

Correcting the adapter_config.json

The older distillation scripts can fail to generate a correct adapter_config.json file; the new distillation script does not have this issue. A separate script, create_lora_config.py, is provided to repair the configuration after the main .safetensors file has been created.

  1. Configure the fix script: Open the create_lora_config.py script.

  2. Match the paths: Update the configuration variables at the top of the file (LORA_FILE_PATH, STUDENT_BASE_FOLDER, OUTPUT_LORA_CONFIG_PATH, etc.) to match the paths you used in the main distillation script.

  3. Run the fix script: Execute it from your terminal. This script will load the large LoRA file, which may use a lot of RAM.

    python create_lora_config.py

This will generate a correct adapter_config.json, making your distilled LoRA complete and ready to use.
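
For reference, the repaired configuration is an ordinary PEFT-style adapter_config.json. The sketch below shows the general idea, deriving target modules from the LoRA file's tensor keys; the exact key layout, rank, and fields written by create_lora_config.py may differ.

import json
from safetensors import safe_open

LORA_FILE_PATH          = "consolidated_lora.safetensors"   # placeholder paths
STUDENT_BASE_FOLDER     = "/models/student"
OUTPUT_LORA_CONFIG_PATH = "adapter_config.json"

# Collect the module names that actually received LoRA weights
target_modules = set()
with safe_open(LORA_FILE_PATH, framework="pt") as f:
    for key in f.keys():
        # keys typically look like "base_model.model.<module>.lora_A.weight"
        module_path = key.split(".lora_")[0]
        target_modules.add(module_path.split(".")[-1])

config = {
    "peft_type": "LORA",
    "base_model_name_or_path": STUDENT_BASE_FOLDER,
    "r": 256,                      # should match the rank used during distillation
    "lora_alpha": 256,             # should match LORA_ALPHA
    "target_modules": sorted(target_modules),
    "lora_dropout": 0.0,
    "bias": "none",
    "task_type": "CAUSAL_LM",
}

with open(OUTPUT_LORA_CONFIG_PATH, "w") as out:
    json.dump(config, out, indent=2)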

Disclaimer

This script is an advanced tool designed for users with some familiarity with LLM architectures and the concepts of model distillation and LoRA. The high LoRA ranks in the default configuration are an intentional choice to maximize information transfer from the teacher model. Achieving optimal results for your specific models will likely require experimentation with these parameters.
