
bitnet.cpp

License: MIT

BitNet Model on Hugging Face

Try it out via this demo, or build and run it on your own CPU or GPU.

bitnet.cpp is the official inference framework for 1-bit LLMs (e.g., BitNet b1.58). It offers a suite of optimized kernels that support fast and lossless inference of 1.58-bit models on CPU and GPU (NPU support is coming next).

The first release of bitnet.cpp supports inference on CPUs. bitnet.cpp achieves speedups of 1.37x to 5.07x on ARM CPUs, with larger models seeing greater gains. It also reduces energy consumption by 55.4% to 70.0%, further boosting overall efficiency. On x86 CPUs, speedups range from 2.37x to 6.17x with energy reductions between 71.9% and 82.2%. Furthermore, bitnet.cpp can run a 100B BitNet b1.58 model on a single CPU, achieving speeds comparable to human reading (5-7 tokens per second), significantly enhancing the potential for running LLMs on local devices. Please refer to the technical report for more details.
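For intuition, "1.58-bit" refers to log2(3) ≈ 1.58: each weight takes one of only three values {-1, 0, +1}. Below is a minimal NumPy sketch of the absmean ternary quantizer described in the BitNet b1.58 paper; it is illustrative only, not the code path bitnet.cpp actually uses:

import numpy as np

def quantize_ternary(W, eps=1e-8):
    # Absmean quantization: W ~ scale * W_q with W_q in {-1, 0, +1}
    scale = np.abs(W).mean() + eps                 # per-tensor absmean scale
    W_q = np.clip(np.round(W / scale), -1, 1)      # snap each weight to a ternary value
    return W_q.astype(np.int8), scale

W = np.random.randn(4, 8).astype(np.float32)
W_q, scale = quantize_ternary(W)
print(W_q)                                # ternary weight matrix
print(np.abs(W - scale * W_q).mean())     # mean quantization error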

[Figure: m2_performance]

The tested models are dummy setups used in a research context to demonstrate the inference performance of bitnet.cpp.

Demo

A demo of bitnet.cpp running a BitNet b1.58 3B model on Apple M2:

demo.mp4

What's New:

Acknowledgements

This project is based on the llama.cpp framework. We would like to thank all the authors for their contributions to the open-source community. Also, bitnet.cpp's kernels are built on top of the Lookup Table methodologies pioneered in T-MAC. For inference of general low-bit LLMs beyond ternary models, we recommend using T-MAC.
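As a rough illustration of the lookup-table idea (a toy Python sketch, not T-MAC's or bitnet.cpp's actual kernel): a group of g ternary weights can only take 3^g distinct patterns, so the partial dot products of every pattern with a group of activations can be precomputed once per activation tile and then fetched by index, replacing multiplications with table lookups:

import itertools
import numpy as np

def lut_dot(weight_groups, x, g=4):
    # Enumerate all 3^g ternary patterns once (81 patterns for g=4)
    patterns = list(itertools.product((-1, 0, 1), repeat=g))
    index = {p: i for i, p in enumerate(patterns)}
    acc = 0.0
    for k, w in enumerate(weight_groups):
        xg = x[k * g:(k + 1) * g]
        # Real kernels build this table once per activation tile and reuse it
        # across many weight rows; here we rebuild it for clarity.
        table = [float(np.dot(p, xg)) for p in patterns]
        acc += table[index[tuple(w)]]      # one lookup replaces g multiply-adds
    return acc

x = np.random.randn(8)
w = [(-1, 0, 1, 1), (0, 0, -1, 1)]
print(lut_dot(w, x), float(np.dot(np.array(w).ravel(), x)))  # the two should match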

Official Models

CPU kernels: I2_S, TL1, TL2

Model                Parameters   CPU
BitNet-b1.58-2B-4T   2.4B         x86, ARM

Supported Models

❗️We use existing 1-bit LLMs available on Hugging Face to demonstrate the inference capabilities of bitnet.cpp. We hope the release of bitnet.cpp will inspire the development of 1-bit LLMs in large-scale settings in terms of model size and training tokens.

CPU kernels: I2_S, TL1, TL2

Model                        Parameters   CPU
bitnet_b1_58-large           0.7B         x86, ARM
bitnet_b1_58-3B              3.3B         x86, ARM
Llama3-8B-1.58-100B-tokens   8.0B         x86, ARM
Falcon3 Family               1B-10B       x86, ARM
Falcon-E Family              1B-3B        x86, ARM

Installation

Requirements

  • python>=3.9
  • cmake>=3.22
  • clang>=18
    • For Windows users, install Visual Studio 2022. In the installer, toggle on at least the following options (this also automatically installs the required additional tools like CMake):

      • Desktop development with C++
      • C++ CMake tools for Windows
      • Git for Windows
      • C++ Clang Compiler for Windows
      • MSBuild support for LLVM (clang-cl) toolset
    • For Debian/Ubuntu users, you can install clang via the automatic installation script:

      bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"

  • conda (highly recommended)
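Before building, you can sanity-check the toolchain versions with a few lines of Python (a convenience sketch, not part of the repo):

import shutil
import subprocess
import sys

print("python:", sys.version.split()[0])           # needs >= 3.9
for tool in ("cmake", "clang"):
    path = shutil.which(tool)                      # is the tool on PATH?
    if path is None:
        print(f"MISSING: {tool}")
        continue
    out = subprocess.run([tool, "--version"], capture_output=True, text=True)
    print(tool, "->", out.stdout.splitlines()[0])  # first line carries the version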

Build from source

Important

If you are using Windows, please remember to always use a Developer Command Prompt / PowerShell for VS2022 for the following commands. Please refer to the FAQs below if you see any issues.

  1. Clone the repo
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet
  2. Install the dependencies
# (Recommended) Create a new conda environment
conda create -n bitnet-cpp python=3.9
conda activate bitnet-cpp

pip install -r requirements.txt
  3. Build the project
# Manually download the model and run with local path
huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf --local-dir models/BitNet-b1.58-2B-4T
python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s
usage: setup_env.py [-h] [--hf-repo {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}] [--model-dir MODEL_DIR] [--log-dir LOG_DIR] [--quant-type {i2_s,tl1}] [--quant-embd]
                    [--use-pretuned]

Setup the environment for running inference

optional arguments:
  -h, --help            show this help message and exit
  --hf-repo {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}, -hr {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}
                        Model used for inference
  --model-dir MODEL_DIR, -md MODEL_DIR
                        Directory to save/load the model
  --log-dir LOG_DIR, -ld LOG_DIR
                        Directory to save the logging info
  --quant-type {i2_s,tl1}, -q {i2_s,tl1}
                        Quantization type
  --quant-embd          Quantize the embeddings to f16
  --use-pretuned, -p    Use the pretuned kernel parameters

Build Matrix & Compiler Requirements

Complete Build Variants

The BitNet build system generates 15 optimized variants for maximum performance across different hardware:

CPU Builds (12 variants)

Variant                   Target Hardware                  Compiler    SIMD / Features
standard                  Generic x86-64                   GCC/Clang   SSE, AVX
bitnet-portable           Modern CPUs (baseline)           Clang 14+   AVX2, FMA
bitnet-amd-zen1           AMD Ryzen 1000 / EPYC 7001       Clang 14+   Zen 1 optimizations
bitnet-amd-zen2           AMD Ryzen 3000 / EPYC 7002       Clang 14+   Zen 2 optimizations
bitnet-amd-zen3           AMD Ryzen 5000 / EPYC 7003       Clang 14+   Zen 3 optimizations
bitnet-amd-zen4           AMD Ryzen 7000 / EPYC 9004       Clang 17+   Zen 4 optimizations, AVX-512
bitnet-intel-haswell      Intel 4th gen (2013-2015)        Clang 14+   Haswell optimizations
bitnet-intel-broadwell    Intel 5th gen (2014-2016)        Clang 14+   Broadwell optimizations
bitnet-intel-skylake      Intel 6th-9th gen (2015-2019)    Clang 14+   Skylake optimizations
bitnet-intel-icelake      Intel 10th gen mobile (2019)     Clang 14+   Ice Lake optimizations
bitnet-intel-rocketlake   Intel 11th gen (2021)            Clang 14+   Rocket Lake optimizations
bitnet-intel-alderlake    Intel 12th-14th gen (2021+)      Clang 14+   Alder Lake optimizations

GPU Builds (3 variants)

Variant                Technology            Hardware Support
standard-cuda-vulkan   CUDA + Vulkan         NVIDIA GPUs (primary) + AMD/Intel (Vulkan)
standard-opencl        OpenCL                Universal (AMD, Intel, NVIDIA)
bitnet-python-cuda     Python CUDA kernels   NVIDIA GPUs

Compiler Requirements by Platform

Windows

  • Visual Studio 2022 (Community/Professional/Enterprise)
    • Includes: ClangCL, CMake, MSBuild
  • CUDA Toolkit 12.1+ (for GPU builds)
  • Vulkan SDK (for GPU builds)

Linux (Ubuntu 22.04+)

  • Base requirements:

    • clang-14 (default, supports most CPUs)
    • cmake 3.22+
    • gcc 11+
    • python 3.9-3.11
  • For AMD Zen 4 support:

    # Install Clang 17
    wget -qO- https://apt.llvm.org/llvm-snapshot.gpg.key | sudo tee /etc/apt/trusted.gpg.d/apt.llvm.org.asc
    sudo add-apt-repository "deb http://apt.llvm.org/jammy/ llvm-toolchain-jammy-17 main"
    sudo apt update
    sudo apt install clang-17
  • For AMD Zen 5 support (optional, <0.1% market share):

    # Requires Clang 18+ (not yet available in stable Ubuntu 22.04 repos)
    # Build Clang 18 from source, or upgrade to Ubuntu 24.04+
  • For GPU builds:

    # CUDA
    sudo apt install nvidia-cuda-toolkit
    
    # OpenCL
    sudo apt install ocl-icd-opencl-dev
    
    # Vulkan
    wget -qO - https://packages.lunarg.com/lunarg-signing-key-pub.asc | sudo apt-key add -
    sudo wget -qO /etc/apt/sources.list.d/lunarg-vulkan-jammy.list https://packages.lunarg.com/vulkan/lunarg-vulkan-jammy.list
    sudo apt update
    sudo apt install vulkan-sdk

Automated Build Scripts

Windows: Complete Build (All Variants)

# Build all 15 variants (12 CPU + 3 GPU)
.\build_complete.ps1

# Build specific variants only
.\build_complete.ps1 -BuildVariants "bitnet-amd-zen3,standard-gpu"

# Clean build (removes existing artifacts)
.\build_complete.ps1 -Clean

# List available variants
.\build_complete.ps1 -ListVariants

Linux: Complete Build (All Variants)

# Build all 15 variants (12 CPU + 3 GPU)
bash build-all-linux.sh

# Build specific variants only
bash build-all-linux.sh --variants bitnet-amd-zen4,standard-opencl

# Clean build (removes existing artifacts)
bash build-all-linux.sh --clean

# List available variants
bash build-all-linux.sh --list-variants

Build Output Structure

All builds are organized into isolated, self-contained directories:

BitnetRelease/
├── cpu/
│   ├── windows/
│   │   ├── standard/              (58 files - llama.cpp standard)
│   │   ├── bitnet-portable/       (41 files - AVX2 baseline)
│   │   ├── bitnet-amd-zen2/       (41 files - optimized for your CPU)
│   │   ├── bitnet-intel-skylake/  (41 files - optimized for your CPU)
│   │   └── ...
│   └── linux/
│       ├── standard/              (58 files - llama.cpp standard)
│       ├── bitnet-portable/       (41 files - AVX2 baseline)
│       └── ...
└── gpu/
    ├── windows/
    │   ├── standard-cuda-vulkan/  (56 files - CUDA + Vulkan)
    │   ├── standard-opencl/       (55 files - OpenCL)
    │   └── bitnet-python-cuda/    (15 files - Python CUDA)
    └── linux/
        ├── standard-cuda-vulkan/  (56 files - CUDA + Vulkan)
        ├── standard-opencl/       (55 files - OpenCL)
        └── bitnet-python-cuda/    (15 files - Python CUDA)

Each variant directory is fully self-contained: you can zip any folder and distribute it directly.
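For example, packaging one variant for distribution is a one-liner in Python (paths follow the layout above; adjust to your build):

import shutil

# Zip a single self-contained variant directory for distribution.
shutil.make_archive(
    "bitnet-amd-zen3-windows",                            # produces bitnet-amd-zen3-windows.zip
    "zip",
    root_dir="BitnetRelease/cpu/windows/bitnet-amd-zen3",
)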

Performance Tips

  1. CPU Selection: Use the variant matching your specific CPU generation for best performance (see the sketch after this list)

    • Zen 3 optimizations can be 15-20% faster than portable on Ryzen 5000
    • Intel 12th gen (Alder Lake) gets a significant boost from its specific variant
  2. Backwards Compatibility: Newer variants work on older CPUs

    • bitnet-amd-zen3 will run on Zen 2, but may be slightly slower
    • bitnet-intel-alderlake will run on Skylake, but won't be optimal
  3. GPU Selection:

    • NVIDIA users: Use standard-cuda-vulkan (fastest) or bitnet-python-cuda (most flexible)
    • AMD/Intel users: Use standard-opencl (universal compatibility)
    • Multi-GPU: standard-cuda-vulkan supports Vulkan fallback
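If you want to choose a CPU variant programmatically, you can inspect the CPU's feature flags. The sketch below is a rough, Linux-only heuristic with a deliberately simplified (hypothetical) mapping; the variant table above is authoritative:

def pick_variant(cpuinfo_path="/proc/cpuinfo"):
    # Read the kernel's CPU description (Linux only)
    text = open(cpuinfo_path).read().lower()
    flags = set()
    for line in text.splitlines():
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            break
    if "avx512f" in flags and "authenticamd" in text:
        return "bitnet-amd-zen4"      # Zen 4 exposes AVX-512
    if "avx2" in flags:
        return "bitnet-portable"      # safe AVX2 baseline
    return "standard"                 # generic fallback

print(pick_variant())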

Usage

Basic usage

# Run inference with the quantized model
python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "You are a helpful assistant" -cnv
usage: run_inference.py [-h] [-m MODEL] [-n N_PREDICT] -p PROMPT [-t THREADS] [-c CTX_SIZE] [-temp TEMPERATURE] [-cnv]

Run inference

optional arguments:
  -h, --help            show this help message and exit
  -m MODEL, --model MODEL
                        Path to model file
  -n N_PREDICT, --n-predict N_PREDICT
                        Number of tokens to predict when generating text
  -p PROMPT, --prompt PROMPT
                        Prompt to generate text from
  -t THREADS, --threads THREADS
                        Number of threads to use
  -c CTX_SIZE, --ctx-size CTX_SIZE
                        Size of the prompt context
  -temp TEMPERATURE, --temperature TEMPERATURE
                        Temperature, a hyperparameter that controls the randomness of the generated text
  -cnv, --conversation  Whether to enable chat mode (for instruct models).
                        (When this option is turned on, the prompt specified by -p is used as the system prompt.)
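If you prefer to drive inference from your own Python code, a thin subprocess wrapper around run_inference.py is enough (a sketch that uses only the flags documented above):

import subprocess

def run_bitnet(model, prompt, n_predict=128, threads=4):
    # Invoke run_inference.py with the documented flags and return its stdout.
    cmd = ["python", "run_inference.py",
           "-m", model, "-p", prompt,
           "-n", str(n_predict), "-t", str(threads)]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

print(run_bitnet("models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf",
                 "Microsoft Corporation is", n_predict=64))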

Benchmark

We provide scripts to benchmark inference with a given model.

usage: e2e_benchmark.py -m MODEL [-n N_TOKEN] [-p N_PROMPT] [-t THREADS]  
   
Setup the environment for running the inference  
   
required arguments:  
  -m MODEL, --model MODEL  
                        Path to the model file. 
   
optional arguments:  
  -h, --help  
                        Show this help message and exit. 
  -n N_TOKEN, --n-token N_TOKEN  
                        Number of generated tokens. 
  -p N_PROMPT, --n-prompt N_PROMPT  
                        Number of prompt tokens to process. 
  -t THREADS, --threads THREADS  
                        Number of threads to use. 

Here's a brief explanation of each argument:

  • -m, --model: The path to the model file. This is a required argument that must be provided when running the script.
  • -n, --n-token: The number of tokens to generate during the inference. It is an optional argument with a default value of 128.
  • -p, --n-prompt: The number of prompt tokens to use for generating text. This is an optional argument with a default value of 512.
  • -t, --threads: The number of threads to use for running the inference. It is an optional argument with a default value of 2.
  • -h, --help: Show the help message and exit. Use this argument to display usage information.

For example:

python utils/e2e_benchmark.py -m /path/to/model -n 200 -p 256 -t 4  

This command would run the inference benchmark using the model located at /path/to/model, generating 200 tokens from a 256-token prompt, utilizing 4 threads.
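To see how throughput scales on your machine, you can sweep thread counts in a small loop (a sketch wrapping the documented arguments; the model path is a placeholder):

import subprocess

MODEL = "/path/to/model"                 # placeholder: any supported gguf model
for t in (1, 2, 4, 8):
    out = subprocess.run(
        ["python", "utils/e2e_benchmark.py",
         "-m", MODEL, "-n", "128", "-p", "512", "-t", str(t)],
        capture_output=True, text=True,
    )
    print(f"--- threads={t} ---")
    print(out.stdout)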

For model layouts that are not supported by any public model, we provide scripts to generate a dummy model with the given layout and run the benchmark on your machine:

python utils/generate-dummy-bitnet-model.py models/bitnet_b1_58-large --outfile models/dummy-bitnet-125m.tl1.gguf --outtype tl1 --model-size 125M

# Run benchmark with the generated model; use -m to specify the model path, -p the number of prompt tokens to process, -n the number of tokens to generate
python utils/e2e_benchmark.py -m models/dummy-bitnet-125m.tl1.gguf -p 512 -n 128

Convert from .safetensors Checkpoints

# Prepare the .safetensors model file
huggingface-cli download microsoft/bitnet-b1.58-2B-4T-bf16 --local-dir ./models/bitnet-b1.58-2B-4T-bf16

# Convert to gguf model
python ./utils/convert-helper-bitnet.py ./models/bitnet-b1.58-2B-4T-bf16

FAQ (Frequently Asked Questions)📌

Q1: The build fails while compiling llama.cpp with std::chrono errors in log.cpp?

A: This is an issue introduced in a recent version of llama.cpp. Please refer to this commit in the discussion to fix the issue.

Q2: How do I build with clang in a conda environment on Windows?

A: Before building the project, verify your clang installation and access to Visual Studio tools by running:

clang -v

This command checks that you are using the correct version of clang and that the Visual Studio tools are available. If you see an error message such as:

'clang' is not recognized as an internal or external command, operable program or batch file.

It indicates that your command line window is not properly initialized for Visual Studio tools.

• If you are using Command Prompt, run:

"C:\Program Files\Microsoft Visual Studio\2022\Professional\Common7\Tools\VsDevCmd.bat" -startdir=none -arch=x64 -host_arch=x64

• If you are using Windows PowerShell, run the following commands:

Import-Module "C:\Program Files\Microsoft Visual Studio\2022\Professional\Common7\Tools\Microsoft.VisualStudio.DevShell.dll"
Enter-VsDevShell 3f0e31ad -SkipAutomaticLocation -DevCmdArguments "-arch=x64 -host_arch=x64"

These steps will initialize your environment and allow you to use the correct Visual Studio tools.
