
Misc. bug: Finetuning yields different and worse results using CPU backend vs. CUDA backend #15779


Name and Version

$./build/bin/llama-cli --version
version: 5358 (10d2af0)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

Other (Please specify in the next section)

Command line

CPU: ./build/bin/llama-finetune --file ./gsm8k_with_newlines_first_125.txt --model smollm2-135M-base.gguf -b 512 -ub 512 -np 1 --device none

CUDA: CUDA_VISIBLE_DEVICES=0 ./build/bin/llama-finetune --file ./gsm8k_with_newlines_first_125.txt --model smollm2-135M-base.gguf -b 512 -ub 512 -np 1 -ngl 999

Problem description & steps to reproduce

Hey all,

I've noticed that the examples/training/finetune.cpp code yields different and worse results when using the CPU backend than when using the CUDA backend.

As a minimal example to show this, I used finetune.cpp in the following situation:

  • Model: SmolLM2-135M Base (https://huggingface.co/HuggingFaceTB/SmolLM2-135M). This was converted to GGUF using llama.cpp/convert_hf_to_gguf.py specifying f32 as the --outtype.
  • Dataset: The first 125 samples from GSM8K (https://huggingface.co/datasets/openai/gsm8k), stored in a single newline-delimited file (attached below; a reproduction sketch follows this list). This small subset enables quick testing, since CPU training can be quite slow.
  • Hyperparameters: The defaults used by the examples/training code, except for the following:
    • Epochs: increased from 2 to 10
    • Learning rate: increased from 1e-7 to 1e-6
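For reference, a file with this layout can be produced with a short Python script along the lines of the sketch below. This is a minimal sketch, assuming the Hugging Face datasets package is available; the exact question/answer formatting of the attached file is an assumption, and only the 125-sample, newline-delimited layout comes from the description above.

    # Sketch: build a newline-delimited text file from the first 125 GSM8K training samples.
    # Assumption: question and answer are written on consecutive lines; adjust to match
    # the attached gsm8k_with_newlines_first_125.txt if its formatting differs.
    from datasets import load_dataset

    ds = load_dataset("openai/gsm8k", "main", split="train").select(range(125))

    with open("gsm8k_with_newlines_first_125.txt", "w", encoding="utf-8") as f:
        for sample in ds:
            f.write(sample["question"].strip() + "\n")
            f.write(sample["answer"].strip() + "\n")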

I would expect SmolLM2-135M Base to quickly learn (and likely overfit) on this small amount of data. Finetuning with the CUDA backend shows that this is the case, with the training loss decreasing over time.

I would expect similar behavior when finetuning with the CPU backend, given that the model, dataset, and hyperparameters are identical; the only difference is the backend. However, the results show that the model finetuned on the CPU is not learning.

I believe there is a bug specifically in the CPU backend of the finetuning code that is causing this discrepancy.

Results

I've attached the log files from these runs below:

llamacpp_finetune_smollm2135mbase_on_gsmk_125samples_cpu.log
llamacpp_finetune_smollm2135mbase_on_gsmk_125samples_gpu.log

I've attached charts showing this behavior below:

[Charts: training loss over time for the CPU and CUDA backends]
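
The charts were generated from the loss values in the attached logs. A minimal plotting sketch is below; it assumes the loss values appear in the logs as "loss=<number>", so the regex and file names are assumptions and may need adjusting to the actual log format.

    # Sketch: extract loss values from the attached log files and plot them.
    # Assumption: losses are printed as "loss=<number>"; adjust LOSS_RE if the format differs.
    import re
    import matplotlib.pyplot as plt

    LOSS_RE = re.compile(r"loss=([0-9]*\.?[0-9]+(?:[eE][+-]?[0-9]+)?)")

    def extract_losses(path):
        """Return all loss values found in a log file, in order of appearance."""
        with open(path, encoding="utf-8") as f:
            return [float(m.group(1)) for m in LOSS_RE.finditer(f.read())]

    for label, path in [
        ("CPU",  "llamacpp_finetune_smollm2135mbase_on_gsmk_125samples_cpu.log"),
        ("CUDA", "llamacpp_finetune_smollm2135mbase_on_gsmk_125samples_gpu.log"),
    ]:
        plt.plot(extract_losses(path), label=label)

    plt.xlabel("training step")
    plt.ylabel("training loss")
    plt.legend()
    plt.show()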

Data Used

I've attached the data used below:

gsm8k_with_newlines_first_125.txt

First Bad Commit

Commit 10d2af0

Relevant log output

See the attached log files above.
