
Misc. bug: Finetuning yields different and worse results using CPU backend vs. CUDA backend #15779


Name and Version

$./build/bin/llama-cli --version
version: 5358 (10d2af0)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

Other (Please specify in the next section)

Command line

CPU: ./build/bin/llama-finetune --file ./gsm8k_with_newlines_first_125.txt --model smollm2-135M-base.gguf -b 512 -ub 512 -np 1 --device none

CUDA: CUDA_VISIBLE_DEVICES=0 ./build/bin/llama-finetune --file ./gsm8k_with_newlines_first_125.txt --model smollm2-135M-base.gguf -b 512 -ub 512 -np 1 -ngl 999

Problem description & steps to reproduce

Hey all,

I've noticed that the examples/training/finetune.cpp code yields different and worse results when using the CPU backend than when using the CUDA backend.

As a minimal example to show this, I used finetune.cpp in the following situation:

  • Model: SmolLM2-135M Base (https://huggingface.co/HuggingFaceTB/SmolLM2-135M). This was converted to GGUF using llama.cpp/convert_hf_to_gguf.py specifying f32 as the --outtype.
  • Dataset: The first 125 samples from GSM8K (https://huggingface.co/datasets/openai/gsm8k), stored in a single newline-delimited file (attached below; a reproduction sketch follows this list). This small subset enables quick testing, since CPU training can be quite slow.
  • Hyperparameters: The defaults used by the examples/training code, except for the following:
    • Epochs: increased from 2 to 10
    • Learning rate: increased from 1e-7 to 1e-6
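For reference, a file with this layout can be produced with a short Python script along the lines of the sketch below. This is a minimal sketch, assuming the Hugging Face datasets package is available; the exact question/answer formatting of the attached file is an assumption, and only the 125-sample, newline-delimited layout comes from the description above.

    # Sketch: build a newline-delimited text file from the first 125 GSM8K training samples.
    # Assumption: question and answer are written on consecutive lines; adjust to match
    # the attached gsm8k_with_newlines_first_125.txt if its formatting differs.
    from datasets import load_dataset

    ds = load_dataset("openai/gsm8k", "main", split="train").select(range(125))

    with open("gsm8k_with_newlines_first_125.txt", "w", encoding="utf-8") as f:
        for sample in ds:
            f.write(sample["question"].strip() + "\n")
            f.write(sample["answer"].strip() + "\n")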

I would expect SmolLM2-135M Base to quickly learn (and likely overfit) on this small amount of data. Finetuning with the CUDA backend shows that this is the case, with the training loss decreasing over time.

I would expect similar behavior when finetuning with the CPU backend, given that the model, dataset, and hyperparameters are identical; the only difference is the backend. However, the results show that the model finetuned on the CPU is not learning.

I believe there is a bug specifically in the CPU backend of the finetuning code that is causing this discrepancy.

Results

I've attached the log files from these runs below:

llamacpp_finetune_smollm2135mbase_on_gsmk_125samples_cpu.log
llamacpp_finetune_smollm2135mbase_on_gsmk_125samples_gpu.log

I've attached charts showing this behavior below:

[Charts: training loss over time for the CPU and CUDA backends]
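
The charts were generated from the loss values in the attached logs. A minimal plotting sketch is below; it assumes the loss values appear in the logs as "loss=<number>", so the regex and file names are assumptions and may need adjusting to the actual log format.

    # Sketch: extract loss values from the attached log files and plot them.
    # Assumption: losses are printed as "loss=<number>"; adjust LOSS_RE if the format differs.
    import re
    import matplotlib.pyplot as plt

    LOSS_RE = re.compile(r"loss=([0-9]*\.?[0-9]+(?:[eE][+-]?[0-9]+)?)")

    def extract_losses(path):
        """Return all loss values found in a log file, in order of appearance."""
        with open(path, encoding="utf-8") as f:
            return [float(m.group(1)) for m in LOSS_RE.finditer(f.read())]

    for label, path in [
        ("CPU",  "llamacpp_finetune_smollm2135mbase_on_gsmk_125samples_cpu.log"),
        ("CUDA", "llamacpp_finetune_smollm2135mbase_on_gsmk_125samples_gpu.log"),
    ]:
        plt.plot(extract_losses(path), label=label)

    plt.xlabel("training step")
    plt.ylabel("training loss")
    plt.legend()
    plt.show()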

Data Used

I've attached the data used below:

gsm8k_with_newlines_first_125.txt

First Bad Commit

Commit 10d2af0

Relevant log output

See the attached log files above.
