TurboQuant + llama.cpp on ROCm

⚠️

Status: TurboQuant support on ROCm is experimental (fork jagsan-cyber/turboquant-rocm-llamacpp, April 2026) and has not yet been merged into mainline llama.cpp. Expect possible crashes and variable performance. Always test on a non-production system.

Step 0

Hardware & software prerequisites

Before you start, check that your system meets the minimum requirements. This guide targets AMD GPUs from RDNA2 (gfx1030) onward.

🖥️ GPU AMD

RDNA2, RDNA3 or RDNA4. Architectures gfx1030, gfx110x, gfx120x, and Strix Halo (gfx1151).

≥ RX 6000 series

🐧 Linux Distro

Ubuntu 22.04 or 24.04 LTS, Fedora 39+. Kernel 6.x recommended.

Ubuntu 22.04 / 24.04

⚙️ ROCm

AMD ROCm runtime and HIP SDK installed, version 6.0 or later.

ROCm ≥ 6.0

📦 CMake

CMake 3.21 or later, with HIP language support.

cmake ≥ 3.21

🔧 Build tools

gcc/g++ 12+, clang (from the ROCm package), make, git, python3.

gcc ≥ 12

💾 VRAM

At least 8 GB of VRAM for 7B models; 16 GB or more for 13–34B models.

≥ 8 GB VRAM
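
The version requirements above can be checked in one pass. The sketch below is only an illustration: version_ge and report are hypothetical helper names, the comparison relies on GNU sort -V, and reading the ROCm version from /opt/rocm/.info/version is an assumption about the install layout that may not hold on every system.

```shell
#!/usr/bin/env bash
# version_ge A B: succeeds when version A >= version B (GNU `sort -V` ordering).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# report NAME FOUND MINIMUM: print one status line per tool.
report() {
  if version_ge "$2" "$3"; then
    echo "ok       $1 $2 (>= $3)"
  else
    echo "too old  $1 $2 (need >= $3)"
  fi
}

# Each check is skipped when the tool is not installed.
if command -v cmake >/dev/null 2>&1; then
  report cmake "$(cmake --version | awk 'NR==1 {print $3}')" 3.21
fi
if command -v gcc >/dev/null 2>&1; then
  report gcc "$(gcc -dumpfullversion 2>/dev/null || gcc -dumpversion)" 12
fi
# Assumption: ROCm stores its version string (e.g. 6.3.0-39) in this file.
if [ -r /opt/rocm/.info/version ]; then
  report rocm "$(cut -d- -f1 /opt/rocm/.info/version)" 6.0
fi
```

A tool that is not installed is skipped rather than reported, so an empty output means nothing was found at all.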

ℹ️

Check your AMD architecture with: rocminfo | grep gfx. Note the value (e.g. gfx1100 for the RX 7900 series); you will need it in the build step as AMDGPU_TARGETS.
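
If you would rather not copy the value by hand, the lookup can be scripted. This is a minimal sketch: gfx_targets is a hypothetical helper that extracts every gfx identifier from rocminfo-style text, removes duplicates, and joins them with semicolons, which is the separator the build step expects for multiple GPUs.

```shell
#!/usr/bin/env bash
# gfx_targets: read rocminfo-style text on stdin and print the unique
# gfx identifiers joined by ';' (e.g. gfx1030;gfx1100).
gfx_targets() {
  grep -oE 'gfx[0-9a-f]+' | sort -u | paste -sd ';' -
}

# Only query the hardware when rocminfo is actually installed.
if command -v rocminfo >/dev/null 2>&1; then
  export AMDGPU_TARGETS="$(rocminfo | gfx_targets)"
  echo "AMDGPU_TARGETS=$AMDGPU_TARGETS"
fi
```

The result includes every GPU that rocminfo reports, integrated graphics included, so trim the list if you only want to build for one device.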

Build Guide

Step-by-step build

Follow the steps in order.

1 · Dependencies

2 · ROCm Check

3 · Clone Fork

4 · CMake Build

5 · Quantize

6 · Run Test

Install system dependencies

Install the required packages. On Ubuntu 22.04/24.04:

bash

# Update the system and install build tools
sudo apt update && sudo apt upgrade -y
sudo apt install -y   build-essential gcc g++ clang   cmake cmake-extras   git wget curl python3 python3-pip   libopenblas-dev   pkg-config

On Fedora/RHEL:

bash

sudo dnf install -y   gcc gcc-c++ clang cmake git wget curl   python3 python3-pip openblas-devel

Install and verify ROCm

If ROCm is not installed yet, use AMD's official installer (amdgpu-install). If it is already installed, skip to the verification.

bash

# Download and install ROCm 6.3 (Ubuntu 22.04 "jammy")
wget https://repo.radeon.com/amdgpu-install/6.3/ubuntu/jammy/amdgpu-install_6.3.60300-1_all.deb
sudo dpkg -i amdgpu-install_6.3.60300-1_all.deb
sudo amdgpu-install --usecase=rocm,hip --no-dkms

# Add your user to the render and video groups
sudo usermod -aG render,video $USER

# Reboot to apply the changes (the new groups take effect at the next login)
sudo reboot

bash

# Check that ROCm detects the GPU
rocminfo

# Find your target architecture (e.g. gfx1100)
rocminfo | grep -E "gfx|Name"

# Check HIP
hipconfig -v   # HIP version
hipconfig -l   # path to the HIP clang compiler

# Quick test: list devices
rocm-smi

✅

If rocminfo lists your GPU with a gfx**** architecture, ROCm is installed correctly. Note the value, e.g. gfx1100 for the RX 7900 XTX.

Clone the TurboQuant ROCm fork

Use the jagsan-cyber/turboquant-rocm-llamacpp fork, which includes the TurboQuant patch with a ROCm/HIP backend. It also contains Heavy-Hitter Oracle (H2O) for KV cache eviction.

bash

# Create a working directory
mkdir -p ~/llm-inference && cd ~/llm-inference

# Clone the TurboQuant fork with ROCm support
git clone https://github.com/jagsan-cyber/turboquant-rocm-llamacpp.git
cd turboquant-rocm-llamacpp

# Show the recent commits and all branches
git log --oneline -10
git branch -a

💡

Alternatively, you can use the TheTom/llama-cpp-turboquant fork (CUDA + CPU) and apply the ROCm patches from the jagsan-cyber repo manually. See the Troubleshooting section.

Configure and build with CMake + HIP

This is the critical step. Replace gfx1100 with the architecture of your GPU as detected in step 2. If you have several GPUs, separate the targets with semicolons: gfx1100;gfx1030.

bash

# Set your GPU architecture (edit this value!)
# Examples: gfx1030 (RX 6800), gfx1100 (RX 7900 XTX), gfx1101 (RX 7800 XT),
#           gfx1102 (RX 7600), gfx1151 (Strix Halo / Ryzen AI MAX)
export AMDGPU_TARGETS="gfx1100"   # ← edit here

# Set the HIP variables from the ROCm runtime
export HIPCXX="$(hipconfig -l)/clang"
export HIP_PATH="$(hipconfig -R)"

# Create the build directory and configure.
# GGML_BLAS is the CPU BLAS backend (OpenBLAS from step 1);
# the HIP backend already uses hipBLAS/rocBLAS on the GPU.
cmake -S . -B build \
  -DGGML_HIP=ON \
  -DGGML_HIP_TURBOQUANT=ON \
  -DAMDGPU_TARGETS="$AMDGPU_TARGETS" \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_BLAS=ON \
  -DGGML_BLAS_VENDOR=OpenBLAS

# Build (uses all available cores)
cmake --build build -j$(nproc)

# Check the binaries that were produced
ls -la build/bin/

⚠️

If the -DGGML_HIP_TURBOQUANT=ON flag is not recognized (a CMake error, or a warning that the variable was not used by the project), check the fork's README: some versions use -DLLAMA_TURBOQUANT=ON or require a specific branch. List the branches with git branch -a.
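
Once the build finishes, it helps to confirm that the tools used in the later steps are actually there. A minimal sketch, assuming the build/bin output directory from the commands above; check_binaries is a hypothetical helper, not part of llama.cpp.

```shell
#!/usr/bin/env bash
# check_binaries DIR NAME...: report which executables exist in DIR.
# Returns non-zero if at least one is missing.
check_binaries() {
  local dir="$1" missing=0 name
  shift
  for name in "$@"; do
    if [ -x "$dir/$name" ]; then
      echo "found    $name"
    else
      echo "missing  $name"
      missing=1
    fi
  done
  return "$missing"
}

# The four tools this guide uses in steps 5 and 6.
check_binaries build/bin llama-cli llama-bench llama-server llama-quantize \
  || echo "Build incomplete: re-run cmake --build and read the first error."
```

Run it from the root of the cloned repository so that the relative build/bin path resolves.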

Optional advanced CMake flags

bash (optional)

# Additional flags for specific optimizations:
#   GGML_NATIVE   optimize for the local CPU
#   GGML_F16C     enable F16C (hardware FP16 conversion) if supported
#   GGML_AVX2     AVX2 for Ryzen CPUs
#   LLAMA_CURL    enable direct model downloads
cmake -S . -B build \
  -DGGML_HIP=ON \
  -DGGML_HIP_TURBOQUANT=ON \
  -DAMDGPU_TARGETS="$AMDGPU_TARGETS" \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_BLAS=ON \
  -DGGML_BLAS_VENDOR=OpenBLAS \
  -DGGML_NATIVE=ON \
  -DGGML_F16C=ON \
  -DGGML_AVX2=ON \
  -DLLAMA_CURL=ON

Download and quantize a model

To test TurboQuant you need a model in GGUF format. You can download a pre-quantized one from Hugging Face or use one you already have.

bash

# Create a directory for the models
mkdir -p ~/llm-inference/models

# Option A: download a ready-made GGUF (e.g. Qwen2.5-7B Q4_K_M)
cd ~/llm-inference/models
wget -c "https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q4_k_m.gguf"

# Option B: convert from safetensors (if you have the HF model locally)
cd ~/llm-inference/turboquant-rocm-llamacpp
python3 convert_hf_to_gguf.py /path/to/hf/model \
  --outfile ~/llm-inference/models/mio-modello-f16.gguf \
  --outtype f16

# Classic Q4_K_M quantization (optional, if you have the F16 model)
./build/bin/llama-quantize \
  ~/llm-inference/models/mio-modello-f16.gguf \
  ~/llm-inference/models/mio-modello-q4km.gguf \
  Q4_K_M
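
A failed download sometimes leaves an HTML error page saved under a .gguf name. GGUF files begin with the four ASCII bytes GGUF, so a quick header check catches this before you load the model. The sketch below assumes the models directory used in this guide; is_gguf is a hypothetical helper.

```shell
#!/usr/bin/env bash
# is_gguf FILE: succeeds when FILE starts with the 4-byte magic "GGUF".
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Check every model file in the directory created earlier in this step.
for f in "$HOME"/llm-inference/models/*.gguf; do
  [ -e "$f" ] || continue          # no matches: the glob stays literal
  if is_gguf "$f"; then
    echo "valid header:   $f"
  else
    echo "NOT a GGUF file: $f"
  fi
done
```

This only validates the file signature; a truncated download with an intact header still passes, so compare file sizes if a model fails to load.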

Run the TurboQuant KV cache test

Start llama-cli or llama-bench with the TurboQuant flags enabled. The key parameters are --cache-type-k and --cache-type-v, set to one of the TurboQuant types (tq1_0 or tq4_0).

bash

# Inference with the TurboQuant KV cache enabled on an AMD GPU
#   -ngl 99                offload all layers to the GPU
#   --cache-type-k tq1_0   TurboQuant ~1-bit keys
#   --cache-type-v tq4_0   TurboQuant 4-bit values
#   -c 16384               16K-token context window
#   -n 256                 generate 256 tokens
./build/bin/llama-cli \
  -m ~/llm-inference/models/qwen2.5-7b-instruct-q4_k_m.gguf \
  -ngl 99 \
  --cache-type-k tq1_0 \
  --cache-type-v tq4_0 \
  -c 16384 \
  -n 256 \
  -p "Hi! Explain what the KV cache is in an LLM."

💡

Available KV cache types: f16 (standard), q8_0, q4_0, tq1_0 (TurboQuant ~1-bit), tq4_0 (TurboQuant 4-bit). Combining tq1_0 for K with tq4_0 for V gives the highest compression, at some cost in output quality.
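
To see why the cache type matters, estimate the KV cache size as 2 (K and V) × layers × KV heads × head dimension × context length × bits per element / 8. The sketch below assumes Qwen2.5-7B dimensions (28 layers, 4 KV heads, head dimension 128); verify them against your model's config. The bit widths are the standard ggml block formats (f16 = 16, q8_0 = 8.5, q4_0 = 4.5 including block scales); the effective widths of the TurboQuant types are not assumed here. kv_cache_mib is a hypothetical helper.

```shell
#!/usr/bin/env bash
# kv_cache_mib LAYERS KV_HEADS HEAD_DIM N_CTX BITS_PER_ELEMENT
# Prints the approximate KV cache size in MiB (K and V together).
kv_cache_mib() {
  awk -v l="$1" -v h="$2" -v d="$3" -v c="$4" -v b="$5" \
    'BEGIN { printf "%.0f\n", 2 * l * h * d * c * b / 8 / 1048576 }'
}

# Assumed model dimensions: 28 layers, 4 KV heads, head dim 128, 8K context.
kv_cache_mib 28 4 128 8192 16    # f16  -> 448
kv_cache_mib 28 4 128 8192 8.5   # q8_0 -> 238
kv_cache_mib 28 4 128 8192 4.5   # q4_0 -> 126
```

The estimate covers the cache alone, not model weights or compute buffers, and it grows linearly with context length, which is why the savings become more visible at 32K than at 8K.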

bash

# Comparative benchmark: standard f16 vs TurboQuant
# llama-bench takes prompt and generation lengths rather than a context
# size: -p 8192 processes an 8K-token prompt, -n 128 generates 128 tokens.

# Test 1: standard F16 KV cache (baseline), repeated 3 times for an average
./build/bin/llama-bench \
  -m ~/llm-inference/models/qwen2.5-7b-instruct-q4_k_m.gguf \
  -ngl 99 \
  --cache-type-k f16 \
  --cache-type-v f16 \
  -p 8192 \
  -n 128 \
  -r 3

# Test 2: TurboQuant enabled
./build/bin/llama-bench \
  -m ~/llm-inference/models/qwen2.5-7b-instruct-q4_k_m.gguf \
  -ngl 99 \
  --cache-type-k tq1_0 \
  --cache-type-v tq4_0 \
  -p 8192 \
  -n 128 \
  -r 3

# Monitor VRAM in real time (run in a second terminal)
watch -n 0.5 rocm-smi --showmeminfo vram

bash

# OpenAI-compatible server with TurboQuant enabled
# -c 32768: a 32K context is where TurboQuant makes the difference
# --host 0.0.0.0 listens on all interfaces; use 127.0.0.1 for local-only access
./build/bin/llama-server \
  -m ~/llm-inference/models/qwen2.5-7b-instruct-q4_k_m.gguf \
  -ngl 99 \
  --cache-type-k tq1_0 \
  --cache-type-v tq4_0 \
  -c 32768 \
  --host 0.0.0.0 \
  --port 8080

# Quick test with curl (in another terminal)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"local","messages":[{"role":"user","content":"Hi!"}],"max_tokens":100}' \
  | python3 -m json.tool
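
The server needs some time to load the model, so a request sent immediately after launch can fail. One way to handle this is to poll until the server answers. The sketch below assumes the fork keeps the GET /health endpoint of upstream llama-server; retry_until and wait_for_server are hypothetical helper names.

```shell
#!/usr/bin/env bash
# retry_until TRIES DELAY CMD...: run CMD until it succeeds or TRIES
# attempts have been used. DELAY is the pause in seconds between attempts.
retry_until() {
  local tries="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= tries; i++)); do
    "$@" && return 0
    sleep "$delay"
  done
  return 1
}

# wait_for_server BASE_URL [TRIES]: poll the health endpoint once per second.
# Assumption: the server answers GET /health with HTTP 200 once it is ready.
wait_for_server() {
  retry_until "${2:-60}" 1 curl -sf -o /dev/null "$1/health"
}

# Example (run while llama-server is starting):
#   wait_for_server http://localhost:8080 && echo "server ready"
```

With the defaults this waits up to about a minute; raise the second argument for large models or slow disks.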

Expected results

What to expect on ROCm

Indicative benchmarks based on community tests (April 2026). The ROCm values are estimated from the CUDA data by applying the historical gap of ~15-25%.

KV cache config	VRAM (7B, 8K ctx)	Decode TPS	Output quality	ROCm status
`f16 / f16` (baseline)	~6.5 GB	100% (ref)	Perfect	Stable
`q8_0 / q8_0`	~5.2 GB (−20%)	~105%	Excellent	Stable
`q4_0 / q4_0`	~3.8 GB (−42%)	~108%	Good	Stable
`tq4_0 / tq4_0`	~2.4 GB (−63%)	~96%	Good	Experimental
`tq1_0 / tq4_0` (max compression)	~1.1 GB (−83%)	~85-92%	Fair	Experimental

* The ROCm figures are estimates. On Metal (Apple Silicon) the TPS drop is larger (~50%) because of a known bug. CUDA shows +22.8% decode at 32K context.
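
When you run your own measurements, the percentage columns can be reproduced with a one-line calculation. This sketch uses a hypothetical pct_change helper and, as example input, the VRAM figures from the table rather than new measurements.

```shell
#!/usr/bin/env bash
# pct_change OLD NEW: print the relative change from OLD to NEW,
# rounded to a whole percent (negative means a reduction).
pct_change() {
  awk -v old="$1" -v new="$2" \
    'BEGIN { printf "%.0f%%\n", (new - old) / old * 100 }'
}

# VRAM figures from the table, baseline f16/f16 = 6.5 GB
pct_change 6.5 5.2   # q8_0 / q8_0   -> -20%
pct_change 6.5 2.4   # tq4_0 / tq4_0 -> -63%
pct_change 6.5 1.1   # tq1_0 / tq4_0 -> -83%
```

The same helper works for decode throughput: pass the baseline tokens per second as the first argument and the TurboQuant result as the second.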

Common problems

Troubleshooting ROCm + TurboQuant

Frequent errors and how to fix them.

❌ Error: HIPCC not found or hipconfig: command not found

ROCm is not on the PATH or is not installed correctly.

bash

# Add ROCm to the PATH
echo 'export PATH=$PATH:/opt/rocm/bin:/opt/rocm/hip/bin' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/rocm/lib' >> ~/.bashrc
source ~/.bashrc

# Verify
which hipconfig && hipconfig -v

❌ Error: GPU not found or Device 0: gfx000

The user lacks permissions on the device, or the driver is not loading correctly.

bash

# Check that the user is in the render and video groups
groups $USER

# If they are missing, add them, then log out and back in
sudo usermod -aG render,video $USER
logout   # then log back in

# Check the device files
ls -la /dev/kfd /dev/dri/renderD*
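
A common trap is that usermod has been run but the current session still carries the old group list. The sketch below checks the groups of the running shell rather than the account database, so it reports missing until you have logged in again. in_group is a hypothetical helper.

```shell
#!/usr/bin/env bash
# in_group NAME: succeeds when the current session belongs to group NAME.
# `id -nG` without a user argument reports the groups of this process.
in_group() {
  id -nG | tr ' ' '\n' | grep -qx -- "$1"
}

for g in render video; do
  if in_group "$g"; then
    echo "ok:      $g"
  else
    echo "missing: $g (add with usermod, then log out and back in)"
  fi
done
```

If groups $USER lists the group but this check does not, the membership exists and only a fresh login is needed.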

❌ CMake error: -DGGML_HIP_TURBOQUANT not recognized

Some versions of the fork use different flag names. Try these alternatives:

bash

# Alternative 1
cmake -S . -B build -DGGML_HIP=ON -DLLAMA_TURBOQUANT=ON ...

# Alternative 2: look up the correct flag in the CMakeLists files
grep -i "turboquant" CMakeLists.txt
grep -i "turboquant" ggml/CMakeLists.txt

⚠️ Worse performance than expected (very low TPS)

On ROCm the TurboQuant backend is not yet as optimized as on CUDA. Try these mitigations:

bash

# Use tq4_0 instead of tq1_0 for K: less compression, faster
--cache-type-k tq4_0 --cache-type-v tq4_0

# Reduce the context size if you do not need a long window
-c 4096

# Force a specific number of CPU threads for dequantization
-t 8

# Check GPU utilization and VRAM (prefix with watch -n 1 for a live view)
rocm-smi --showuse --showmeminfo vram -d 0

🔄 Alternative: apply the patches manually to official llama.cpp

If you prefer to start from the official llama.cpp repo and apply the TurboQuant patch manually:

bash

# Clone official llama.cpp
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

# Add the fork as a remote
git remote add thetom https://github.com/TheTom/llama-cpp-turboquant.git
git fetch thetom

# Cherry-pick only the TurboQuant commits (see git log thetom/main;
# use the fork's default branch name if it differs)
git log thetom/main --oneline | grep -i turboquant
# then: git cherry-pick <commit-hash>

# Standard ROCm build
cmake -S . -B build \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="$AMDGPU_TARGETS" \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
