Running DeepSeek-R1-0528 (FP8 Hybrid) with KTransformers

This guide provides instructions for running the DeepSeek-R1-0528 model locally using a hybrid FP8 (GPU) and Q4_K_M GGUF (CPU) approach with KTransformers, managed via Docker. The setup is optimized for high-end hardware (e.g., an NVIDIA RTX 4090, a high core-count CPU, and significant RAM).

Model Version: DeepSeek-R1-0528
KTransformers Version (Working): approachingai/ktransformers:v0.2.4post1-AVX512

Table of Contents

  1. Prerequisites
  2. Model Preparation
  3. Running the Model with KTransformers
  4. Testing the Server
  5. Key Server Parameters
  6. Notes on KTransformers v0.3.1
  7. Available Optimize Config YAMLs (for reference)
  8. Troubleshooting Tips

1. Prerequisites

  • Hardware:
    • NVIDIA GPU with FP8 support (e.g., RTX 40-series, Hopper series).
    • High core-count CPU (e.g., Intel Xeon, AMD Threadripper).
    • Significant System RAM (ideally 512 GB, to hold the larger GGUF experts plus context). The Q4_K_M experts alone can consume 320 GB+ for a model of this size.
    • Fast SSD (NVMe recommended) for model storage.
  • Software (on Host; a quick verification sketch follows this list):
    • Linux OS (Ubuntu 24.04 LTS recommended).
    • NVIDIA Drivers (ensure they are up-to-date and support your GPU and CUDA version).
    • Docker Engine.
    • NVIDIA Container Toolkit (for GPU access within Docker).
    • Conda or a Python virtual environment manager.
    • Python 3.9+
    • huggingface_hub and hf_transfer Python packages (for downloading models).
    • Git (for cloning KTransformers if you need to inspect YAMLs or contribute).
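
The commands below give a quick way to sanity-check the host prerequisites; this is only a sketch, and the nvidia/cuda image tag shown is an example rather than a requirement.

# Verify GPU, driver, and CUDA visibility
nvidia-smi

# Verify Docker and GPU access from inside a container
docker --version
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

# Verify Python tooling
python3 --version
pip show huggingface_hub hf_transfer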

2. Model Preparation

We assume your models will be downloaded and stored under /home/mukul/dev-ai/models on your host system. This path will be mounted into the Docker container as /models. Adjust paths if your setup differs.
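
For orientation, the host paths used throughout this guide are laid out as follows (the mukul/ subdirectory simply holds the merged output produced in Step 2c; rename it as you like):

# Host directory layout assumed by this guide (mounted into Docker as /models)
/home/mukul/dev-ai/models/
├── deepseek-ai/DeepSeek-R1-0528/                       # FP8 base model (Step 2a)
├── unsloth/DeepSeek-R1-0528-GGUF/Q4_K_M/               # Q4_K_M GGUF download (Step 2b)
└── mukul/DeepSeek-R1-0528-GGML-FP8-Hybrid/Q4_K_M_FP8/  # Merged hybrid output (Step 2c)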

Step 2a: Download FP8 Base Model (Host)

Download the official DeepSeek-R1-0528 FP8 base model components.

# Ensure the correct packages are installed. Conda is recommended for environment management.
pip install -U huggingface_hub hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1 # For faster downloads
# Define your host model directory
HOST_MODEL_DIR="/home/mukul/dev-ai/models"
BASE_MODEL_HF_ID="deepseek-ai/DeepSeek-R1-0528"
LOCAL_BASE_MODEL_PATH="${HOST_MODEL_DIR}/${BASE_MODEL_HF_ID}"

mkdir -p "${LOCAL_BASE_MODEL_PATH}"

echo "Downloading base model to: ${LOCAL_BASE_MODEL_PATH}"
huggingface-cli download --resume-download "${BASE_MODEL_HF_ID}" \
  --local-dir "${LOCAL_BASE_MODEL_PATH}"
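
Once the download finishes, a quick size and shard-count check helps confirm nothing was truncated (exact totals depend on the upstream repository):

# Quick sanity check of the downloaded base model
du -sh "${LOCAL_BASE_MODEL_PATH}"
ls "${LOCAL_BASE_MODEL_PATH}"/*.safetensors | wc -l   # number of safetensors shards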

Step 2b: Download Q4_K_M GGUF Model (Host)

Download the Unsloth Q4_K_M GGUF version of DeepSeek-R1-0528 using the attached Python script.
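
If the script is not at hand, the following is a rough equivalent using huggingface-cli; it assumes the Q4_K_M shards live in a Q4_K_M/ subfolder of the unsloth/DeepSeek-R1-0528-GGUF repository, matching the path used in Step 2c (check the repository's file listing before running).

# Sketch: download only the Q4_K_M GGUF shards
export HF_HUB_ENABLE_HF_TRANSFER=1
HOST_MODEL_DIR="/home/mukul/dev-ai/models"
GGUF_REPO_ID="unsloth/DeepSeek-R1-0528-GGUF"
LOCAL_GGUF_PATH="${HOST_MODEL_DIR}/${GGUF_REPO_ID}"

mkdir -p "${LOCAL_GGUF_PATH}"
huggingface-cli download --resume-download "${GGUF_REPO_ID}" \
  --include "Q4_K_M/*" \
  --local-dir "${LOCAL_GGUF_PATH}"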

Step 2c: Merge Models (Inside Docker)

This step uses the KTransformers Docker image to merge the FP8 base and Q4_K_M GGUF weights.

docker stop ktransformers || true # Ignore error if no previous instance is running
docker run --rm --gpus '"device=1"' \
  -v /home/mukul/dev-ai/models:/models \
  --name ktransformers \
  -itd approachingai/ktransformers:v0.2.4post1-AVX512

docker exec -it ktransformers /bin/bash
python merge_tensors/merge_safetensor_gguf.py \
  --safetensor_path /models/deepseek-ai/DeepSeek-R1-0528 \
  --gguf_path /models/unsloth/DeepSeek-R1-0528-GGUF/Q4_K_M \
  --output_path /models/mukul/DeepSeek-R1-0528-GGML-FP8-Hybrid/Q4_K_M_FP8

Step 2d: Set Ownership & Permissions (Host)

After Docker creates the merged files, fix ownership and permissions on the host.

HOST_OUTPUT_DIR_QUANT="/home/mukul/dev-ai/models/mukul/DeepSeek-R1-0528-GGML-FP8-Hybrid/Q4_K_M_FP8" # As defined above

echo "Setting ownership for merged files in: ${HOST_OUTPUT_DIR_QUANT}"
sudo chown -R $USER:$USER "${HOST_OUTPUT_DIR_QUANT}"
sudo find "${HOST_OUTPUT_DIR_QUANT}" -type f -exec chmod 664 {} \;
sudo find "${HOST_OUTPUT_DIR_QUANT}" -type d -exec chmod 775 {} \;

echo "Ownership and permissions set. Verification:"
ls -la "${HOST_OUTPUT_DIR_QUANT}"

3. Running the Model with KTransformers

Ensure the Docker image approachingai/ktransformers:v0.2.4post1-AVX512 is pulled.
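
If the image is not already present locally, pull it first:

docker pull approachingai/ktransformers:v0.2.4post1-AVX512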

Single GPU (e.g., 1x RTX 4090)

1. Start Docker Container:

# Stop any previous instance
docker stop ktransformers || true # Ignore error if not running
docker rm ktransformers || true   # Ignore error if it does not exist

# Define your host model directory
HOST_MODEL_DIR="/home/mukul/dev-ai/models"
TARGET_GPU="1" # Specify GPU ID, e.g., "0", "1", or "all"

docker run --rm --gpus "\"device=${TARGET_GPU}\"" \
  -v "${HOST_MODEL_DIR}:/models" \
  -p 10002:10002 \
  --name ktransformers \
  -itd approachingai/ktransformers:v0.2.4post1-AVX512

docker exec -it ktransformers /bin/bash

2. Inside the Docker container shell, launch the server:

# Set environment variable for PyTorch CUDA memory allocation
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
CONTAINER_MERGED_MODEL_PATH="/models/mukul/DeepSeek-R1-0528-GGML-FP8-Hybrid/Q4_K_M_FP8"
CONTAINER_BASE_MODEL_CONFIG_PATH="/models/deepseek-ai/DeepSeek-R1-0528"

# Launch server
python3 ktransformers/server/main.py \
    --gguf_path "${CONTAINER_MERGED_MODEL_PATH}" \
    --model_path "${CONTAINER_BASE_MODEL_CONFIG_PATH}" \
    --model_name KVCache-ai/DeepSeek-R1-0528-q4km-fp8 \
    --cpu_infer 57 \
    --max_new_tokens 16384 \
    --cache_lens 24576 \
    --cache_q4 true \
    --temperature 0.6 \
    --top_p 0.95 \
    --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-fp8-linear-ggml-experts.yaml \
    --force_think \
    --use_cuda_graph \
    --host 0.0.0.0 \
    --port 10002

Note: The --optimize_config_path refers to a DeepSeek-V3 YAML. This V3 config is compatible with DeepSeek-R1-0528 and is the recommended choice for this setup.

Multi-GPU (e.g., 2x RTX 4090)

1. Start Docker Container:

# Stop any previous instance
docker stop ktransformers || true
docker rm ktransformers || true

# Define your host model directory
HOST_MODEL_DIR="/home/mukul/dev-ai/models"
TARGET_GPUS="0,1" # Specify GPU IDs

docker run --rm --gpus "\"device=${TARGET_GPUS}\"" \
  -v "${HOST_MODEL_DIR}:/models" \
  -p 10002:10002 \
  --name ktransformers \
  -itd approachingai/ktransformers:v0.2.4post1-AVX512

docker exec -it ktransformers /bin/bash

2. Inside the Docker container shell, launch the server:

# Set environment variable (optional for multi-GPU, but can be helpful)
# export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True 

# Define container paths
CONTAINER_MERGED_MODEL_PATH="/models/mukul/DeepSeek-R1-0528-GGML-FP8-Hybrid/Q4_K_M_FP8"
CONTAINER_BASE_MODEL_CONFIG_PATH="/models/deepseek-ai/DeepSeek-R1-0528"

# Launch server
python3 ktransformers/server/main.py \
    --gguf_path "${CONTAINER_MERGED_MODEL_PATH}" \
    --model_path "${CONTAINER_BASE_MODEL_CONFIG_PATH}" \
    --model_name KVCache-ai/DeepSeek-R1-0528-q4km-fp8 \
    --cpu_infer 57 \
    --max_new_tokens 24576 \
    --cache_lens 32768 \
    --cache_q4 true \
    --temperature 0.6 \
    --top_p 0.95 \
    --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-fp8-linear-ggml-experts.yaml \
    --force_think \
    --use_cuda_graph \
    --host 0.0.0.0 \
    --port 10002

Note: The --optimize_config_path still refers to a DeepSeek-V3 YAML. This is intentional.


4. Testing the Server

Once the server is running inside Docker (look for "Uvicorn running on http://0.0.0.0:10002"), open a new terminal on your host machine and test with curl:

curl http://localhost:10002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "KVCache-ai/DeepSeek-R1-0528-q4km-fp8",
        "messages": [{"role": "user", "content": "Explain the concept of Mixture of Experts in large language models in a simple way."}],
        "max_tokens": 250,
        "temperature": 0.6,
        "top_p": 0.95
    }'

A JSON response containing the model's output indicates success.
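
Because the server exposes an OpenAI-compatible API, a streaming request should also work; this sketch assumes the standard stream flag is honored by this KTransformers version.

curl -N http://localhost:10002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "KVCache-ai/DeepSeek-R1-0528-q4km-fp8",
        "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
        "max_tokens": 100,
        "stream": true
    }'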


5. Key Server Parameters

  • --gguf_path: Path inside the container to your merged hybrid model files.
  • --model_path: Path inside the container to the original base model's directory (containing config.json, tokenizer.json, etc.). KTransformers needs this for model configuration.
  • --model_name: Arbitrary name for the API endpoint. Used in client requests.
  • --cpu_infer: Number of CPU threads used for GGUF expert inference. Tune this to your CPU; for example, 57 on a 56-core/112-thread CPU leaves headroom for other tasks, and you can experiment with higher values (see the snippet after this list).
  • --max_new_tokens: Maximum number of tokens the model can generate in a single response.
  • --cache_lens: Maximum KV cache size in tokens. Directly impacts context length capacity and VRAM usage.
  • --cache_q4: (Boolean) If true, quantizes the KV cache to 4-bit. Crucial for saving VRAM, especially with long contexts.
  • --temperature, --top_p: Control generation randomness.
  • --optimize_config_path: Path to the KTransformers YAML file defining the layer offloading strategy (FP8 on GPU, GGUF on CPU). Essential for the hybrid setup.
  • --force_think: (KTransformers specific) Intended to force the model to emit its reasoning ("<think>") output, which DeepSeek-R1-style models use.
  • --use_cuda_graph: Enables CUDA graphs for potentially faster GPU execution by reducing kernel launch overhead.
  • --host, --port: Network interface and port for the server.
  • PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True: Environment variable to help PyTorch manage CUDA memory more flexibly and potentially avoid OOM errors.
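
To pick a starting value for --cpu_infer, inspect the host's CPU topology; using roughly the number of physical cores and leaving a few threads free for the OS and Docker is a heuristic, not an official rule.

# Inspect CPU topology on the host
nproc                                    # logical CPUs (threads)
lscpu | grep -E '^(Socket|Core|Thread)'  # sockets, cores per socket, threads per core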

6. Notes on KTransformers v0.3.1

As of 2025-06-02, the approachingai/ktransformers:v0.3.1-AVX512 image was reported as not working with either the single-GPU or the multi-GPU configuration described above.

Attempted Docker Start Command (v0.3.1 - Non-Functional):

# docker stop ktransformers # (if attempting to switch)
# docker run --rm --gpus '"device=0,1"' \
#   -v /home/mukul/dev-ai/models:/models \
#   -p 10002:10002 \
#   --name ktransformers \
#   -itd approachingai/ktransformers:v0.3.1-AVX512
#
# docker exec -it ktransformers /bin/bash

Attempted Server Launch (v0.3.1 - Non-Functional):

# # Inside the v0.3.1 Docker container shell
# PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 ktransformers/server/main.py \
#     --gguf_path /models/mukul/DeepSeek-R1-0528-GGML-FP8-Hybrid/Q4_K_M_FP8 \
#     --model_path /models/deepseek-ai/DeepSeek-R1-0528 \
#     --model_name KVCache-ai/DeepSeek-R1-0528-q4km-fp8 \
#     --cpu_infer 57 \
#     --max_new_tokens 32768 \
#     --cache_lens 65536 \
#     --cache_q4 true \
#     --temperature 0.6 \
#     --top_p 0.95 \
#     --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-fp8-linear-ggml-experts.yaml \
#     --force_think \
#     --use_cuda_graph \
#     --host 0.0.0.0 \
#     --port 10002

Stick to approachingai/ktransformers:v0.2.4post1-AVX512 for the configurations described above until compatibility issues with newer versions are resolved for this specific model and setup.


7. Available Optimize Config YAMLs (for reference)

The KTransformers repository contains various optimization YAML files. The ones used in this guide are for DeepSeek-V3 but are being applied to DeepSeek-R1-0528. Their direct compatibility or optimality for R1-0528 should be verified. If KTransformers releases specific YAMLs for DeepSeek-R1-0528, those should be preferred.

Reference list of some DeepSeek-V3 YAMLs (path ktransformers/optimize/optimize_rules/ inside the container); a command to enumerate them in your own container follows the list:

DeepSeek-V3-Chat-amx.yaml
DeepSeek-V3-Chat-fp8-linear-ggml-experts-serve-amx.yaml
DeepSeek-V3-Chat-fp8-linear-ggml-experts-serve.yaml
DeepSeek-V3-Chat-fp8-linear-ggml-experts.yaml
DeepSeek-V3-Chat-multi-gpu-4.yaml
DeepSeek-V3-Chat-multi-gpu-8.yaml
DeepSeek-V3-Chat-multi-gpu-fp8-linear-ggml-experts.yaml
DeepSeek-V3-Chat-multi-gpu-marlin.yaml
DeepSeek-V3-Chat-multi-gpu.yaml
DeepSeek-V3-Chat-serve.yaml
DeepSeek-V3-Chat.yaml
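
To see the full, current list in your running container (assuming its default working directory contains the ktransformers source tree, as the relative --optimize_config_path used above implies):

docker exec ktransformers ls ktransformers/optimize/optimize_rules/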