Added AI server setup : to be reviewed

@@ -0,0 +1,316 @@

# Running DeepSeek-R1-0528 (FP8 Hybrid) with KTransformers

This guide provides instructions for running the DeepSeek-R1-0528 model locally using a hybrid FP8 (GPU) and Q4_K_M GGUF (CPU) approach with KTransformers, managed via Docker. This setup is optimized for high-end hardware (e.g., NVIDIA RTX 4090, high-core-count CPU, significant RAM).

**Model Version:** DeepSeek-R1-0528
**KTransformers Version (Working):** `approachingai/ktransformers:v0.2.4post1-AVX512`

## Table of Contents

1. [Prerequisites](#prerequisites)
2. [Model Preparation](#model-preparation)
    * [Step 2a: Download FP8 Base Model (Host)](#step-2a-download-fp8-base-model-host)
    * [Step 2b: Download Q4\_K\_M GGUF Model (Host)](#step-2b-download-q4_k_m-gguf-model-host)
    * [Step 2c: Merge Models (Inside Docker)](#step-2c-merge-models-inside-docker)
    * [Step 2d: Set Ownership & Permissions (Host)](#step-2d-set-ownership--permissions-host)
3. [Running the Model with KTransformers](#running-the-model-with-ktransformers)
    * [Single GPU (e.g., 1x RTX 4090)](#single-gpu-eg-1x-rtx-4090)
    * [Multi-GPU (e.g., 2x RTX 4090)](#multi-gpu-eg-2x-rtx-4090)
4. [Testing the Server](#testing-the-server)
5. [Key Server Parameters](#key-server-parameters)
6. [Notes on KTransformers v0.3.1](#notes-on-ktransformers-v031)
7. [Available Optimize Config YAMLs (for reference)](#available-optimize-config-yamls-for-reference)
8. [Troubleshooting Tips](#troubleshooting-tips)

---
## 1. Prerequisites

* **Hardware:**
  * NVIDIA GPU with FP8 support (e.g., RTX 40-series, Hopper series).
  * High core-count CPU (e.g., Intel Xeon, AMD Threadripper).
  * Significant system RAM (ideally 512 GB for the larger GGUF experts and long contexts). The Q4_K_M experts for a model of this size can consume 320 GB+ on their own.
  * Fast SSD (NVMe recommended) for model storage.
* **Software (on Host):**
  * Linux OS (Ubuntu 24.04 LTS recommended).
  * NVIDIA drivers (up to date and supporting your GPU and CUDA version).
  * Docker Engine.
  * NVIDIA Container Toolkit (for GPU access within Docker).
  * Conda or a Python virtual environment manager.
  * Python 3.9+.
  * `huggingface_hub` and `hf_transfer` Python packages.
  * Git (for cloning KTransformers if you need to inspect YAMLs or contribute).
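
Before moving on, a quick way to confirm that the GPU, driver, and NVIDIA Container Toolkit pieces are wired up correctly. These are standard NVIDIA/Docker commands, not specific to this setup; the CUDA image tag is only an example and should match a version your driver supports.

```bash
# Host GPU and driver visible?
nvidia-smi

# Can Docker reach the GPU through the NVIDIA Container Toolkit?
# (the CUDA image tag below is just an example)
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```
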
---

## 2. Model Preparation

We assume your models will be downloaded and stored under `/home/mukul/dev-ai/models` on your host system. This path will be mounted into the Docker container as `/models`. Adjust paths if your setup differs.
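
For orientation, this is roughly the layout that the steps below produce under the host model directory (only the directories used in this guide are shown; it is a sketch derived from the paths used later, not an exhaustive listing):

```
/home/mukul/dev-ai/models
├── deepseek-ai/DeepSeek-R1-0528                       # FP8 base model (Step 2a)
├── unsloth/DeepSeek-R1-0528-GGUF/Q4_K_M               # Q4_K_M GGUF shards (Step 2b)
└── mukul/DeepSeek-R1-0528-GGML-FP8-Hybrid/Q4_K_M_FP8  # merged hybrid output (Step 2c)
```
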
### Step 2a: Download FP8 Base Model (Host)

Download the official DeepSeek-R1-0528 FP8 base model components.

```bash
# Ensure the required packages are installed. Conda is recommended for environment management.
pip install -U huggingface_hub hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1 # For faster downloads
```

```bash
# Define your host model directory
HOST_MODEL_DIR="/home/mukul/dev-ai/models"
BASE_MODEL_HF_ID="deepseek-ai/DeepSeek-R1-0528"
LOCAL_BASE_MODEL_PATH="${HOST_MODEL_DIR}/${BASE_MODEL_HF_ID}"

mkdir -p "${LOCAL_BASE_MODEL_PATH}"

echo "Downloading base model to: ${LOCAL_BASE_MODEL_PATH}"
huggingface-cli download --resume-download "${BASE_MODEL_HF_ID}" \
  --local-dir "${LOCAL_BASE_MODEL_PATH}"
```
### Step 2b: Download Q4_K_M GGUF Model (Host)

Download the Unsloth Q4_K_M GGUF version of DeepSeek-R1-0528 using the attached Python script, `download-gguf.py` (included later in this commit).
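
The script enumerates the `Q4_K_M/*.gguf` files in the Unsloth repo and fetches them into `~/dev-ai/models/unsloth/DeepSeek-R1-0528-GGUF`. Run it from the host with the environment from Step 2a active. If you prefer the CLI, a roughly equivalent sketch is shown below; it assumes a `huggingface-cli` version that supports the `--include` pattern flag.

```bash
# Option 1: the attached script
python download-gguf.py

# Option 2 (sketch): huggingface-cli with an include pattern
huggingface-cli download unsloth/DeepSeek-R1-0528-GGUF \
  --include "Q4_K_M/*.gguf" \
  --local-dir /home/mukul/dev-ai/models/unsloth/DeepSeek-R1-0528-GGUF
```
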
### Step 2c: Merge Models (Inside Docker)

This step uses the KTransformers Docker image to merge the FP8 base and Q4\_K\_M GGUF weights.

```bash
docker stop ktransformers || true  # stop any previous instance (ignore error if none is running)

docker run --rm --gpus '"device=1"' \
  -v /home/mukul/dev-ai/models:/models \
  --name ktransformers \
  -itd approachingai/ktransformers:v0.2.4post1-AVX512

docker exec -it ktransformers /bin/bash
```

```bash
python merge_tensors/merge_safetensor_gguf.py \
  --safetensor_path /models/deepseek-ai/DeepSeek-R1-0528 \
  --gguf_path /models/unsloth/DeepSeek-R1-0528-GGUF/Q4_K_M \
  --output_path /models/mukul/DeepSeek-R1-0528-GGML-FP8-Hybrid/Q4_K_M_FP8
```
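
Before moving on, it is worth sanity-checking that the merge produced output of a plausible size. The exact numbers depend on the quantization, so treat this only as a rough check.

```bash
# On the host: rough size/content check of the merged output
du -sh /home/mukul/dev-ai/models/mukul/DeepSeek-R1-0528-GGML-FP8-Hybrid/Q4_K_M_FP8
ls -lh /home/mukul/dev-ai/models/mukul/DeepSeek-R1-0528-GGML-FP8-Hybrid/Q4_K_M_FP8 | head
```
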
### Step 2d: Set Ownership & Permissions (Host)

After Docker creates the merged files, fix ownership and permissions on the host.

```bash
HOST_OUTPUT_DIR_QUANT="/home/mukul/dev-ai/models/mukul/DeepSeek-R1-0528-GGML-FP8-Hybrid/Q4_K_M_FP8" # As defined above

echo "Setting ownership for merged files in: ${HOST_OUTPUT_DIR_QUANT}"
sudo chown -R $USER:$USER "${HOST_OUTPUT_DIR_QUANT}"
sudo find "${HOST_OUTPUT_DIR_QUANT}" -type f -exec chmod 664 {} \;
sudo find "${HOST_OUTPUT_DIR_QUANT}" -type d -exec chmod 775 {} \;

echo "Ownership and permissions set. Verification:"
ls -la "${HOST_OUTPUT_DIR_QUANT}"
```

---
## 3. Running the Model with KTransformers

Ensure the Docker image `approachingai/ktransformers:v0.2.4post1-AVX512` is pulled.

### Single GPU (e.g., 1x RTX 4090)

**1. Start Docker Container:**

```bash
# Stop and remove any previous instance
docker stop ktransformers || true # Allow if not running
docker rm ktransformers || true   # Allow if not existing

# Define your host model directory
HOST_MODEL_DIR="/home/mukul/dev-ai/models"
TARGET_GPU="1" # Specify GPU ID, e.g., "0", "1", or "all"

docker run --rm --gpus "\"device=${TARGET_GPU}\"" \
  -v "${HOST_MODEL_DIR}:/models" \
  -p 10002:10002 \
  --name ktransformers \
  -itd approachingai/ktransformers:v0.2.4post1-AVX512

docker exec -it ktransformers /bin/bash
```

**2. Inside the Docker container shell, launch the server:**

```bash
# Set environment variable for PyTorch CUDA memory allocation
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

CONTAINER_MERGED_MODEL_PATH="/models/mukul/DeepSeek-R1-0528-GGML-FP8-Hybrid/Q4_K_M_FP8"
CONTAINER_BASE_MODEL_CONFIG_PATH="/models/deepseek-ai/DeepSeek-R1-0528"

# Launch server
python3 ktransformers/server/main.py \
  --gguf_path "${CONTAINER_MERGED_MODEL_PATH}" \
  --model_path "${CONTAINER_BASE_MODEL_CONFIG_PATH}" \
  --model_name KVCache-ai/DeepSeek-R1-0528-q4km-fp8 \
  --cpu_infer 57 \
  --max_new_tokens 16384 \
  --cache_lens 24576 \
  --cache_q4 true \
  --temperature 0.6 \
  --top_p 0.95 \
  --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-fp8-linear-ggml-experts.yaml \
  --force_think \
  --use_cuda_graph \
  --host 0.0.0.0 \
  --port 10002
```

*Note: The `--optimize_config_path` still refers to a `DeepSeek-V3` YAML. This V3 config is compatible with DeepSeek-R1-0528 and is the recommended choice for this setup.*
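
Loading the merged weights can take a while and puts heavy pressure on both system RAM and VRAM. While the server starts, it can help to watch resource usage from a separate host terminal; these are generic Linux/NVIDIA tools, nothing KTransformers-specific.

```bash
# GPU memory and utilization, refreshed every 2 seconds
watch -n 2 nvidia-smi

# System RAM / CPU usage (run in another terminal)
htop
```
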
### Multi-GPU (e.g., 2x RTX 4090)

**1. Start Docker Container:**

```bash
# Stop any previous instance
docker stop ktransformers || true
docker rm ktransformers || true

# Define your host model directory
HOST_MODEL_DIR="/home/mukul/dev-ai/models"
TARGET_GPUS="0,1" # Specify GPU IDs

docker run --rm --gpus "\"device=${TARGET_GPUS}\"" \
  -v "${HOST_MODEL_DIR}:/models" \
  -p 10002:10002 \
  --name ktransformers \
  -itd approachingai/ktransformers:v0.2.4post1-AVX512

docker exec -it ktransformers /bin/bash
```

**2. Inside the Docker container shell, launch the server:**

```bash
# Set environment variable (optional for multi-GPU, but can be helpful)
# export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Define container paths
CONTAINER_MERGED_MODEL_PATH="/models/mukul/DeepSeek-R1-0528-GGML-FP8-Hybrid/Q4_K_M_FP8"
CONTAINER_BASE_MODEL_CONFIG_PATH="/models/deepseek-ai/DeepSeek-R1-0528"

# Launch server
python3 ktransformers/server/main.py \
  --gguf_path "${CONTAINER_MERGED_MODEL_PATH}" \
  --model_path "${CONTAINER_BASE_MODEL_CONFIG_PATH}" \
  --model_name KVCache-ai/DeepSeek-R1-0528-q4km-fp8 \
  --cpu_infer 57 \
  --max_new_tokens 24576 \
  --cache_lens 32768 \
  --cache_q4 true \
  --temperature 0.6 \
  --top_p 0.95 \
  --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-fp8-linear-ggml-experts.yaml \
  --force_think \
  --use_cuda_graph \
  --host 0.0.0.0 \
  --port 10002
```

*Note: The `--optimize_config_path` still refers to a `DeepSeek-V3` YAML. This is intentional.*
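
If the server fails to start in multi-GPU mode or only one GPU appears to be used, first confirm that both GPUs are actually visible inside the container. A quick check from the host, assuming the container is running under the name used above:

```bash
docker exec ktransformers nvidia-smi -L   # should list two GPUs (IDs 0 and 1)
```
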
---

## 4. Testing the Server

Once the server is running inside Docker (look for "Uvicorn running on http://0.0.0.0:10002"), open a **new terminal on your host machine** and test with `curl`:

```bash
curl http://localhost:10002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "KVCache-ai/DeepSeek-R1-0528-q4km-fp8",
    "messages": [{"role": "user", "content": "Explain the concept of Mixture of Experts in large language models in a simple way."}],
    "max_tokens": 250,
    "temperature": 0.6,
    "top_p": 0.95
  }'
```

A JSON response containing the model's output indicates success.
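
The server exposes an OpenAI-style `/v1/chat/completions` endpoint (as the `curl` call above shows), so OpenAI-compatible clients should also work against it. Below is a minimal sketch using the `openai` Python package, assuming full client compatibility holds and that `pip install openai` has been run; the API key is a dummy value since no authentication is configured.

```python
from openai import OpenAI

# Point the client at the local KTransformers server (OpenAI-compatible endpoint assumed).
client = OpenAI(base_url="http://localhost:10002/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="KVCache-ai/DeepSeek-R1-0528-q4km-fp8",
    messages=[{"role": "user", "content": "Summarize what a Mixture of Experts layer does."}],
    max_tokens=250,
    temperature=0.6,
    top_p=0.95,
)
print(response.choices[0].message.content)
```
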
---

## 5. Key Server Parameters

* `--gguf_path`: Path inside the container to your **merged** hybrid model files.
* `--model_path`: Path inside the container to the **original base model's** directory (containing `config.json`, `tokenizer.json`, etc.). KTransformers needs this for model configuration.
* `--model_name`: Arbitrary name for the API endpoint. Used in client requests.
* `--cpu_infer`: Number of CPU threads used for GGUF expert inference. Tune this to your CPU: e.g., `57` on a 56-core/112-thread CPU leaves plenty of logical CPUs free for other tasks, and higher values may help if the machine is otherwise idle. See the sketch after this list for inspecting your core count.
* `--max_new_tokens`: Maximum number of tokens the model can generate in a single response.
* `--cache_lens`: Maximum KV cache size in tokens. Directly impacts context length capacity and VRAM usage.
* `--cache_q4`: (Boolean) If `true`, quantizes the KV cache to 4-bit. **Crucial for saving VRAM**, especially with long contexts.
* `--temperature`, `--top_p`: Control generation randomness.
* `--optimize_config_path`: Path to the KTransformers YAML file defining the layer offloading strategy (FP8 on GPU, GGUF experts on CPU). **Essential for the hybrid setup.**
* `--force_think`: (KTransformers-specific) Forces the model's "thinking" (reasoning) output to be emitted, which is relevant for R1-style reasoning models.
* `--use_cuda_graph`: Enables CUDA graphs for potentially faster GPU execution by reducing kernel launch overhead.
* `--host`, `--port`: Network interface and port for the server.
* `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`: Environment variable that lets PyTorch manage CUDA memory more flexibly and can help avoid OOM errors.
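
The right `--cpu_infer` value depends on your CPU topology. A quick way to inspect it before picking a number (generic Linux commands):

```bash
# Inspect CPU topology to choose a --cpu_infer value.
lscpu | grep -E '^(CPU\(s\)|Thread|Core|Socket)'
nproc   # total logical CPUs
# A common starting point is a value somewhat below the logical CPU count,
# leaving headroom for the OS and the GPU driver threads.
```
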
---

## 6. Notes on KTransformers v0.3.1

As of 2025-06-02, the `approachingai/ktransformers:v0.3.1-AVX512` image was reported as **not working** with either the single-GPU or multi-GPU configuration described above.

**Attempted Docker Start Command (v0.3.1 - Non-Functional):**

```bash
# docker stop ktransformers # (if attempting to switch)
# docker run --rm --gpus '"device=0,1"' \
#   -v /home/mukul/dev-ai/models:/models \
#   -p 10002:10002 \
#   --name ktransformers \
#   -itd approachingai/ktransformers:v0.3.1-AVX512
#
# docker exec -it ktransformers /bin/bash
```

**Attempted Server Launch (v0.3.1 - Non-Functional):**

```bash
# # Inside the v0.3.1 Docker container shell
# PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 ktransformers/server/main.py \
#   --gguf_path /models/mukul/DeepSeek-R1-0528-GGML-FP8-Hybrid/Q4_K_M_FP8 \
#   --model_path /models/deepseek-ai/DeepSeek-R1-0528 \
#   --model_name KVCache-ai/DeepSeek-R1-0528-q4km-fp8 \
#   --cpu_infer 57 \
#   --max_new_tokens 32768 \
#   --cache_lens 65536 \
#   --cache_q4 true \
#   --temperature 0.6 \
#   --top_p 0.95 \
#   --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-fp8-linear-ggml-experts.yaml \
#   --force_think \
#   --use_cuda_graph \
#   --host 0.0.0.0 \
#   --port 10002
```

Stick to `approachingai/ktransformers:v0.2.4post1-AVX512` for the configurations described above until compatibility issues with newer versions are resolved for this specific model and setup.

---
## 7. Available Optimize Config YAMLs (for reference)

The KTransformers repository contains various optimization YAML files. The ones used in this guide are written for `DeepSeek-V3` but are applied here to `DeepSeek-R1-0528`; their direct compatibility and optimality for R1-0528 should be verified. If KTransformers releases YAMLs specific to DeepSeek-R1-0528, prefer those.

Reference list of some `DeepSeek-V3` YAMLs (under `ktransformers/optimize/optimize_rules/` inside the container):

```
DeepSeek-V3-Chat-amx.yaml
DeepSeek-V3-Chat-fp8-linear-ggml-experts-serve-amx.yaml
DeepSeek-V3-Chat-fp8-linear-ggml-experts-serve.yaml
DeepSeek-V3-Chat-fp8-linear-ggml-experts.yaml
DeepSeek-V3-Chat-multi-gpu-4.yaml
DeepSeek-V3-Chat-multi-gpu-8.yaml
DeepSeek-V3-Chat-multi-gpu-fp8-linear-ggml-experts.yaml
DeepSeek-V3-Chat-multi-gpu-marlin.yaml
DeepSeek-V3-Chat-multi-gpu.yaml
DeepSeek-V3-Chat-serve.yaml
DeepSeek-V3-Chat.yaml
```
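
To check which optimize-rule YAMLs actually ship with the image you are running (the list above may drift between releases), you can list them from the host. This assumes the container's default working directory contains the KTransformers source tree, as the launch commands above imply.

```bash
docker exec ktransformers ls ktransformers/optimize/optimize_rules/
```
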

host/AIServerSetup/06-DeepSeek-R1-0528/download-gguf.py (new file, 65 lines)
@@ -0,0 +1,65 @@
from huggingface_hub import hf_hub_download, list_repo_files
import os

# Configuration
repo_id = "unsloth/DeepSeek-R1-0528-GGUF"
folder_in_repo = "Q4_K_M"
file_extension = ".gguf"
# Expand the tilde (~) to the user's home directory
local_base_dir = os.path.expanduser("~/dev-ai/models/unsloth/DeepSeek-R1-0528-GGUF")

# Create the base directory (hf_hub_download would also create it, but being explicit is fine)
os.makedirs(local_base_dir, exist_ok=True)

# List and download matching files
print(f"Listing files from {repo_id} in folder {folder_in_repo} with extension {file_extension}...")
try:
    all_repo_files = list_repo_files(repo_id, repo_type="model")
    files_to_download = [
        f for f in all_repo_files
        if f.startswith(folder_in_repo + "/") and f.endswith(file_extension)
    ]

    if not files_to_download:
        print(f"No files found in '{folder_in_repo}' with extension '{file_extension}'.")
    else:
        print(f"Found {len(files_to_download)} file(s) to download.")

        for filename_in_repo in files_to_download:
            print(f"Downloading {filename_in_repo}...")
            # `filename` is the path of the file within the repository; with `local_dir` set,
            # the repo's folder structure is preserved, so "Q4_K_M/file.gguf" is saved as
            # local_base_dir/Q4_K_M/file.gguf.
            try:
                downloaded_file_path = hf_hub_download(
                    repo_id=repo_id,
                    filename=filename_in_repo,
                    local_dir=local_base_dir,
                    local_dir_use_symlinks=False,  # store real files, not symlinks into the HF cache
                )
                print(f"Successfully downloaded and saved to: {downloaded_file_path}")
            except Exception as e:
                print(f"Error downloading {filename_in_repo}: {str(e)}")

except Exception as e:
    print(f"Error listing files from repository: {str(e)}")

print("Download process complete.")