How to finetune Yuan 3.0 on your local machine - Practical Guide

We previously wrote about how to fine-tune Kimi 2.5, and we covered Yuan 3.0 in depth in another post. This time we're tackling Yuan 3.0 Flash, a 40B-parameter MoE model that activates only 3.7B parameters per inference. It was built specifically for enterprise document workflows: RAG, table understanding, summarization, and multimodal document processing. Here's how to fine-tune it on your own hardware.

Why Fine-Tune Yuan 3.0?

Yuan 3.0 Flash already beats GPT-5.1 on enterprise RAG benchmarks, leads on complex table understanding, and outperforms every major model on text summarization. But the base model is general-purpose. Fine-tuning lets you specialize it for your specific domain — financial reports, legal contracts, medical records, insurance claims — and get dramatically better results on your exact use case.

The MoE architecture makes this practical. With only 3.7B parameters activating per inference, the computational profile is closer to a 7B model than a 40B model. And with the official Docker environment and Megatron-LM training scripts, the team has made fine-tuning surprisingly accessible.

Key specs:

Total Parameters: 40B (MoE)

Active Parameters: 3.7B per token

Context Length: 128K tokens

Architecture: MoE with 40 layers, 32 experts/layer, Top-2 routing

Vision Encoder: InternViT-300M

License: Apache 2.0 (commercial use, no authorization required)

Quantization: 4-bit (INT4) available

Hardware Requirements

Unlike Kimi 2.5 (which requires KTransformers to offload to CPU), Yuan 3.0 Flash fine-tuning uses Megatron-LM with standard GPU parallelism. The MoE architecture helps, but you still need a multi-GPU setup.

For Supervised Fine-Tuning (SFT):

GPU: 8x A100 80GB (recommended) or 8x H100

CPU: 64+ cores

RAM: 512GB+

Storage: 500GB+ NVMe SSD

Network: High-bandwidth interconnect (NVLink/InfiniBand recommended for multi-node)

For Inference After Fine-Tuning:

GPU: 2x A100 80GB (BF16) or 1x A100 (INT4 quantized)

RAM: 256GB+

Don't have enterprise GPUs? The 4-bit quantized version (Yuan3.0-Flash-4bit on HuggingFace) can run inference on more modest hardware. For fine-tuning specifically, you can use cloud GPU instances — AWS p4d.24xlarge (8x A100) runs about $32/hour on-demand, or significantly less with spot instances.
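To budget a cloud run, a quick back-of-envelope calculation helps. The numbers below (hourly rate, spot discount, run length) are illustrative assumptions, not measured benchmarks:

```python
# Back-of-envelope cost estimate for a cloud SFT run.
# Rates and run length are illustrative assumptions, not benchmarks.
ON_DEMAND_RATE = 32.0   # USD/hour for an 8x A100 instance (e.g. p4d.24xlarge)
SPOT_DISCOUNT = 0.6     # assume spot pricing at ~40% of on-demand

def run_cost(hours: float, spot: bool = False) -> float:
    """Estimated USD cost for a fine-tuning run of the given length."""
    rate = ON_DEMAND_RATE * (1 - SPOT_DISCOUNT) if spot else ON_DEMAND_RATE
    return round(hours * rate, 2)

# A hypothetical 12-hour SFT run:
print(run_cost(12))             # on-demand
print(run_cost(12, spot=True))  # spot
```

Plug in your own measured step time and dataset size once you have a first run behind you; the point is just that spot pricing changes the math substantially.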

Step 1: Set Up the Environment

Yuan 3.0 provides an official Docker image with all dependencies pre-configured. This is the recommended approach — it avoids the pain of matching CUDA, Megatron-LM, and framework versions manually.

# Pull the official fine-tuning image
docker pull yuanlabai/rlhf_yuan:v1.0

# Start the container
docker run --gpus all -itd \
--network=host \
-v /path/to/yuan_3.0:/workspace/yuan_3.0 \
-v /path/to/dataset:/workspace/dataset \
-v /path/to/checkpoints:/workspace/checkpoints \
--cap-add=IPC_LOCK \
--device=/dev/infiniband \
--privileged \
--name yuan3_finetune \
--ulimit core=0 \
--ulimit memlock=-1 \
--ulimit stack=68719476736 \
--shm-size=1000G \
yuanlabai/rlhf_yuan:v1.0
docker exec -it yuan3_finetune bash

The key volume mounts:

  • Model weights → /workspace/yuan_3.0
  • Training data → /workspace/dataset
  • Output directory → /workspace/checkpoints

Step 2: Download the Model

# Download Yuan3.0 Flash (BF16 — for fine-tuning)
huggingface-cli download YuanLabAI/Yuan3.0-Flash \
  --local-dir /path/to/yuan_3.0

# Or download the 4-bit quantized version (for inference only)
huggingface-cli download YuanLabAI/Yuan3.0-Flash-4bit \
  --local-dir /path/to/yuan_3.0_4bit

Storage note:

The full BF16 model is large. Make sure you have at least 500GB of fast NVMe storage.

The 4-bit version is smaller but cannot be used for fine-tuning—only inference.
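A quick pre-flight check saves a failed multi-hour download. This sketch uses only the standard library; the 500GB threshold mirrors the note above:

```python
import shutil

def enough_space(path: str, required_gb: float = 500) -> bool:
    """True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb

# Point this at the directory you plan to download into:
print(enough_space("/", required_gb=500))
```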

Step 3: Prepare Your Training Data

Training data is a JSONL file: one JSON object per line, following this schema:
{
  "reward_method": "llm_math",
  "language": "en",
  "data_source": "your_domain",
  "prompt": "[{'content': 'Your user prompt here', 'role': 'user'}]",
  "ability": "your_task_type",
  "reward_model": "{'ground_truth': 'expected_answer', 'style': 'rule'}",
  "extra_info": "{'answer': 'expected_answer', 'enable_thinking_flag': false, 'expect_len': 500, 'index': 0, 'question': 'Your question', 'split': 'train'}"
}

Key fields explained

  • reward_method — Data category (e.g. llm_math, llm_code, rag)
  • language — "en" or "zh"
  • extra_info.enable_thinking_flag — Whether reasoning mode is used
  • extra_info.expect_len — Expected output length (guides training)

All seven top-level fields (reward_method, language, data_source, prompt, ability, reward_model, extra_info) are mandatory. Here is a concrete example for a financial RAG dataset:
{
  "reward_method": "rag",
  "language": "en",
  "data_source": "financial_reports",
  "prompt": "[{'content': 'Based on the following financial report excerpt, what was the Q3 revenue?\\n\\n[document text here]', 'role': 'user'}]",
  "ability": "rag",
  "reward_model": "{'ground_truth': '$4.2 billion', 'style': 'rule'}",
  "extra_info": "{'answer': '$4.2 billion', 'enable_thinking_flag': false, 'expect_len': 200, 'index': 0, 'question': 'Q3 revenue extraction', 'split': 'train'}"
}

For multimodal data (images + text), set:

flag_image = 1

Data volume guidance:

  • 100–200 examples → basic adaptation
  • 500–2,000 examples → strong specialization
  • Quality beats quantity
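Because missing fields can make the preprocessing step fail silently, it's worth running a quick schema check over the JSONL file first. A minimal sketch, using the seven mandatory field names from the schema above:

```python
import json

MANDATORY = ("reward_method", "language", "data_source",
             "prompt", "ability", "reward_model", "extra_info")

def validate_jsonl(path: str) -> list[str]:
    """Return a list of human-readable problems found in a JSONL training file."""
    problems = []
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {lineno}: invalid JSON ({exc})")
                continue
            for field in MANDATORY:
                if field not in record:
                    problems.append(f"line {lineno}: missing '{field}'")
    return problems
```

Run it before preprocessing and fix every reported line; an empty list means the file at least has the required structure.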

Step 4: Preprocess the Data

cd Yuan3.0/rlhf/verl

python examples/data_preprocess/data_preprocess_select_except_len.py \
  --input_path '<num_rows> /path/to/your_data.jsonl your_category thinking_flag' \
  --output_path '/workspace/dataset/processed/' \
  --split_type 'train' \
  --flag_image '0'

Note: input_path encodes multiple parameters in a single space-separated string: the number of rows, the JSONL path, the data category, and the thinking flag.

Set flag_image=1 for multimodal datasets.
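Because --input_path packs four values into one space-separated string, it's easy to get the order wrong. A tiny helper (hypothetical, not part of the official scripts) makes the packing explicit:

```python
def pack_input_path(num_rows: int, jsonl_path: str,
                    category: str, thinking_flag: str) -> str:
    """Build the space-separated --input_path argument:
    '<num_rows> <jsonl_path> <category> <thinking_flag>'."""
    return f"{num_rows} {jsonl_path} {category} {thinking_flag}"

print(pack_input_path(1000, "/workspace/dataset/raw/data.jsonl", "rag", "thinking_flag"))
```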

Step 5: Run Supervised Fine-Tuning

cd Yuan3.0/rlhf/megatron-lm

export CASE_CHECKPOINT_PATH=/workspace/checkpoints
export DATA_PATH='1 /workspace/dataset/processed/dataset'
export TOKENIZER_MODEL=/workspace/yuan_3.0/tokenizer.model
export CLIP_DOWNLOAD_PATH=/workspace/yuan_3.0/clip
export CHECKPOINT_PATH_LOAD=/workspace/yuan_3.0

bash examples/pretrain_yuan3.0_40B_sft.sh

The script already includes --finetune.

What to monitor:

  • Loss steadily decreasing → good
  • Plateau → need more/diverse data
  • Spikes → lower learning rate
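These checks are easy to automate over the training logs. The sketch below assumes log lines containing `lm loss: <value>` (Megatron's usual format; adjust the regex to your actual log output):

```python
import re

# Matches values like "2.500000E+00" or "2.5" after "lm loss:"
LOSS_RE = re.compile(r"lm loss:\s*([0-9.]+(?:E[+-][0-9]+)?)", re.IGNORECASE)

def extract_losses(log_text: str) -> list[float]:
    """Pull the lm-loss values out of a Megatron-style training log."""
    return [float(m.group(1)) for m in LOSS_RE.finditer(log_text)]

def flag_issues(losses, spike_ratio=1.5, plateau_window=5, plateau_eps=1e-3):
    """Return warnings for loss spikes and plateaus (thresholds are illustrative)."""
    warnings = []
    for prev, cur in zip(losses, losses[1:]):
        if cur > prev * spike_ratio:
            warnings.append(f"spike: {prev:.4f} -> {cur:.4f} (consider lowering LR)")
    tail = losses[-plateau_window:]
    if len(tail) == plateau_window and max(tail) - min(tail) < plateau_eps:
        warnings.append("plateau: loss barely moving (consider more/diverse data)")
    return warnings
```

Run it periodically against the log file, or wire it into whatever monitoring you already have.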

Step 6: Reinforcement Learning with DAPO (Optional)

# Head node
RAY_USE_IP_ADDRESS=True ray start --head \
  --num-cpus=64 --num-gpus=8 --port=6400 \
  --memory=873741824000 \
  --dashboard-host 0.0.0.0 \
  --node-ip-address=${YOUR_HEAD_NODE_IP}

# Worker node
RAY_USE_IP_ADDRESS=True ray start \
  --num-cpus=64 --num-gpus=8 \
  --memory=873741824000 \
  --dashboard-host 0.0.0.0 \
  --address ${YOUR_HEAD_NODE_IP}:6400 \
  --node-ip-address=${YOUR_WORKER_NODE_IP}

Run DAPO

cd Yuan3.0/rlhf/verl
bash recipe/dapo/run_dapo_yuanvl_megatron_40B.sh

Strategies:

  • DAPO (default) — balanced, recommended
  • GSPO — tighter clipping
  • SAPO — tau-based strategy

Step 7: Deploy with vLLM

docker pull yuanlabai/vllm:v0.11.0

This image bundles vLLM v0.11.0 with Yuan-specific handling for long prompts and output filtering.

# Launch the vLLM container
docker run --gpus all --privileged \
--ulimit stack=68719476736 \
--shm-size=1000G -itd \
-v /path/to/your/finetuned_model:/workspace/model \
--name yuan3_serving yuanlabai/vllm:v0.11.0

docker exec -it yuan3_serving bash

# Start the API server (2 GPUs with tensor parallelism)
python -m vllm.entrypoints.openai.api_server \
--model=/workspace/model \
--port 8100 \
--gpu-memory-utilization 0.9 \
--tensor-parallel-size 2 \
--trust-remote-code

Now you can query your fine-tuned model using the standard OpenAI client:

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8100/v1",
)

response = client.chat.completions.create(
    model="/workspace/model",
    messages=[
        {
            "role": "user",
            "content": "Summarize the key terms in this contract...",
        }
    ],
    max_tokens=1024,
    temperature=0.1,
)

print(response.choices[0].message.content)

The vLLM server gives you an OpenAI-compatible endpoint, which means you can drop your fine-tuned Yuan 3.0 into any existing application that uses the OpenAI API format. No code changes needed beyond swapping the base URL.

Yuan 3.0 vs. Kimi 2.5: Fine-Tuning Compared

We previously wrote a guide on fine-tuning Kimi 2.5, so here is how the two compare for anyone deciding which model to customize.

Architecture: Kimi 2.5 uses a MoE (Mixture of Experts) architecture with 1T total parameters and 32B activated per token. Yuan 3.0 Flash is a much smaller MoE: 40B total parameters with only 3.7B activated per token. Kimi 2.5 activates far more parameters per token, which generally means stronger per-token reasoning at the cost of much higher compute; Yuan 3.0 Flash is correspondingly cheaper to train and serve.

Training framework: Kimi 2.5 uses a standard Hugging Face Transformers + DeepSpeed pipeline, which most ML engineers are already comfortable with. Yuan 3.0 uses Megatron-LM with custom SFT scripts and a verl-based DAPO framework. The learning curve is steeper, but you get more control over distributed training and RL alignment.

Hardware requirements: Both models need serious GPU resources for full fine-tuning. Kimi 2.5 can work with 4x A100 80GB for LoRA fine-tuning. Yuan 3.0 Flash recommends 8x A100/H100 80GB for full SFT via Megatron-LM. If you are GPU-constrained, Kimi 2.5 with LoRA is the lighter option.

Multimodal capability: This is where Yuan 3.0 pulls ahead. It natively handles images, tables, and documents as first-class inputs. Kimi 2.5 has vision capabilities but Yuan 3.0 was purpose-built for document understanding from the ground up. If your fine-tuning use case involves invoices, contracts, or structured documents, Yuan 3.0 is the stronger starting point.

RL alignment: Yuan 3.0 ships with a full DAPO reinforcement learning pipeline out of the box, supporting three different preference optimization strategies. Kimi 2.5 relies on standard RLHF approaches through third-party frameworks. If post-training alignment matters for your use case, Yuan 3.0 gives you more built-in tools.

Tips and Gotchas

Use Docker. Do not try to set up Megatron-LM from scratch. The official Docker images (yuanlabai/rlhf_yuan:v1.0 and yuanlabai/vllm:v0.11.0) bundle all dependencies correctly. Fighting with CUDA versions and custom NCCL builds is not worth your time.

Data format is strict. The JSON schema for training data must include all mandatory fields (reward_method, language, data_source, prompt, ability, reward_model, extra_info). Missing any of them will cause the preprocessing script to fail silently or produce empty parquet files. Validate your JSON before running the pipeline.

Start with SFT, skip RL initially. The DAPO reinforcement learning step is powerful but adds complexity and training time. For most use cases, SFT alone will get you 80-90% of the way there. Only add the RL stage once you've validated your SFT results and identified specific behavioral gaps that RL can address.

Monitor your checkpoints. Megatron-LM saves checkpoints at regular intervals. These are large (tens of GB each). Make sure your checkpoint directory has enough space and set up a cleanup strategy. The --save-interval flag in the SFT script controls how often checkpoints are written.
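A simple cleanup sketch, assuming Megatron's usual `iter_<NNNNNNN>` checkpoint directory naming (verify the naming against your own output directory before deleting anything):

```python
import re
import shutil
from pathlib import Path

def prune_checkpoints(ckpt_dir: str, keep: int = 3) -> list[str]:
    """Delete all but the `keep` most recent iter_* checkpoint directories.
    Returns the names of the directories that were removed."""
    pattern = re.compile(r"iter_(\d+)$")
    candidates = sorted(
        (p for p in Path(ckpt_dir).iterdir() if p.is_dir() and pattern.search(p.name)),
        key=lambda p: int(pattern.search(p.name).group(1)),  # sort by iteration number
    )
    removed = []
    for p in candidates[:-keep] if keep else candidates:
        shutil.rmtree(p)
        removed.append(p.name)
    return removed
```

Run it from a cron job or after each training session; keeping the last two or three checkpoints is usually enough to recover from a bad run.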

INT4 is inference-only. The 4-bit quantized model (Yuan3.0-Flash-4bit) cannot be used for fine-tuning. You need the full BF16 weights for SFT. The quantized version is only useful for running inference on your already-fine-tuned model with reduced hardware requirements.

Multimodal training needs flag_image. If your training data includes images alongside text, you must set flag_image to 1 in the preprocessing step. Forgetting this flag means the image paths in your data will be ignored and your model will train on text-only inputs even if images are present.

Alternative: Lightweight Fine-Tuning with Unsloth

The Megatron-LM pipeline described above is the official approach and gives you maximum control, but it requires serious hardware and a steep learning curve. If you want a faster path to fine-tuning Yuan 3.0 Flash with significantly lower hardware requirements, Unsloth is worth considering.

Unsloth is an open-source library that optimizes LoRA and QLoRA fine-tuning to run 2-5x faster than standard Hugging Face training while using significantly less VRAM. It achieves this through custom CUDA kernels, intelligent memory management, and optimized backpropagation that avoids redundant computation. Critically, Unsloth produces identical outputs to standard training – the speedup comes from implementation efficiency, not approximation.

For Yuan 3.0 Flash, using Unsloth with QLoRA means you could potentially fine-tune on a single A100 80GB or even a pair of consumer RTX 4090s, rather than needing the full 8x A100 cluster that Megatron-LM requires. The trade-off is that you are doing LoRA fine-tuning rather than full SFT, so you are training a small adapter on top of the base model rather than modifying all parameters. For most practical use cases, this is perfectly adequate.
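A rough VRAM estimate shows why QLoRA makes this feasible. The arithmetic below is a back-of-envelope sketch: it counts only the base weights and ignores activations, optimizer state, KV cache, and framework overhead:

```python
def weight_memory_gb(total_params_b: float, bits: int) -> float:
    """Approximate GB needed to hold the model weights alone at a given bit width.
    Ignores activations, optimizer state, KV cache, and framework overhead."""
    return total_params_b * 1e9 * bits / 8 / 1e9

print(weight_memory_gb(40, 16))  # BF16 weights: 80.0 GB — already past one A100
print(weight_memory_gb(40, 4))   # 4-bit weights: 20.0 GB — leaves room for LoRA state
```

At 4-bit the 40B base weights fit comfortably on a single 80GB card, and the LoRA adapter itself is tiny by comparison, which is what puts single-GPU fine-tuning in reach.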

Here is how to set up Unsloth for Yuan 3.0 Flash:

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="YuanLabAI/Yuan3.0-Flash",
    max_seq_length=4096,
    dtype=None,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    use_gradient_checkpointing="unsloth",
)

Then set up your training data and configure the SFTTrainer. Note that SFTTrainer expects a plain text field per example, so you will need to flatten the Step 3 schema (prompt plus answer) into a single "text" string before training:

from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

dataset = load_dataset("json", data_files="your_data.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=4096,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)

trainer_stats = trainer.train()

After training completes, you can save and export your fine-tuned model in multiple formats:

# Save the LoRA adapter
model.save_pretrained("yuan3_lora_adapter")
tokenizer.save_pretrained("yuan3_lora_adapter")

# Merge LoRA weights and save full model
model.save_pretrained_merged("yuan3_merged", tokenizer, save_method="merged_16bit")

# Or export to GGUF for llama.cpp / Ollama
model.save_pretrained_gguf("yuan3_gguf", tokenizer, quantization_method="q4_k_m")

When to use Unsloth vs. the official Megatron-LM pipeline:

Use Unsloth when you want fast iteration on a single GPU or a small multi-GPU setup, when LoRA/QLoRA is sufficient for your use case, or when you want to export directly to GGUF for local inference. Use the Megatron-LM pipeline when you need full-parameter SFT across a large GPU cluster, when you want to run the DAPO reinforcement learning stage, or when you need maximum control over the distributed training process.

What's Next?

Yuan 3.0 Flash is particularly well-suited for enterprise document workflows. If your organization processes financial reports, legal contracts, insurance claims, or medical records, a fine-tuned Yuan 3.0 model can deliver meaningfully better results than the base model on your specific document types. The multimodal architecture means you can train on document images directly, not just extracted text.

The model is released under the Apache 2.0 license, which permits commercial use without authorization. You can find the full model weights on Hugging Face at YuanLabAI/Yuan3.0-Flash and the complete training codebase on GitHub at Yuan-lab-LLM/Yuan3.0.

If you are already running the Kimi 2.5 fine-tuning pipeline we covered previously, Yuan 3.0 is worth evaluating side-by-side, especially for document-heavy tasks where its purpose-built architecture gives it a genuine edge.