What is Gemma 4 and how to use and fine-tune it

Google just dropped Gemma 4, calling it their most capable family of open models to date. Built from the same research behind Gemini, these models pack serious multimodal intelligence into packages small enough to run on your phone and large enough to compete with frontier models on a server. If you’ve been following the open-weight model space, this is a big deal — and not just because of the benchmarks.

What is Gemma 4?

Gemma 4 is a family of generative AI models from Google DeepMind, released with open weights for commercial use. The family spans four model sizes across three distinct architectures, each targeting different hardware environments.

The smallest models are the E2B and E4B (“E” stands for effective parameters). These use a novel Per-Layer Embedding (PLE) architecture designed for ultra-mobile, edge, and browser deployment — think Pixel phones and Chrome. The 31B model is a traditional dense architecture that bridges server-grade performance with local execution. And the 26B A4B is a Mixture-of-Experts (MoE) model that activates only 4 billion parameters per token while keeping all 26 billion in memory, since any expert may be routed to on the next token. This makes it highly efficient for throughput-intensive workloads.
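To make the MoE idea concrete, here is a toy top-1 routing sketch in plain Python. It assumes nothing about Gemma 4's actual router; the four "experts" are just stand-in functions, and the gate is an ordinary softmax:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Four toy "experts": each is just a simple function of the input here.
experts = [lambda x, k=k: x * (k + 1) for k in range(4)]

def moe_forward(x: float, gate_scores: list[float]) -> float:
    """Top-1 MoE: run only the highest-scoring expert, weighted by its gate probability."""
    probs = softmax(gate_scores)
    top = max(range(len(probs)), key=probs.__getitem__)
    # Only one expert executes per token, but all experts must stay
    # resident in memory, because the next token may route elsewhere.
    return experts[top](x) * probs[top]

out = moe_forward(2.0, [0.1, 2.0, 0.3, 0.5])  # expert 1 wins the gate here
```

This is why the 26B A4B needs the full 26 billion parameters loaded even though only 4 billion do work per token: the router's choice changes from token to token.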

All Gemma 4 models are multimodal, handling text and image input natively. The smaller E2B and E4B models also support audio and video input. Context windows range from 128K tokens for the small models up to 256K for the medium ones. Every model includes configurable thinking modes for reasoning tasks, native system prompt support, and built-in function-calling for agentic workflows.

Is it truly open source? What about the license?

This is where Gemma 4 makes a statement. Google released the entire family under the Apache 2.0 license. This is a genuine open-source license — one of the most permissive in the software world. You can download the weights, modify them, fine-tune them, deploy them commercially, and redistribute them. Aside from Apache 2.0's standard attribution and notice requirements, there are no strings attached: no usage restrictions and no phone-home requirements.

This is a significant shift from previous Gemma releases, which used Google’s custom “Gemma Terms of Use” that imposed certain restrictions. Apache 2.0 puts Gemma 4 on equal footing with the most permissive open models out there.

How does this compare to DeepSeek?

DeepSeek’s models (like DeepSeek-V3 and R1) are also released under open licenses, typically the MIT License, which is similarly permissive. Both Apache 2.0 and MIT allow commercial use, modification, and redistribution. The practical difference between them is minimal for most developers — Apache 2.0 includes an explicit patent grant, which provides some additional legal protection, while MIT is slightly simpler in its terms.

The key point is that both Gemma 4 and DeepSeek models are genuinely open in a way that matters for production use. Neither requires you to share your modifications, neither restricts your use case, and neither charges licensing fees. This puts them in a fundamentally different category from models that use restrictive community licenses or that only provide API access.

What hardware do you actually need?

This is one of the most practical questions, and the answer varies dramatically depending on which model you choose and what precision you run at. Here’s the breakdown based on Google’s published memory requirements:

Gemma 4 E2B needs about 9.6 GB of VRAM at full 16-bit precision, 4.6 GB at 8-bit, and just 3.2 GB at 4-bit quantization. This means you can run the smallest Gemma 4 model on a laptop GPU or even a decent integrated GPU setup. Some users on Reddit have reported running it with as little as 6 GB of RAM using quantized GGUF checkpoints via llama.cpp.

Gemma 4 E4B needs about 15 GB at 16-bit, 7.5 GB at 8-bit, and 5 GB at 4-bit. An NVIDIA RTX 3060 12 GB or RTX 4060 Ti would handle the quantized versions comfortably.

Gemma 4 31B is the heavyweight. At full precision it needs about 58 GB of VRAM, which means you’re looking at multiple high-end GPUs or a single A100 80 GB. At 4-bit quantization it drops to about 17.4 GB, which fits on a single RTX 4090 or RTX 3090.

Gemma 4 26B A4B (MoE) needs about 48 GB at full precision and 15.6 GB at 4-bit. Despite only activating 4 billion parameters per token, all 26 billion must be loaded into memory for the routing mechanism to work. At 4-bit quantization, it’s manageable on a single RTX 4090.

Keep in mind these numbers only cover loading the model weights. The context window (KV cache) adds significant memory on top, especially at the 128K–256K token context lengths these models support. Fine-tuning requires even more memory, often 2–4x the inference requirement depending on your approach.
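The weight-only portion of these numbers is simple arithmetic: parameter count times bytes per parameter. The sketch below reproduces that back-of-envelope math; real deployments add overhead for quantization scales, activations, and the KV cache, which is why published figures run higher than the raw product:

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Rough weight-only memory estimate: params * bytes per param, in GiB."""
    return num_params * (bits_per_param / 8) / 2**30

# Gemma 4 31B dense at common precisions (weights only)
for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{weight_memory_gb(31e9, bits):.1f} GB")
# 16-bit lands near the ~58 GB figure quoted above; quantized runtimes
# carry per-tensor scales and other overhead, so real usage is higher
# than the raw 4-bit estimate.
```

The same function applied to the 26B A4B shows why quantization matters so much for MoE models: all 26 billion parameters count toward memory, not just the 4 billion active ones.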

How to use Gemma 4 effectively

Getting started is straightforward. You can download models from Hugging Face or Kaggle. For local inference, Ollama provides the easiest path — just pull the model and start chatting. For more control, llama.cpp with GGUF checkpoints gives you fine-grained quantization options. Google AI Studio also hosts the 31B and 26B models if you want to try them without downloading anything.

For production deployment, vLLM and NVIDIA’s TensorRT-LLM both support Gemma 4. Unsloth provides optimized and quantized models for efficient local fine-tuning. If you’re building agentic applications, the built-in function-calling support means you can use Gemma 4 as a tool-using agent without needing external frameworks to handle tool dispatch.

One of the most effective ways to use Gemma 4 is with its configurable thinking mode. For straightforward tasks, you can use standard generation. For complex reasoning, coding, or multi-step problems, enable thinking mode to let the model work through its reasoning before producing an answer. This is similar to how models like DeepSeek-R1 or Claude use chain-of-thought reasoning, but it’s built natively into Gemma 4.
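The exact tool-call format Gemma 4 emits isn't covered here, so the sketch below assumes the model returns a JSON object naming a tool and its arguments (a common convention); the dispatch loop itself is model-agnostic, and the tool registry is made up for illustration:

```python
import json

# Hypothetical tool registry; the tool names and signatures are our own.
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
    "add": lambda a, b: a + b,
}

def dispatch(model_output: str) -> str:
    """Parse a JSON tool call emitted by the model and execute it."""
    call = json.loads(model_output)
    fn = TOOLS[call["name"]]
    result = fn(**call["arguments"])
    return str(result)

# Simulated model output (a real run would come from model.generate):
print(dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}'))  # prints 5
```

In a real agent loop, dispatch would run on the model's generated text, and its result would be appended to the conversation as a tool response before the next generation step.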

Here's a quick Python example to get Gemma 4 running locally on your machine using the Hugging Face transformers library with 4-bit quantization:

# Install dependencies first:
# pip install transformers torch bitsandbytes accelerate

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Configure 4-bit quantization for lower VRAM usage
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model_id = "google/gemma-4-E4B-it"  # instruction-tuned E4B variant

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Create a conversation
messages = [
    {"role": "user", "content": "Explain quantum computing in simple terms."},
]

# Tokenize and generate
input_ids = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

# Decode and print the response
response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)

What Gemma 4 is good for

Coding and development — Gemma 4 shows notable improvements on coding benchmarks. The 31B model in particular performs well on code generation, debugging, and code review tasks. Combined with function-calling support, it’s a strong choice for AI-assisted development workflows.

Multimodal understanding — All models handle text and image input with variable aspect ratio and resolution support. The smaller models add audio and video processing. This makes Gemma 4 useful for document understanding, image analysis, and building applications that need to reason across modalities.

On-device and edge deployment — The E2B and E4B models are specifically designed for running on phones, tablets, and browsers. If you need an AI model that works offline or with low latency on consumer hardware, these are among the best options available.

Agentic applications — Built-in function-calling and system prompt support make Gemma 4 well-suited for building autonomous agents. The configurable thinking mode helps with complex planning and multi-step reasoning that agents require.

Long-context tasks — With context windows up to 256K tokens, Gemma 4 can process entire codebases, long documents, or extended conversations without losing track of earlier context. This is competitive with the best proprietary models.

Privacy-sensitive applications — Since you can run these models entirely on your own hardware, your data never leaves your infrastructure. This matters for healthcare, legal, financial, and government applications where data residency is a requirement.

What Gemma 4 is not good for

Competing with the largest frontier models on raw capability — Gemma 4 is impressive for its size class, but the 31B model won’t match GPT-4o, Claude Opus, or Gemini Ultra on the most demanding tasks. If you need absolute peak performance and cost isn’t a concern, proprietary APIs still have an edge for the hardest reasoning and generation tasks.

Tasks requiring massive world knowledge — Smaller models inherently have less parametric knowledge baked into their weights. The E2B model running on a phone won’t have the same depth of factual recall as a 400B+ parameter model. For knowledge-intensive applications, you’ll want to pair Gemma 4 with retrieval-augmented generation (RAG) to supplement its built-in knowledge.

Very long generation at high throughput — Running large Gemma 4 models locally requires significant hardware investment. If you need to serve thousands of concurrent users generating long responses, the infrastructure costs of self-hosting may exceed the cost of using a managed API service, at least until you reach significant scale.

Output in non-English languages — Like most open models, Gemma 4 is strongest in English. While it supports many languages, its performance on low-resource languages may not match dedicated multilingual models or the largest proprietary systems that have been specifically tuned for global language coverage.

How to download Gemma 4

One of the best things about Gemma 4 is how easy it is to get your hands on the model weights. Google has made them available through multiple channels, so you can pick whichever fits your workflow.

Hugging Face

The most popular option for developers and researchers. All Gemma 4 variants are hosted on Hugging Face under the google organization. You can browse the full collection at huggingface.co/collections/google/gemma-4. To download a model, you’ll need a free Hugging Face account. Once logged in, you can pull any variant using the transformers library or the Hugging Face CLI.

For example, to load the instruction-tuned E4B model in Python, install the latest transformers (version 5.5.0 or later) and torch, then use AutoModelForImageTextToText.from_pretrained("google/gemma-4-E4B-it"). The model weights download automatically on first use.

Ollama (easiest for local use)

If you just want to run Gemma 4 on your laptop with zero hassle, Ollama is the way to go. Install Ollama, then run a single command: ollama run gemma4. That’s it. Ollama handles downloading the quantized weights, setting up the runtime, and exposing a local API. It’s the fastest path from zero to a working local model. The Gemma 4 models on Ollama have already passed a million downloads.

Kaggle

Google also hosts the model weights on Kaggle at kaggle.com/models/google/gemma-4. This is useful if you’re already working in a Kaggle notebook environment and want to load the weights directly without any additional downloads.

Google AI Studio

If you don’t want to download anything at all, you can try Gemma 4 directly in your browser through Google AI Studio at aistudio.google.com. Select the Gemma 4 model and start prompting. No setup required—this is great for quick experiments before committing to a local installation.

Can you fine-tune Gemma 4?

Yes, absolutely. Fine-tuning is one of the biggest advantages of open-weight models like Gemma 4, and Google has made the process straightforward. Since the weights are fully open under Apache 2.0, you have complete freedom to adapt the model to your specific domain—whether that’s legal documents, medical records, customer support, code generation for your tech stack, or anything else.

The most practical fine-tuning method for Gemma 4 is QLoRA (Quantized Low-Rank Adaptation). Instead of training the entire model (which would require enormous GPU memory), QLoRA quantizes the base model to 4-bit precision and freezes those weights. Then it attaches small trainable adapter layers (LoRA) and only trains those. This dramatically reduces the memory requirements while maintaining high performance.
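The memory savings follow from simple arithmetic: a rank-r adapter on a d_in x d_out linear layer adds only r * (d_in + d_out) trainable parameters, against d_in * d_out frozen base weights. A sketch with hypothetical layer sizes:

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by a rank-r LoRA adapter (A: r x d_in, B: d_out x r)."""
    return r * (d_in + d_out)

# Hypothetical 4096 x 4096 attention projection with rank-16 adapters
base = 4096 * 4096          # frozen base weights in this layer
adapter = lora_params(4096, 4096, 16)
print(f"adapter params: {adapter}")                  # 131072
print(f"fraction trainable: {adapter / base:.4f}")   # 0.0078, under 1%
```

Repeat that across every targeted layer and the trainable fraction stays well under one percent of the model, which is why optimizer state and gradients fit in consumer VRAM.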

According to Unsloth’s benchmarks, the VRAM requirements for fine-tuning Gemma 4 with LoRA are surprisingly accessible. The E2B model can be fine-tuned with just 8–10 GB VRAM—that’s a consumer-grade GPU like an RTX 3060 or even a free Google Colab T4 instance. The E4B model needs about 17 GB VRAM. For the larger 26B A4B model, you’re looking at around 22 GB with QLoRA, and the 31B dense model needs 40+ GB.

How to fine-tune: step by step

The fine-tuning workflow uses Hugging Face’s TRL (Transformer Reinforcement Learning) library together with PEFT (Parameter-Efficient Fine-Tuning) and bitsandbytes for quantization. Here’s the general process:

1. Install the dependencies: pip install trl peft datasets bitsandbytes transformers torch accelerate.
2. Load your base model with 4-bit quantization using BitsAndBytesConfig.
3. Configure your LoRA adapters using LoraConfig from PEFT — typical settings include a rank of 16, alpha of 16, and dropout of 0.05, targeting all linear layers.
4. Prepare your training dataset in the conversational format (system/user/assistant message pairs).
5. Create a SFTTrainer instance from TRL and call trainer.train().

Google provides a complete working example in their official documentation that fine-tunes Gemma on a text-to-SQL dataset using a T4 GPU in Google Colab.

Fine-tuning tools and platforms

Unsloth is a popular community tool that optimizes fine-tuning performance and memory efficiency. It supports all Gemma 4 variants and claims significant speedups over vanilla Hugging Face training. Unsloth Studio even offers a no-code UI for fine-tuning if you’d rather skip the Python scripts. Keras is another option—Google provides an official guide for fine-tuning Gemma with LoRA using the Keras framework, which some developers prefer for its simplicity. For enterprise-scale fine-tuning, Vertex AI on Google Cloud handles the infrastructure for you and supports distributed training across multiple GPUs.

Here's a complete example of fine-tuning Gemma 4 on your local machine using Unsloth with QLoRA. This script loads the model in 4-bit, attaches LoRA adapters, trains on a sample dataset, and saves the result:

# Install Unsloth (includes all dependencies)
# pip install unsloth

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
import torch

# 1. Load model with Unsloth (4-bit quantization by default)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-4-E4B-it-unsloth-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# 2. Configure LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                    # LoRA rank
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,          # Unsloth recommends 0 for speed
    bias="none",
    use_gradient_checkpointing="unsloth",  # 30% less VRAM
    random_state=3407,
)

# 3. Load and format your dataset
# Using a sample instruction dataset - replace with your own
dataset = load_dataset("yahma/alpaca-cleaned", split="train")

def format_prompt(example):
    """Format dataset into chat template."""
    if example["input"]:
        text = ("Below is an instruction with input. Write a response.\n\n"
                f"### Instruction:\n{example['instruction']}\n\n"
                f"### Input:\n{example['input']}\n\n"
                f"### Response:\n{example['output']}")
    else:
        text = ("Below is an instruction. Write a response.\n\n"
                f"### Instruction:\n{example['instruction']}\n\n"
                f"### Response:\n{example['output']}")
    return {"text": text + tokenizer.eos_token}

dataset = dataset.map(format_prompt)

# 4. Set up training
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,           # Set higher for real training
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)

# 5. Train!
trainer_stats = trainer.train()

# 6. Save the fine-tuned model
# Option A: Save LoRA adapters only (small file)
model.save_pretrained("gemma4-finetuned-lora")
tokenizer.save_pretrained("gemma4-finetuned-lora")

# Option B: Merge LoRA into base model and save full weights
# model.save_pretrained_merged("gemma4-finetuned-merged", tokenizer)

# 7. Test the fine-tuned model
FastLanguageModel.for_inference(model)
inputs = tokenizer(
    ["Below is an instruction. Write a response.\n\n"
     "### Instruction:\nExplain what a neural network is.\n\n"
     "### Response:\n"],
    return_tensors="pt",
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

After fine-tuning: merging and deploying

Once your fine-tuning run is complete, you have two options. You can keep the LoRA adapters separate and load them on top of the base model at inference time. Or you can merge the adapters directly into the base model weights using PEFT’s merge_and_unload method, which produces a standard model that can be deployed with any serving stack like vLLM or text-generation-inference. The merged model works exactly like a regular Hugging Face model—no special adapter loading code required.
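Mathematically, merging just folds the low-rank update into the base weight: W_merged = W + (alpha / r) * B @ A, after which the adapter matrices can be discarded. This tiny pure-Python sketch illustrates the identity on a 2x2 toy weight (it is not PEFT's actual implementation):

```python
def matmul(X, Y):
    """Plain nested-list matrix multiply."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def merge(W, A, B, alpha, r):
    """W_merged = W + (alpha / r) * B @ A, matching the LoRA forward pass."""
    BA = matmul(B, A)
    s = alpha / r
    return [[W[i][j] + s * BA[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# Toy 2x2 base weight with rank-1 adapters (alpha = r = 1, so scale is 1)
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]            # r x d_in
B = [[0.5], [0.25]]         # d_out x r
print(merge(W, A, B, alpha=1, r=1))  # [[1.5, 1.0], [0.25, 1.5]]
```

Because the merged matrix has exactly the same shape as the original weight, the serving stack never needs to know the model was fine-tuned with adapters at all.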

The combination of Apache 2.0 licensing and accessible fine-tuning means you can build a genuinely custom AI model tailored to your exact use case, deploy it however you want, and never worry about usage restrictions or API costs. That’s the real power of open-weight models.

The bottom line

Gemma 4 under Apache 2.0 represents a genuine inflection point for open AI models. Google is no longer hedging with custom licenses or restrictive terms. They’re putting frontier-capable models into the open under the same license that governs much of the world’s infrastructure software. Whether you’re a solo developer running the E2B on a laptop, a startup deploying the MoE model for cost-effective inference, or an enterprise that needs the 31B for on-premises deployment, there’s a Gemma 4 model sized for your use case.

The combination of genuine open-source licensing, multimodal capabilities, competitive benchmarks, and the range from pocket-sized to server-grade makes this one of the most significant open model releases of 2026 so far.