I’ve been super hesitant to use AI agents for coding. I could pay $20/month for Codex or Claude Code, but what if it writes mediocre code that I end up needing to refactor? What if I hit the usage limit and it ends up being much more than 20 bucks a month?
Gemma 4 was released on March 31, 2026; I found out about it 3 days ago via a YouTube video. The idea of an intelligent model that’s efficient enough to run locally sounded almost too good to be true.
I got it up and running on my own PC today, and I gotta say: a few hours of troubleshooting is absolutely worth the enormous sum of money I’ll save over the course of my career.
The remainder of this post shares details about how I used a Windows PC[1] with an Nvidia GPU to run Gemma 4. Hopefully after reading this, you won’t need to spend as much time troubleshooting as I did.
OpenCode is phenomenal: I’m pretty sure it’s the best way to turn model weights into a full-fledged coding agent for free.[2]
The OpenCode docs are helpful, but some of the information posted there led me astray:
- They say “you’ll need a modern terminal emulator”, but that’s not true! All I needed was a PowerShell instance to run a Gemma 4 server, and a second PowerShell instance to run OpenCode.
- According to the docs, WSL is recommended for the best experience. To be honest, I don’t know what the WSL experience is like, since OpenCode worked fine for me without it.
- The IDE page says you can open the VS Code integrated terminal and run opencode to automatically install the extension. That didn’t work. Even after manually installing the extension and running OpenCode in the integrated terminal, it still didn’t work: sometimes it wouldn’t open at all, and sometimes I could type prompts but nothing would happen in response. I definitely recommend using a standalone terminal window over the VS Code one.
- The providers page has an overwhelming list of ways to connect a model. Ollama supports local models and is featured on the Gemma 4 page, but it was frustrating to use. On multiple occasions I prompted it to read or edit one or more files, and the results ranged from being stuck in a thought loop forever to a flat “no sorry, I can’t do that”. The way I eventually found success involved ditching Ollama entirely.
I found that the best way to launch Gemma 4 is with llama.cpp: of the providers I tried, it’s the only one that both connects a local model to OpenCode and uses Nvidia’s CUDA Toolkit to run on the GPU.[3]
An AI chatbot said my RTX 3060 with 8GB VRAM would be able to handle the gemma-4-E4B-it-UD-Q4_K_XL model,[4] so I downloaded that one.
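The chatbot’s claim roughly checks out with a back-of-the-envelope estimate. Assuming ~4.5 bits per weight for a Q4_K-style quant (my assumption, not an official figure):

```python
# Rough weight-only VRAM estimate for a 4B-parameter model quantized at
# ~4.5 bits per weight (assumed density for a Q4_K-style quant). This
# ignores the KV cache and runtime overhead, which grow with context size.
params = 4e9
bits_per_weight = 4.5
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weights_gb:.2f} GB of VRAM for the weights alone")  # ~2.25 GB
```

The KV cache for a long context adds more on top, which is why the headroom on an 8GB card matters.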
Since I already have Git and Visual Studio (they’re required for building Flutter apps), all I needed was CMake, OpenSSL, and CUDA.
Once those were all installed, it was just a matter of setting up llama.cpp.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86"
cmake --build build --config Release
Or at least that’s what should have happened. In reality, I didn’t know exactly what I was doing, so I kept throwing random cmake commands until it worked:
cmake ..
cmake -B build . -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86"
cmake -B build -DGGML_CUDA=ON
cmake -B build .. -DGGML_CUDA=ON
cmake -B build . -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86" --config Release
cmake --build . --config Release -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86"
cmake -B build . -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86"
cmake --build . --config Release
The build took a very long time, but eventually it was done and I could run the server (on Windows, the binaries land in build\bin\Release).[5]
llama-server -m "C:\<filepath>\gemma-4-E4B-it-UD-Q4_K_XL.gguf" -ngl 99 --port 8080 --ctx-size 131072
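Before pointing OpenCode at it, you can confirm the server actually finished loading the model. llama.cpp’s built-in server exposes a /health endpoint; here’s a small check, a sketch that assumes the default port from the command above:

```python
# Poll llama-server's /health endpoint to see whether the model is loaded.
# Returns False if the server isn't reachable yet (e.g. still starting up).
import urllib.request
import urllib.error

def server_ready(url="http://127.0.0.1:8080/health", timeout=2):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

print(server_ready())  # False until llama-server is up and the model is loaded
```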
I found the opencode.json config file and set the provider so its base URL matched the port from that llama-server command.
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"llama-local": {
"name": "gemma-4-E4B-it-UD-Q4_K_XL",
"npm": "@ai-sdk/openai-compatible",
"options": {
"baseURL": "http://127.0.0.1:8080/v1"
},
"models": {
"local-model-id": {
"name": "Gemma 4",
"tool_call": true
}
}
}
},
"mcp": {
"dart-mcp-server": {
"command": [
"dart",
"mcp-server"
],
"enabled": true,
"environment": {},
"type": "local"
}
}
}
Adding the Dart MCP server info here allows the agent to run static analysis whenever it edits the codebase.
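Under the hood, OpenCode talks to llama-server through the OpenAI-compatible chat completions API, which is why the @ai-sdk/openai-compatible package works here. A sketch of the request body it sends (field values are illustrative):

```python
# Illustrative shape of a request to llama-server's OpenAI-compatible
# /v1/chat/completions endpoint. "local-model-id" matches the model id
# declared under "models" in opencode.json.
import json

payload = {
    "model": "local-model-id",
    "messages": [
        {"role": "user", "content": "Summarize what main.dart does."},
    ],
    "stream": True,  # streaming responses are typical for interactive agents
}
print(json.dumps(payload, indent=2))
```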
With the llama-server running in one PowerShell terminal, all that’s left to do is open another one for OpenCode:
cd .\path\to\my-project
opencode
And now I have an agentic AI coding assistant that doesn’t cost a single dime.[6]
Overall, I’m happy with the setup thus far, though there’s still room for improvement. Perhaps at some point I could upgrade to an RTX 4090 with 24GB VRAM; that one can handle the more powerful gemma-4-31B model.
I’m also keeping an eye on flutter.dev/go/packaged-ai-assets, since it looks like an exciting project that would boost agent productivity.
Click here for a macOS tutorial! ↩︎
Another option would be to use a clean room rewrite of Claude Code’s leaked source code… I guess we’ll see how that situation develops. ↩︎
Who knew that a gaming PC would be a good investment as a software engineer? ↩︎
E4B is the variation with 4 billion parameters.
“it” stands for “instruction tuned” (it responds to prompts instead of just predicting the next word).
UD-Q4_K_XL describes the quantization technique: super efficient and takes advantage of the available VRAM. ↩︎
I used a .gguf file I downloaded, but it might be easier to just use the -hf flag. ↩︎
Technically there’s the electricity cost (and maybe GPU wear-and-tear?) but that feels overly pedantic. ↩︎