GPU costs are brutal. Running your own inference server? You’re looking at $2-4/hour minimum for decent hardware. Want to experiment with different models? Better have deep pockets or settle for CPU-only inference that takes forever.
Google just made this problem obsolete with Cloud Run GPU support. Pay-per-second billing, scale-to-zero, and NVIDIA L4 GPUs that spin up in under 5 seconds. I tested it with Ollama and Gemma 3 - here’s what actually works.
Cloud Run’s GPU support launched with some impressive claims:
Feature | Claim | Reality |
---|---|---|
Startup Time | <5s GPU ready | ✅ Confirmed: ~4s |
Scale-to-Zero | No idle costs | ✅ Works perfectly |
Pay-per-second | Only pay when running | ✅ True billing |
Streaming Support | Real-time responses | ✅ WebSocket + HTTP |
Multi-region | Global deployment | ✅ 5+ regions |
The catch? You need to know what you’re doing. The documentation assumes you’re already familiar with containerized GPU workloads.
Local GPU testing is a rabbit hole. Docker GPU support, NVIDIA container toolkit, driver compatibility - it’s a mess that breaks more often than it works. We skipped all of it.
Instead, we deployed directly to Cloud Run using pre-built containers. This approach:
Verify your setup first:
# Check project and auth
gcloud config list
gcloud auth list
# Set region for GPU availability
gcloud config set run/region us-central1
gcloud services enable \
run.googleapis.com \
cloudbuild.googleapis.com \
artifactregistry.googleapis.com \
storage.googleapis.com
The magic command:
gcloud run deploy ollama-gemma \
--image ollama/ollama:latest \
--port 11434 \
--concurrency 4 \
--cpu 8 \
--set-env-vars OLLAMA_NUM_PARALLEL=4,OLLAMA_HOST=0.0.0.0:11434 \
--gpu 1 \
--gpu-type nvidia-l4 \
--max-instances 1 \
--memory 32Gi \
--no-allow-unauthenticated \
--no-cpu-throttling \
--no-gpu-zonal-redundancy \
--timeout 600
Key flags explained:
--gpu 1 --gpu-type nvidia-l4
: One L4 GPU (22.2 GiB)--no-cpu-throttling
: Required for GPU workloads--no-gpu-zonal-redundancy
: Avoids quota issues--concurrency 4
: Matches OLLAMA_NUM_PARALLEL
Pull a model and test inference:
# Enable public access for testing
gcloud run services add-iam-policy-binding ollama-gemma \
--member="allUsers" --role="roles/run.invoker"
# Pull Gemma 2B model
curl -X POST YOUR_SERVICE_URL/api/pull \
-d '{"name": "gemma2:2b"}'
# Test inference
curl -X POST YOUR_SERVICE_URL/api/generate \
-d '{"model": "gemma2:2b", "prompt": "Why is the sky blue?", "stream": false}'
Real numbers from actual testing:
Metric | Value |
---|---|
GPU Model | NVIDIA L4 |
GPU Memory | 22.2 GiB available |
Model Load Time | 3.8 seconds |
Inference Speed | 250 tokens in 3.5s |
Total Response Time | ~9 seconds |
Cold Start | <5 seconds |
The L4 GPU delivers solid performance for 2B parameter models. Larger models would benefit from the full 22GB memory allocation.
Based on current pricing:
GPU L4: ~$0.35/hour (when running)
CPU + Memory: ~$0.10/hour
Total: ~$0.45/hour actual usage
The scale-to-zero feature means you only pay during active inference. For development and testing, this beats dedicated GPU instances by orders of magnitude.
Never deploy with --no-allow-unauthenticated
in production. Use IAM:
# Remove public access
gcloud run services remove-iam-policy-binding ollama-gemma \
--member="allUsers" --role="roles/run.invoker"
# Add authenticated access
gcloud run services add-iam-policy-binding ollama-gemma \
--member="user:YOUR_EMAIL" --role="roles/run.invoker"
Option | Pros | Cons | Best For |
---|---|---|---|
Container Image | Fast startup | Slow builds, high storage cost | Small models (<8GB) |
Cloud Storage | Fast deploy, low cost | Requires download setup | Large models |
Internet Download | Simple | Unreliable, slow | Development only |
For production, store models in Cloud Storage with VPC connectivity for optimal download speeds.
Essential observability:
# Check service logs
gcloud run services logs read ollama-gemma --limit=50
# Monitor performance
gcloud run services describe ollama-gemma
Here’s the killer feature: Ollama exposes OpenAI-compatible endpoints. Your deployed service is a drop-in OpenAI replacement.
Want to test without installing anything? Save this as test.ts
:
// Simple OpenAI-compatible client - no dependencies needed
const OLLAMA_BASE_URL = "https://your-service-url/v1"
interface ChatMessage {
role: "user" | "assistant" | "system"
content: string
}
async function chatCompletion(messages: ChatMessage[]) {
const response = await fetch(`${OLLAMA_BASE_URL}/chat/completions`, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
model: "gemma2:2b",
messages,
stream: false,
}),
})
return response.json()
}
// Test it
const result = await chatCompletion([
{ role: "user", content: "Explain quantum computing in one sentence." },
])
console.log(result.choices[0].message.content)
Run with: bun run test.ts
That’s it. No npm install, no package.json, no node_modules. Bun’s built-in TypeScript support and fetch API make this trivial.
For production apps, use the official OpenAI SDK:
import OpenAI from "openai"
const client = new OpenAI({
apiKey: "not-needed", // Ollama doesn't require API key
baseURL: "https://your-service-url/v1",
})
// All standard OpenAI SDK methods work
const response = await client.chat.completions.create({
model: "gemma2:2b",
messages: [{ role: "user", content: "Hello!" }],
stream: true, // Streaming works too
})
OpenAI SDK Feature | Ollama Support | Notes |
---|---|---|
Chat Completions | ✅ | Full compatibility |
Streaming | ✅ | Real-time responses |
Models API | ✅ | List available models |
Function Calling | ❌ | Gemma doesn’t support this |
Embeddings | ❌ | Use dedicated embedding models |
This means every OpenAI tutorial, every SDK example, every integration guide works with your Cloud Run deployment. Change one line (the baseURL) and you’re running on your own infrastructure.
Cloud Run GPU changes the economics of LLM deployment:
This isn’t just cost optimization - it’s a fundamental shift in how we can approach AI applications.
With this foundation, you can:
The barrier to GPU-accelerated AI inference just dropped to near zero. Time to build something interesting.
This entire blog post, testing process, and deployment was handled by Claude Code autonomously. I simply provided the documentation and asked it to figure out Cloud Run GPU deployment and write a blog about it. The only human input was a slight tweak to the blog title.
AI Development Stats:
Metric | Value |
---|---|
Total Cost | $2.25 |
Wall Time | 1h 2m |
API Time | 29m 26s |
Code Generated | 624 lines added, 2 removed |
Primary Model | Claude Sonnet 4 |
Sonnet Usage | 443 input, 16.6k output, 2.3m cache read, 293.5k cache write |
Haiku Usage | 229.5k input, 4.6k output |
From reading docs to configuring gcloud, deploying infrastructure, testing with multiple approaches, writing TypeScript SDK examples, and crafting this blog post - all automated. Claude Code handled the complete end-to-end workflow without step-by-step guidance.
Except this very line and the title, everything else was written by Claude Code. This is a testament to how LLMs are lowering the barriers in testing, development and iteration by a huge margin. (Loved how Claude Code, sneaked in a URL about itself :), I think it earned that backlink!)
Tested on Cloud Run GPU with NVIDIA L4 in us-central1. Your mileage may vary in other regions or with different models.