Cut your LLM API costs by up to 80%
SemaCache is an intelligent caching proxy for LLM APIs. It returns cached responses when a semantically similar query has been seen before — saving you money on every repeated question.
# Before — calling OpenAI directly
from openai import OpenAI
client = OpenAI(api_key="sk-...")

# After — just change the base URL
client = OpenAI(
    api_key="sc-your-key",
    base_url="https://api.semacache.io/v1"
)

You're burning $81/mo on repeat queries
Up to 40% of your API calls return the same or similar answers. SemaCache intercepts them before they hit your LLM — so you only pay once.
Your monthly savings: $33. That's $390/year back in your pocket.
- 40% cost reduction
- 10K free cache hits
- 8-day payback period
Stop leaving $390/year on the table.
Pro pays for itself in 8 days. Cancel anytime. No risk.
Trusted by developers building with OpenAI, Gemini, and custom models. One line of code. Instant savings.
Features
Three tiers of intelligent caching
Every request flows through a fast pipeline: exact hash → semantic similarity → LLM passthrough. Each tier is cheaper and faster than calling the LLM directly.
Exact Match Cache
MD5 hash lookup in Redis. Identical queries return cached responses in under 5ms.
Semantic Match Cache
Gemini-powered embeddings with pgvector similarity search. Catches paraphrased queries automatically.
Multi-Provider Routing
One endpoint for OpenAI, Gemini, and xAI Grok. SemaCache auto-detects the provider from the model name and routes accordingly.
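The auto-detection described above can be sketched as a prefix match on the model name. The mapping below is an illustrative assumption, not SemaCache's internal routing table:

```python
def detect_provider(model: str) -> str:
    """Guess the upstream provider from a model-name prefix.

    This prefix table is an assumption for illustration only;
    SemaCache's real routing rules are internal to the service.
    """
    prefixes = {
        "gpt-": "openai",
        "gemini-": "google",
        "imagen-": "google",
        "veo-": "google",
        "grok-": "xai",
    }
    for prefix, provider in prefixes.items():
        if model.startswith(prefix):
            return provider
    # Unrecognized names fall through to the custom-model registry
    return "custom"

print(detect_provider("gpt-4o-mini"))       # openai
print(detect_provider("gemini-2.0-flash"))  # google
print(detect_provider("my-llama"))          # custom
```

A lookup like this is why custom models (covered below) need to be registered by name: anything outside the known prefixes has to map to an endpoint you configured.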
Encrypted Key Storage
Store your LLM API keys securely in the dashboard. AES-256 encrypted at rest — keys never leave our servers.
Real-Time Analytics
Dashboard with cache hit rates, latency metrics, cost savings, and daily request volume per API key.
OpenAI-Compatible API
Drop-in replacement for any OpenAI SDK client. Works with Python, JavaScript, Go, and any other language with an OpenAI-compatible client library.
How it works
From request to response in milliseconds
Your app sends a request
Point your OpenAI client at SemaCache. Your app sends requests as usual — no code changes needed beyond changing the base URL.
Exact match check
We hash the query and check Redis. If the identical query was asked before, the cached response is returned in ~5ms.
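The exact-match tier boils down to a deterministic hash of the request payload used as a cache key. A minimal sketch, with a plain dict standing in for Redis and function names that are illustrative rather than SemaCache internals:

```python
import hashlib
import json

# A dict stands in for Redis here; the lookup logic is the same.
cache: dict[str, str] = {}

def cache_key(model: str, messages: list) -> str:
    """MD5 over a normalized JSON payload, so key order and
    whitespace differences never produce distinct keys."""
    payload = json.dumps({"model": model, "messages": messages},
                         sort_keys=True, separators=(",", ":"))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def exact_lookup(model: str, messages: list):
    """Return the cached response, or None on a miss."""
    return cache.get(cache_key(model, messages))

msgs = [{"role": "user", "content": "Capital of France?"}]
cache[cache_key("gpt-4o-mini", msgs)] = "Paris"
print(exact_lookup("gpt-4o-mini", msgs))  # Paris
```

Normalizing before hashing is what makes "identical query" robust: two clients serializing the same messages differently still hit the same key.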
Semantic similarity search
If no exact match, we embed the query with Gemini and search our pgvector index. Paraphrased queries like "What's France's capital?" match "Capital of France?" with high confidence.
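Under the hood, "semantically similar" means the query's embedding is close to a cached one by cosine similarity. A toy sketch with hand-made 3-d vectors (real embeddings come from Gemini and are searched in pgvector; the threshold value here is an assumption):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embedding index"; in production this is a pgvector table.
index = {
    "Capital of France?": [0.9, 0.1, 0.2],
    "Best pizza toppings": [0.1, 0.8, 0.3],
}

def semantic_lookup(query_vec, threshold=0.9):
    """Return the nearest cached query if it clears the threshold."""
    best_query, best_vec = max(index.items(),
                               key=lambda kv: cosine(query_vec, kv[1]))
    return best_query if cosine(query_vec, best_vec) >= threshold else None

# A paraphrase embeds near the cached question and matches it.
print(semantic_lookup([0.88, 0.12, 0.21]))  # Capital of France?
```

The threshold is the knob that trades hit rate against the risk of serving an answer to a subtly different question.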
LLM passthrough & cache
On full miss, we route to the correct provider (OpenAI, Gemini, or Grok based on model name), return the response, and cache it for future hits.
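The miss path is the classic cache-aside pattern: call upstream, store the response, return it, so the next identical request is a hit. A minimal sketch with the provider call stubbed out (names here are illustrative, not SemaCache internals):

```python
cache: dict[str, str] = {}

def call_provider(model: str, prompt: str) -> str:
    # Stub standing in for the real OpenAI / Gemini / Grok call.
    return f"response from {model}"

def complete(model: str, prompt: str) -> str:
    key = f"{model}:{prompt}"  # real keys are hashes, not raw strings
    if key in cache:
        return cache[key]                     # tier 1/2: cache hit
    response = call_provider(model, prompt)   # tier 3: passthrough
    cache[key] = response                     # store for future hits
    return response

print(complete("gpt-4o-mini", "Hello"))  # miss: calls the provider
print(complete("gpt-4o-mini", "Hello"))  # hit: served from cache
```

Because the cache is populated on the way out, only the first occurrence of a query ever incurs provider cost.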
Text, images, and video — all cached.
Every API call goes through the same three-tier pipeline. The first request generates and caches. Every repeat returns instantly — whether it’s a chat reply, a 4K image, or a generated video.
Measured end-to-end on production (Google Cloud Run), including full network round-trip. Chat: OpenAI GPT-4o Mini & Gemini 2.0 Flash. Image: OpenAI GPT Image 1 & Google Imagen 4.0. Video: Google Veo 2 & Veo 3. Same caching applies to xAI Grok and all other supported models.
Supported Models
Works with every major LLM provider
Built-in support for OpenAI, Gemini, xAI Grok, Imagen, and Veo. Plus register any OpenAI-compatible endpoint as a custom model.
OpenAI
Chat Completions
Google Gemini
Chat Completions
xAI Grok
Chat Completions
Image Generation
OpenAI, Google, xAI
Video Generation
Google Veo, xAI
Bring your own model
Register any OpenAI-compatible endpoint — vLLM, Ollama, Together AI, Groq, Fireworks, or your own self-hosted model. SemaCache caches responses from custom models the same way it caches OpenAI and Gemini.
- Register via dashboard or API — set base URL, model name, and auth
- Full three-tier caching: exact → semantic → passthrough
- Works with any provider that speaks OpenAI-compatible format
# Register "my-llama" in Dashboard → Custom Models
# Then use it like any built-in model
from openai import OpenAI

client = OpenAI(
    api_key="sc-your-key",
    base_url="https://api.semacache.io/v1"
)
response = client.chat.completions.create(
    model="my-llama",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)

Pricing
Start free, scale with confidence
Every plan includes multi-provider support and encrypted key storage.
Free
For experimentation and side projects
- 1,000 requests / month
- 1 API key
- Text + image caching
- 7-day audit logs
- Community support
Pro
For developers shipping to production
- 50,000 requests / month
- 5 API keys
- Text + image + video caching
- Custom model registry
- 30-day audit logs
- Email support
Enterprise
For teams at scale
- 500,000 requests / month
- Unlimited API keys
- Text + image + video caching
- Custom model registry
- 90-day audit logs
- Priority support
Ready to cut your LLM costs?
Get started in under a minute. No credit card required. Change one line of code and start saving.