Cut your LLM API costs by up to 80%
SemaCache is an intelligent caching proxy for LLM APIs. It returns cached responses when a semantically similar query has been seen before — saving you money on every repeated question.
# Before — calling OpenAI directly
from openai import OpenAI
client = OpenAI(api_key="sk-...")

# After — just change the base URL
client = OpenAI(
    api_key="sc-your-key",
    base_url="https://api.semacache.io/v1"
)

You're burning $81/mo on repeat queries
Up to 40% of your API calls return the same or similar answers. SemaCache intercepts them before they hit your LLM — so you only pay once.
Your monthly savings: $33. That's $390/year back in your pocket.
- 40% cost reduction
- 10K free cache hits
- 8-day payback period
Stop leaving $390/year on the table.
Pro pays for itself in 8 days. Cancel anytime. No risk.
Trusted by developers building with OpenAI, Gemini, and custom models. One line of code. Instant savings.
Features
Three tiers of intelligent caching
Every request flows through a fast pipeline: exact hash → semantic similarity → LLM passthrough. Each tier is cheaper and faster than calling the LLM directly.
Exact Match Cache
MD5 hash lookup in Redis. Identical queries return cached responses in under 5ms.
Semantic Match Cache
Gemini-powered embeddings with pgvector similarity search. Catches paraphrased queries automatically.
Multi-Provider Routing
One endpoint for OpenAI, Gemini, and xAI Grok. SemaCache auto-detects the provider from the model name and routes accordingly.
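The auto-detection described above can be sketched as a prefix match on the model name. The mapping below is an illustrative assumption, not SemaCache's internal routing table:

```python
def detect_provider(model: str) -> str:
    """Guess the upstream provider from a model-name prefix.

    This prefix table is an assumption for illustration only;
    SemaCache's real routing rules are internal to the service.
    """
    prefixes = {
        "gpt-": "openai",
        "gemini-": "google",
        "imagen-": "google",
        "veo-": "google",
        "grok-": "xai",
    }
    for prefix, provider in prefixes.items():
        if model.startswith(prefix):
            return provider
    # Unrecognized names fall through to the custom-model registry
    return "custom"

print(detect_provider("gpt-4o-mini"))       # openai
print(detect_provider("gemini-2.0-flash"))  # google
print(detect_provider("my-llama"))          # custom
```

A lookup like this is why custom models (covered below) need to be registered by name: anything outside the known prefixes has to map to an endpoint you configured.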
Encrypted Key Storage
Store your LLM API keys securely in the dashboard. AES-256 encrypted at rest — keys never leave our servers.
Real-Time Analytics
Dashboard with cache hit rates, latency metrics, cost savings, and daily request volume per API key.
OpenAI-Compatible API
Drop-in replacement for any OpenAI SDK client. Works with Python, JavaScript, Go, and any other language with an OpenAI-compatible client library.
How it works
From request to response in milliseconds
Your app sends a request
Point your OpenAI client at SemaCache. Your app sends requests as usual — no code changes needed beyond changing the base URL.
Exact match check
We hash the query and check Redis. If the identical query was asked before, the cached response is returned in ~5ms.
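The exact-match tier boils down to a deterministic hash of the request payload used as a cache key. A minimal sketch, with a plain dict standing in for Redis and function names that are illustrative rather than SemaCache internals:

```python
import hashlib
import json

# A dict stands in for Redis here; the lookup logic is the same.
cache: dict[str, str] = {}

def cache_key(model: str, messages: list) -> str:
    """MD5 over a normalized JSON payload, so key order and
    whitespace differences never produce distinct keys."""
    payload = json.dumps({"model": model, "messages": messages},
                         sort_keys=True, separators=(",", ":"))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def exact_lookup(model: str, messages: list):
    """Return the cached response, or None on a miss."""
    return cache.get(cache_key(model, messages))

msgs = [{"role": "user", "content": "Capital of France?"}]
cache[cache_key("gpt-4o-mini", msgs)] = "Paris"
print(exact_lookup("gpt-4o-mini", msgs))  # Paris
```

Normalizing before hashing is what makes "identical query" robust: two clients serializing the same messages differently still hit the same key.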
Semantic similarity search
If no exact match, we embed the query with Gemini and search our pgvector index. Paraphrased queries like "What's France's capital?" match "Capital of France?" with high confidence.
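Under the hood, "semantically similar" means the query's embedding is close to a cached one by cosine similarity. A toy sketch with hand-made 3-d vectors (real embeddings come from Gemini and are searched in pgvector; the threshold value here is an assumption):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embedding index"; in production this is a pgvector table.
index = {
    "Capital of France?": [0.9, 0.1, 0.2],
    "Best pizza toppings": [0.1, 0.8, 0.3],
}

def semantic_lookup(query_vec, threshold=0.9):
    """Return the nearest cached query if it clears the threshold."""
    best_query, best_vec = max(index.items(),
                               key=lambda kv: cosine(query_vec, kv[1]))
    return best_query if cosine(query_vec, best_vec) >= threshold else None

# A paraphrase embeds near the cached question and matches it.
print(semantic_lookup([0.88, 0.12, 0.21]))  # Capital of France?
```

The threshold is the knob that trades hit rate against the risk of serving an answer to a subtly different question.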
LLM passthrough & cache
On full miss, we route to the correct provider (OpenAI, Gemini, or Grok based on model name), return the response, and cache it for future hits.
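The miss path is the classic cache-aside pattern: call upstream, store the response, return it, so the next identical request is a hit. A minimal sketch with the provider call stubbed out (names here are illustrative, not SemaCache internals):

```python
cache: dict[str, str] = {}

def call_provider(model: str, prompt: str) -> str:
    # Stub standing in for the real OpenAI / Gemini / Grok call.
    return f"response from {model}"

def complete(model: str, prompt: str) -> str:
    key = f"{model}:{prompt}"  # real keys are hashes, not raw strings
    if key in cache:
        return cache[key]                     # tier 1/2: cache hit
    response = call_provider(model, prompt)   # tier 3: passthrough
    cache[key] = response                     # store for future hits
    return response

print(complete("gpt-4o-mini", "Hello"))  # miss: calls the provider
print(complete("gpt-4o-mini", "Hello"))  # hit: served from cache
```

Because the cache is populated on the way out, only the first occurrence of a query ever incurs provider cost.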
Text, images, and video — all cached.
Every API call goes through the same three-tier pipeline. The first request generates and caches. Every repeat returns instantly — whether it’s a chat reply, a 4K image, or a generated video.
Measured end-to-end on production (Google Cloud Run), including full network round-trip. Chat: OpenAI GPT-4o Mini & Gemini 2.0 Flash. Image: OpenAI GPT Image 1 & Google Imagen 4.0. Video: Google Veo 2 & Veo 3. Same caching applies to xAI Grok and all other supported models.
Supported Models
Works with every major LLM provider
Built-in support for OpenAI, Gemini, xAI Grok, Imagen, and Veo. Plus register any OpenAI-compatible endpoint as a custom model.
OpenAI
Chat Completions
Google Gemini
Chat Completions
xAI Grok
Chat Completions
Image Generation
OpenAI, Google, xAI
Video Generation
Google Veo, xAI
Bring your own model
Register any OpenAI-compatible endpoint — vLLM, Ollama, Together AI, Groq, Fireworks, or your own self-hosted model. SemaCache caches responses from custom models the same way it caches OpenAI and Gemini.
- Register via dashboard or API — set base URL, model name, and auth
- Full three-tier caching: exact → semantic → passthrough
- Works with any provider that speaks OpenAI-compatible format
# Register "my-llama" in Dashboard → Custom Models
# Then use it like any built-in model
from openai import OpenAI

client = OpenAI(
    api_key="sc-your-key",
    base_url="https://api.semacache.io/v1"
)
response = client.chat.completions.create(
    model="my-llama",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)

Pricing
Start free, scale with confidence
Every plan includes multi-provider support and encrypted key storage.
Free
For experimentation and side projects
- 1,000 requests / month
- 1 API key
- Text + image caching
- 7-day audit logs
- Community support
Pro
For developers shipping to production
- 50,000 requests / month
- 5 API keys
- Text + image + video caching
- Custom model registry
- 30-day audit logs
- Email support
Enterprise
For teams at scale
- 500,000 requests / month
- Unlimited API keys
- Text + image + video caching
- Custom model registry
- 90-day audit logs
- Priority support
Ready to cut your LLM costs?
Get started in under a minute. No credit card required. Change one line of code and start saving.