AI / LLM Cost Optimization
AI spend is the line item that grew the fastest on every cloud bill we've reviewed in the last year. It's also the one with the loosest cost discipline — most teams adopted LLMs as a feature first and a cost center second.
The good news is that LLM cost optimization is mechanical. The same handful of patterns drive 80% of the savings, regardless of vendor.
What QueryWise tracks
We treat AI workloads as first-class billing entities, not just "API costs." Connectors to OpenAI, Anthropic, AWS Bedrock, SageMaker, Google Vertex AI, and Azure OpenAI Service feed the Costs → AI tab, which breaks spend down by:
- Vendor — OpenAI vs Anthropic vs Bedrock vs Vertex
- Model — GPT-4o vs GPT-4o-mini, Claude Opus vs Sonnet vs Haiku, Titan vs Cohere
- Workload type — chat completion, embedding, fine-tuning, batch
- Input vs output tokens — these have very different unit costs
- Caller / application — when tagging is in place
Once that's flowing, the questions a CFO actually asks become tractable: "What's our AI spend as a percentage of cloud?", "Which model is the biggest line item?", "Which team drove the increase last month?"
The seven patterns that drive most savings
The detector library has 27+ AI-specific detectors. The themes:
1. Output verbosity
The most expensive thing about a long answer is that you paid for every token of it, and output tokens are billed at a higher rate than input tokens. Detectors flag prompts whose outputs are systematically longer than necessary, where adding a max_tokens ceiling cuts cost without changing fidelity.
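A minimal sketch of the fix, assuming the OpenAI Python SDK; the model name and the 256-token ceiling are illustrative, not a recommendation:

```python
from openai import OpenAI

client = OpenAI()

# Cap output length so a verbose prompt can't run up the bill.
# Set the ceiling from the observed length of answers users actually consume.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize this ticket in two sentences: ..."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

Anthropic's Messages API has the same knob (max_tokens is required there), and newer OpenAI models accept the ceiling as max_completion_tokens.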
2. Repeated processing
Same prompt, same context, computed thousands of times a day because no one added a cache. We detect this from prompt-hash repetition in the request stream and recommend prompt caching (Anthropic / Bedrock) or an application-level cache.
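A minimal sketch of an application-level cache keyed on a prompt hash, the same signal the detector uses; the in-memory dict stands in for whatever store you would actually use in production:

```python
import hashlib
import json

_cache: dict[str, str] = {}

def cached_completion(client, model: str, messages: list[dict]) -> str:
    # Hash the full request payload so an identical prompt + context pair
    # is computed once and served from the cache afterwards.
    key = hashlib.sha256(
        json.dumps({"model": model, "messages": messages}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        response = client.chat.completions.create(model=model, messages=messages)
        _cache[key] = response.choices[0].message.content
    return _cache[key]
```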
3. Model tiering / downgrade candidates
A workload running on GPT-4o that scores within tolerance on GPT-4o-mini. Detectors find these by sampling output traffic and flagging where the larger model isn't earning its premium.
4. Batch API opportunities
OpenAI and Anthropic both offer 50% discounted batch APIs for non-realtime workloads. Detectors find traffic patterns that fit batch (high volume, latency-tolerant).
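A minimal sketch of moving a latency-tolerant workload to the OpenAI Batch API; the JSONL file of requests is assumed to exist and its name is illustrative:

```python
from openai import OpenAI

client = OpenAI()

# requests.jsonl holds one chat-completions request per line in the batch
# input format; results come back within the completion window at half
# the synchronous price.
batch_input = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_input.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)  # poll client.batches.retrieve(batch.id) until it completes
```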
5. Idle inference endpoints
SageMaker, Bedrock provisioned throughput, Vertex endpoints — these bill 24/7 even when traffic is bursty. Detectors flag endpoints with low utilization and recommend on-demand or smaller throughput.
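One way to check this yourself for a SageMaker endpoint, sketched with boto3 and CloudWatch; the endpoint and variant names are placeholders:

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

# Hourly invocation counts for the last 7 days. CloudWatch may omit
# periods with no traffic entirely, so count active hours and compare
# against the full window rather than counting zeros.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="Invocations",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-endpoint"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(days=7),
    EndTime=datetime.now(timezone.utc),
    Period=3600,
    Statistics=["Sum"],
)
active_hours = sum(1 for point in stats["Datapoints"] if point["Sum"] > 0)
print(f"endpoint active in {active_hours} of {7 * 24} hours")
```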
6. Prompt caching candidates
Anthropic prompt caching cuts repeated context cost by ~90%. Detectors find prompts where the system message and tools are stable across many calls — the highest-leverage caching opportunity.
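A minimal sketch using the Anthropic SDK's cache_control marker on a stable system prompt; SYSTEM_PROMPT and the model name are placeholders:

```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = "...your long, stable system prompt and tool definitions..."

# The cache_control marker caches everything up to and including this block;
# subsequent calls that reuse the same prefix pay the cheaper cache-read rate.
response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "What changed in the last deploy?"}],
)
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)
```

Note that prefixes below a minimum token length aren't cached, so very short system prompts won't benefit.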
7. Embedding cache misses
Embedding workloads typically have a long tail of repeated documents. Detectors look for signature-level repetition and recommend a vector cache layer.
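A minimal sketch of a signature-keyed embedding cache; the in-memory dict again stands in for your vector or key-value store, and the model name is illustrative:

```python
import hashlib

def embed_with_cache(client, texts: list[str], cache: dict,
                     model: str = "text-embedding-3-small") -> list:
    # Key each document by a content hash and only send unseen documents
    # to the embeddings endpoint; repeated documents hit the cache.
    keys = [hashlib.sha256(text.encode()).hexdigest() for text in texts]
    missing = [(key, text) for key, text in zip(keys, texts) if key not in cache]
    if missing:
        response = client.embeddings.create(model=model, input=[text for _, text in missing])
        for (key, _), item in zip(missing, response.data):
            cache[key] = item.embedding
    return [cache[key] for key in keys]
```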
Cortex / Snowflake AI specifically
Snowflake Cortex bills as credit consumption against a warehouse. Detectors here are different:
- Cortex user sprawl — too many users running ad-hoc Cortex queries on the wrong warehouse.
- Cortex Search inefficiency — over-indexed search services with low query rate.
- Cortex Analyst overuse — Analyst queries bypassing the cheaper completion path.
- Document parsing redundancy — re-parsing the same document repeatedly because no caching layer.
These show up alongside the rest of your AI workloads on the AI cost tab.
The workflow
A practical AI cost review using QueryWise:
- Connect every AI vendor. Until OpenAI, Bedrock, Vertex, and your hosted models are all in one view, you can't answer the basic questions.
- Tag your callers. Tag spend with the application or feature making the call (a minimal wrapper sketch follows this list). This is the difference between "OpenAI cost is up 40%" and "the new search feature drove $80k of that increase."
- Apply the cheap fixes first. Batch APIs, max_tokens ceilings, idle endpoint shutdowns. These are reversible and have no fidelity risk.
- Test the model tiering candidates. The agent can shadow-route a sample to the cheaper model and report the quality delta (see the shadow-routing sketch after this list). We recommend a 1-2 day shadow before downgrading.
- Add prompt caching. This is the highest-leverage one-time fix for chat-style workloads with stable system prompts. Implementation is small; savings are 60–90% on cached portions.
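For the tagging step, a minimal caller-side sketch that records token usage per application; the record sink (a plain list here) and field names are assumptions for illustration, not a QueryWise API:

```python
usage_records: list[dict] = []

def tracked_completion(client, app: str, *, model: str, messages: list[dict], **kwargs):
    # Make the call, then emit a usage record the cost pipeline can
    # aggregate by application and model.
    response = client.chat.completions.create(model=model, messages=messages, **kwargs)
    usage_records.append({
        "app": app,
        "model": model,
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
    })
    return response
```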
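And for the tiering test, a sketch of shadow-routing a small sample to the cheaper model. This illustrates the idea rather than the agent's implementation; the models and sample rate are placeholders:

```python
import random

def completion_with_shadow(client, messages: list[dict], log: list,
                           primary: str = "gpt-4o", shadow: str = "gpt-4o-mini",
                           sample_rate: float = 0.05):
    # Users always get the primary model's answer; a small sample also hits
    # the cheaper model so the two outputs can be compared offline.
    response = client.chat.completions.create(model=primary, messages=messages)
    if random.random() < sample_rate:
        shadow_response = client.chat.completions.create(model=shadow, messages=messages)
        log.append({
            "messages": messages,
            primary: response.choices[0].message.content,
            shadow: shadow_response.choices[0].message.content,
        })
    return response
```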
What to track week-over-week
Three numbers that matter:
- AI spend as % of total cloud — gives you the trend at the right grain
- Cost per request, by application — tells you when a code change degraded efficiency (a calculation sketch follows this list)
- Cache hit rate, where applicable — leading indicator of cost savings
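A sketch of the cost-per-request calculation from tagged usage records like the ones collected in the workflow section; the per-million-token prices are placeholders to replace with your actual rates:

```python
from collections import defaultdict

# Illustrative prices per 1M tokens; substitute your negotiated rates per model.
PRICE_PER_M = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}

def cost_per_request(usage_records: list[dict]) -> dict[str, float]:
    # Aggregate spend and request counts per application, then divide.
    totals, counts = defaultdict(float), defaultdict(int)
    for record in usage_records:
        price = PRICE_PER_M[record["model"]]
        totals[record["app"]] += (record["prompt_tokens"] * price["input"]
                                  + record["completion_tokens"] * price["output"]) / 1_000_000
        counts[record["app"]] += 1
    return {app: totals[app] / counts[app] for app in totals}
```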
Where to next
- Cost Attribution Foundations — without tags, AI cost optimization is guesswork.
- Query Optimization Playbook — for retrieval-side cost (vector DBs, RAG pipelines).