Recommendations Engine

The recommendations engine is the part of QueryWise that turns observation into action. It runs three parallel paths — pattern detection, AI deep analysis, and right-sizing — and combines them into a single ranked recommendation feed.

What a recommendation is

Every recommendation has the same shape:

{
  "id": "...",
  "category": "query_tuning",
  "severity": "high",
  "title": "Convert wide SELECT * to projected columns",
  "description": "...",
  "evidence": { ... },
  "remediation": "ALTER VIEW ...",
  "estimated_monthly_savings": 1240.00,
  "confidence": 0.78,
  "source": "detector",
  "vendor": "snowflake",
  "affected_resource": "ACCOUNT.SCHEMA.TABLE"
}

source is one of detector, ai, or right_sizing. confidence for detector-generated recs is a fixed per-pattern score; for AI-generated recs it is the agent's self-reported confidence (gated to ≥0.4).

Path 1: Anti-pattern detector library

The detector library lives in qw_shared.anti_patterns and currently has 130+ detectors. Categories:

  • Generic SQL — SELECT *, missing predicates, Cartesian joins, N+1 patterns, unused indexes, repeated duplicate queries
  • Snowflake — undersized warehouses, poor clustering, missing partitioning, serverless candidates, auto-suspend tuning
  • BigQuery — slot waste, partition pruning misses, materialized view candidates, hidden cost traps
  • Redshift — WLM queueing, dist key skew, vacuum needs, compile overhead, concurrency scaling
  • Databricks / Spark — shuffle-heavy jobs, broadcast join candidates, oversized warehouses, missing liquid clustering
  • Synapse — data movement, columnstore inefficiency, DWU overprovisioning, resource class mismatch
  • MongoDB — pipeline ordering, shard imbalance, cross-partition queries, scan limits
  • Cosmos DB — RU hotspots, low index hit rate, large document scans, autoscale misfits
  • Cortex / AI — Cortex user sprawl, Cortex Search inefficiency, Cortex Analyst overuse, document parsing redundancy
  • AI / LLM — output verbosity, repeated processing, model tiering candidates, batch API opportunities, idle endpoints, prompt caching, embedding cache misses
  • Pipelines — idle Snowpipes, over-provisioned Glue DPUs, DLT cold-start tax, Dataflow streaming with low input rate, Data Factory copy activity not pushing down (20 detectors total)
  • Vector / RAG — embedding-side cost patterns (27 detectors)

Each detector returns an AntiPatternMatch with severity, affected queries/resources, evidence, and a remediation snippet. The recommendation generator filters out duplicates, attaches savings estimates, and writes the recs to the recommendations table.
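
To make the contract concrete, here is a minimal sketch of a detector and its return type in the spirit of qw_shared.anti_patterns. The exact field names and the detect_select_star helper are illustrative, not the actual implementation:

from dataclasses import dataclass, field

@dataclass
class AntiPatternMatch:
    pattern_id: str
    severity: str                        # critical | high | medium | low
    affected_resource: str
    evidence: dict = field(default_factory=dict)
    remediation: str = ""
    confidence: float = 0.7              # fixed per-pattern score for detector recs

def detect_select_star(query_text: str, resource: str) -> AntiPatternMatch | None:
    # Hypothetical generic-SQL detector: flag wide SELECT * reads.
    if "select *" in query_text.lower():
        return AntiPatternMatch(
            pattern_id="generic.select_star",
            severity="medium",
            affected_resource=resource,
            evidence={"query_text": query_text},
            remediation="Project only the columns the consumer actually reads.",
        )
    return None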

Path 2: AI deep analysis

For queries that pattern detection can't help with, the deep analysis flow hands the query off to the analysis agent. The agent has the following tools:

  • fetch_query_stats — execution stats for the query
  • fetch_query_plan — the most recent explain plan
  • fetch_schema_metadata — schema for tables touched by the query
  • fetch_billing_context — cost context for the warehouse / cluster
  • fetch_existing_recommendations — to dedupe against open recs
  • save_ai_recommendation — the structured output sink
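
Because the agent can only act through this allowlist, the natural shape is a dispatch table keyed by tool name. A sketch with stub bodies and illustrative signatures:

from typing import Any, Callable

def fetch_query_stats(query_id: str) -> dict[str, Any]:
    return {"query_id": query_id}   # stub; the real tool reads execution stats

def fetch_query_plan(query_id: str) -> dict[str, Any]:
    return {"plan": None}           # stub; the real tool fetches the latest explain plan

TOOLS: dict[str, Callable[..., dict[str, Any]]] = {
    "fetch_query_stats": fetch_query_stats,
    "fetch_query_plan": fetch_query_plan,
    # ...the remaining four tools register the same way
}

def call_tool(name: str, **kwargs: Any) -> dict[str, Any]:
    if name not in TOOLS:
        raise ValueError(f"tool not in allowlist: {name}")
    return TOOLS[name](**kwargs)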

The agent runs an agentic loop with several guardrails:

  • Per-tenant budget — daily token / call cap
  • Per-user rate limit — 5 deep analyses per user per day
  • Confidence threshold — recs below 0.4 are dropped
  • Plausibility cap — savings estimates >$50,000/month are flagged for review
  • Audit log — every run is recorded in analysis_audit_logs

The agent is deliberately scoped — it can't write to anything other than the recommendations table and audit log.
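
Putting the guardrails together, the gating around a single run might look like the sketch below. The 5-per-day limit, 0.4 confidence floor, and $50,000/month plausibility cap are the stated values; everything else is illustrative:

from dataclasses import dataclass

DAILY_RUNS_PER_USER = 5        # per-user rate limit
MIN_CONFIDENCE = 0.4           # recs below this are dropped
PLAUSIBILITY_CAP = 50_000.0    # USD/month; above this, flag for review

@dataclass
class AiRecommendation:
    confidence: float
    estimated_monthly_savings: float
    needs_review: bool = False

def gate_ai_recommendation(rec: AiRecommendation, runs_today: int) -> AiRecommendation | None:
    # The rate limit is really checked before the run starts; shown here for brevity.
    if runs_today >= DAILY_RUNS_PER_USER:
        raise RuntimeError("deep-analysis rate limit reached for this user")
    if rec.confidence < MIN_CONFIDENCE:
        return None                      # dropped before it reaches the rec table
    if rec.estimated_monthly_savings > PLAUSIBILITY_CAP:
        rec.needs_review = True          # flagged for review, not dropped
    return rec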

Path 3: Right-sizing

Right-sizing is its own path because it doesn't operate on queries — it operates on compute primitives. The detectors here read load history, concurrency, and idle time across the daily curve, then compare against the vendor's tier catalog (Snowflake warehouse sizes, Databricks node types, RDS classes, etc.).

A right-sizing rec has the same shape as the others but the evidence field includes the load curve and concurrency stats so the workload owner can sanity-check the recommendation before applying it.
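
As a sketch of the comparison against a tier catalog (capacities, headroom factor, and units here are illustrative, not QueryWise's actual model):

# Relative capacity per tier; real catalogs come from the vendor integrations.
WAREHOUSE_TIERS = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16}

def recommend_tier(load_curve: list[float], current: str, headroom: float = 1.2) -> str | None:
    """Smallest tier that covers peak load (in tier-capacity units) with headroom."""
    needed = max(load_curve) * headroom
    for tier, capacity in WAREHOUSE_TIERS.items():    # ordered smallest to largest
        if capacity >= needed:
            return None if tier == current else tier  # None -> already right-sized
    return None                                       # peak exceeds catalog; no rec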

Severity & ranking

Severities: critical | high | medium | low. Severity is determined by:

  • Estimated monthly savings (heuristic thresholds: critical >= $5k, high >= $1k, medium >= $250)
  • Risk of not applying (idle endpoint billed 24/7 → high)
  • Confidence (low-confidence recs get severity downgraded)

The default sort in the dashboard is estimated savings descending, but you can re-sort by category, vendor, or age.
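
A minimal version of the severity heuristic, using the dollar thresholds stated above; the 0.5 confidence cutoff for the downgrade is an assumption, and the risk input is omitted for brevity:

LEVELS = ["low", "medium", "high", "critical"]

def severity_for(savings_usd_month: float, confidence: float) -> str:
    if savings_usd_month >= 5_000:
        sev = "critical"
    elif savings_usd_month >= 1_000:
        sev = "high"
    elif savings_usd_month >= 250:
        sev = "medium"
    else:
        sev = "low"
    if confidence < 0.5 and sev != "low":   # assumed downgrade threshold
        sev = LEVELS[LEVELS.index(sev) - 1]
    return sev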

Savings estimation

Savings come from savings_estimator.py, which knows each vendor's pricing model:

  • Snowflake credits → USD with the customer's contract rate (configurable)
  • Databricks DBU → USD by SKU
  • BigQuery slot-seconds and bytes scanned
  • AWS RDS instance-hour pricing
  • Cloud-native rates for compute

If your contract rate isn't configured, we fall back to list price and clearly mark the estimate as list-price based.
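
For example, the Snowflake conversion reduces to credits times a rate, with a list-price fallback. The rate below is a placeholder, not actual pricing:

SNOWFLAKE_LIST_PRICE_PER_CREDIT = 3.00   # USD; placeholder value

def snowflake_credits_to_usd(credits: float, contract_rate: float | None) -> tuple[float, bool]:
    """Return (estimated_usd, used_list_price) so the UI can mark list-price estimates."""
    if contract_rate is not None:
        return credits * contract_rate, False
    return credits * SNOWFLAKE_LIST_PRICE_PER_CREDIT, True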

Confidence calibration

After a recommendation is applied, the before/after impact panel computes actual savings from billing data. If actual savings come in below 50% of the estimate, the recommendation is marked overstated, and the savings estimator's bias term is updated for similar future recs. This is what we mean by calibration.
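
In sketch form, with the 50% cutoff from above and an illustrative smoothing rule standing in for the actual bias update:

OVERSTATED_RATIO = 0.5   # actual below 50% of estimate -> overstated

def calibrate(estimated: float, actual: float, bias: float) -> tuple[bool, float]:
    """Return (overstated, updated_bias); bias scales future estimates for similar recs."""
    ratio = actual / estimated if estimated > 0 else 1.0
    overstated = ratio < OVERSTATED_RATIO
    updated_bias = 0.8 * bias + 0.2 * ratio   # illustrative exponential smoothing
    return overstated, updated_bias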

Apply workflows

Three apply paths:

  1. Manual — copy the remediation, deploy yourself
  2. Auto-apply policy — define a policy in Settings → Auto-Apply scoped by category, severity, and vendor
  3. Ticket sync — push as JIRA / ServiceNow ticket; bi-directional status sync

DDL changes default to ticket sync. Auto-apply is enabled by default for low-risk categories like auto-suspend tuning.
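
As a sketch, an auto-apply policy could serialize to something like the following; the field names are illustrative, not the actual Settings schema:

{
  "name": "low-risk-snowflake",
  "categories": ["auto_suspend_tuning"],
  "max_severity": "medium",
  "vendors": ["snowflake"],
  "exclude_ddl": true,
  "enabled": true
}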
