The Problem Nobody Talks About

LLM infrastructure spend has exploded; it now accounts for roughly half of all enterprise gen-AI investment. But here's what happens next: someone hooks the LLM up to the database and says "go query stuff."

One hallucinated query later, your ClickHouse cluster is melting. Full scan on a 500 GB table. No WHERE clause on the date column. A JOIN that should have filtered to 1,000 rows returns 50 million instead. Runaway costs. Angry teams.

Three Guardrail Patterns That Actually Work

The fix isn't to disable AI. It's to give it guardrails.

Pattern A: Query Templates, Not Raw SQL

Instead of letting the LLM generate SQL from scratch, give it a menu of pre-vetted templates. The LLM's job is to choose the right template and fill in parameters.

-- Template 1: "Events by day for a specific user"
SELECT
    toDate(event_time) AS day,
    count() AS events
FROM events
WHERE user_id = $1
  AND event_time >= $2
  AND event_time < $3
GROUP BY day

-- Template 2: "Top 10 events by count"
SELECT
    event_type,
    count() AS cnt
FROM events
WHERE event_time >= $1
  AND event_time < $2
GROUP BY event_type
ORDER BY cnt DESC
LIMIT 10

LLM input: "Show events for user 12345 this week"

LLM decides: Template 1 is the right fit and fills in $1 = 12345, $2 = 2025-01-01, $3 = 2025-01-08.

Result: No surprise full scans. No runaway queries. Just parameter binding.
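
Here's a minimal sketch of what that dispatch layer can look like, using the official @clickhouse/client package and ClickHouse's named-parameter binding instead of the positional $1/$2 placeholders above. The template ids, parameter names, and connection URL are illustrative, not from a real codebase.

import { createClient } from '@clickhouse/client';

// Illustrative connection; in practice the client is configured once and shared.
const client = createClient({ url: 'http://localhost:8123' });

// Pre-vetted templates. ClickHouse binds {name:Type} parameters server-side,
// so no model-generated SQL string is ever executed.
const TEMPLATES = {
  events_by_day_for_user: `
    SELECT toDate(event_time) AS day, count() AS events
    FROM events
    WHERE user_id = {user_id:UInt32}
      AND event_time >= {from:DateTime}
      AND event_time <  {to:DateTime}
    GROUP BY day`,
  top_events: `
    SELECT event_type, count() AS cnt
    FROM events
    WHERE event_time >= {from:DateTime}
      AND event_time <  {to:DateTime}
    GROUP BY event_type
    ORDER BY cnt DESC
    LIMIT 10`,
} as const;

// The LLM's entire output is a template id plus parameter values.
export async function runTemplate(id: string, params: Record<string, string | number>) {
  // Reject hallucinated template ids before anything reaches ClickHouse.
  if (!(id in TEMPLATES)) {
    throw new Error(`Unknown template: ${id}`);
  }
  const result = await client.query({
    query: TEMPLATES[id as keyof typeof TEMPLATES],
    query_params: params,
    format: 'JSONEachRow',
  });
  return result.json();
}

// e.g. runTemplate('events_by_day_for_user', { user_id: 12345, from: '2025-01-01 00:00:00', to: '2025-01-08 00:00:00' })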

Pattern B: Narrow Schema via Materialized Views

Don't give the LLM access to the raw firehose. Give it a curated, pre-aggregated view.

-- What the LLM sees:
CREATE TABLE safe_analytics (
    date Date,
    event_type String,
    user_segment String,
    event_count UInt64,
    unique_users UInt64
) ENGINE = SummingMergeTree
ORDER BY (date, event_type, user_segment);

-- Instead of:
CREATE TABLE raw_events (
    timestamp DateTime,
    user_id UInt32,
    device_id String,
    ip_address String,
    user_agent String,
    event_type String,
    event_properties JSON,
    ...
);

The safe_analytics table is small, fast, and pre-aggregated. Queries run in 100–300ms instead of seconds. The LLM can't accidentally do a 3-minute full scan because the data is already rolled up.
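
At the application layer, Pattern B can be as small as one tool that only knows how to read safe_analytics. A minimal sketch using @clickhouse/client (the function name, date-range contract, and connection URL are assumptions for this example):

import { createClient } from '@clickhouse/client';

const client = createClient({ url: 'http://localhost:8123' }); // illustrative

// The only query surface the LLM gets: a fixed table, fixed columns, mandatory date range.
export async function querySafeAnalytics(from: string, to: string) {
  const result = await client.query({
    query: `
      SELECT
        date,
        event_type,
        user_segment,
        sum(event_count)  AS event_count,  -- SummingMergeTree rolls up lazily, so sum() at query time
        sum(unique_users) AS unique_users
      FROM safe_analytics
      WHERE date >= {from:Date}
        AND date <  {to:Date}
      GROUP BY date, event_type, user_segment
      ORDER BY date`,
    query_params: { from, to },
    format: 'JSONEachRow',
  });
  return result.json();
}

Pair this with a dedicated ClickHouse user that is only granted SELECT on safe_analytics, and raw_events stays out of reach even if the application layer slips.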

Pattern C: Query Budget Enforcement

Before an LLM-generated query runs, validate it against an explicit budget. Instead of executing whatever the LLM asks for, you:

  • Let the LLM generate builder calls (or pick a function like getEventsByDay).

  • Inspect the resulting query’s shape (estimated rows, time, presence of filters).

  • Reject or adjust anything that exceeds your budget before it hits ClickHouse.

Conceptually:

import { QueryBuilder } from '@hypequery/clickhouse';

class QueryBudgetError extends Error {}

/**
 * Execute a hypequery builder only if it fits within a cost/latency budget.
 */
export async function executeWithBudget<T>(
  builder: QueryBuilder<T>,
  opts: { maxRows?: number; maxMs?: number } = {}
): Promise<T[]> {
  const { maxRows = 10_000_000, maxMs = 5_000 } = opts;

  // 1) Ask ClickHouse for an execution estimate via hypequery
  //    Assume hypequery exposes a typed explain() that returns cost metadata
  const plan = await builder.explain(); 
  // e.g. plan = { readRows: number; estimatedTimeMs: number; sql: string }

  // 2) Enforce row-scan budget
  if (plan.readRows > maxRows) {
    throw new QueryBudgetError(
      `Estimated scan ${plan.readRows.toLocaleString()} rows (> ${maxRows.toLocaleString()}). ` +
        `Ask for a smaller time range or narrower filter.`
    );
  }

  // 3) Enforce latency budget
  if (plan.estimatedTimeMs > maxMs) {
    throw new QueryBudgetError(
      `Estimated time ${plan.estimatedTimeMs}ms (> ${maxMs}ms). ` +
        `Consider using a pre-aggregated metric instead.`
    );
  }

  // 4) Optional: ensure there's a partition filter on date/event_time somewhere in the WHERE clause
  if (!/WHERE[\s\S]*\b(date|event_time)\b/i.test(plan.sql)) {
    throw new QueryBudgetError(
      `Query must filter on date/event_time. Add something like WHERE date >= today() - 7.`
    );
  }

  // 5) Only now, actually execute
  return builder.execute();
}

This is your safety net.
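
For example, a tool function the LLM can call might wrap its pre-built query like this. The builder construction is elided, and getEventsByDay, the module path, and the budget numbers are illustrative:

import { QueryBuilder } from '@hypequery/clickhouse';
// executeWithBudget and QueryBudgetError come from the sketch above (hypothetical module path).
import { executeWithBudget, QueryBudgetError } from './execute-with-budget';

interface EventsByDayRow {
  day: string;
  events: number;
}

// Hypothetical LLM-facing tool: the builder is assembled elsewhere from a vetted
// template, then checked against a tighter-than-default budget before it runs.
export async function getEventsByDay(
  builder: QueryBuilder<EventsByDayRow>
): Promise<EventsByDayRow[] | { error: string }> {
  try {
    return await executeWithBudget(builder, { maxRows: 5_000_000, maxMs: 2_000 });
  } catch (err) {
    if (err instanceof QueryBudgetError) {
      // Hand the budget message back to the LLM so it can retry with a narrower request.
      return { error: err.message };
    }
    throw err;
  }
}

The failure path is the point: a rejected query becomes a structured error the LLM can read and react to, not a multi-minute scan it silently kicked off.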

The 3-Step Rollout

Week 1–2: Safe Read-Only Access

  • Grant the LLM read-only access to a single materialized view (e.g., daily_metrics).

  • No templates yet; just let it explore within a bounded schema.

  • Monitor: Look at system.query_log for slow queries or patterns that surprise you (a monitoring sketch follows this list).
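
A minimal monitoring sketch, assuming the LLM connects as a dedicated ClickHouse user (llm_readonly is a made-up name) and using the official @clickhouse/client package:

import { createClient } from '@clickhouse/client';

const client = createClient({ url: 'http://localhost:8123' }); // illustrative

// Surface the slowest LLM-issued queries from the last 24 hours.
export async function slowLlmQueries() {
  const result = await client.query({
    query: `
      SELECT query, query_duration_ms, read_rows, read_bytes
      FROM system.query_log
      WHERE type = 'QueryFinish'
        AND user = {user:String}
        AND event_time >= now() - INTERVAL 1 DAY
      ORDER BY query_duration_ms DESC
      LIMIT 20`,
    query_params: { user: 'llm_readonly' },
    format: 'JSONEachRow',
  });
  return result.json();
}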

Week 3–4: Template Layer

  • Add 3–5 hand-written query templates covering 80% of your actual use cases.

  • LLM chooses templates and fills parameters.

  • Monitor: 95% of queries should hit templates; 5% should be exceptions (see the measurement sketch after this list).
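
One way to measure that split: tag every templated query with ClickHouse's log_comment setting when it runs (in the runTemplate sketch from Pattern A, add clickhouse_settings: { log_comment: `template:${id}` } to the query call), then group on it in system.query_log. The 'template:' prefix is just a convention for this sketch:

import { createClient } from '@clickhouse/client';

const client = createClient({ url: 'http://localhost:8123' }); // illustrative

// Template-vs-exception split for the LLM's user over the last week.
export async function templateHitRate(llmUser = 'llm_readonly') {
  const result = await client.query({
    query: `
      SELECT
        startsWith(log_comment, 'template:') AS is_template,
        count() AS queries
      FROM system.query_log
      WHERE type = 'QueryFinish'
        AND user = {user:String}
        AND event_time >= now() - INTERVAL 7 DAY
      GROUP BY is_template`,
    query_params: { user: llmUser },
    format: 'JSONEachRow',
  });
  return result.json();
}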

Week 5+: Type-Safe Generation (Optional)

  • If templates feel limiting, move to type-safe query generation (hypequery, SQLx, etc.).

  • The type system catches schema hallucinations, such as invented column names or mismatched types, at compile time rather than at runtime; see the sketch after this list.

  • Result: AI flexibility with safety built in.
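
A tiny illustration of what "caught at compile time" means in practice. The row interface and selectColumns helper below are made up for this sketch; a real type-safe builder would derive them from your actual ClickHouse schema:

// Illustrative schema type; in a real setup this is generated, not hand-written.
interface SafeAnalyticsRow {
  date: string;
  event_type: string;
  user_segment: string;
  event_count: number;
  unique_users: number;
}

// A toy "builder" that only accepts real column names.
function selectColumns<K extends keyof SafeAnalyticsRow>(columns: K[]): K[] {
  return columns; // a real builder would feed these into generated SQL
}

selectColumns(['date', 'event_count']);   // compiles
// selectColumns(['date', 'evnt_count']); // TypeScript error before anything reaches ClickHouse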

The Bottom Line

LLM + database is going to be the default. But "default" doesn't mean "unguarded."

Start narrow. Start safe. Expand intentionally.

Your cluster will thank you.

Till next time,

Faster Analytics Fridays
