
The Feature Flag Playbook for Rolling Out LLM Features Safely

LLM features fail differently from normal code. They get slow, they get expensive, they get weird. Three flags that let you ship them without a 3am rollback.

A teammate shipped a "summarise this thread" feature behind a single on/off flag. Two hours after the 50% rollout, OpenAI had a regional outage and half the inbox UI hung waiting on a Promise that never resolved. The rollback worked. The postmortem was a very long meeting.

We changed the playbook the next week.

Why a single flag isn't enough

A normal feature flag is binary: feature on, feature off. An LLM feature has at least four failure modes that need independent control: provider availability, cost spikes, latency degradation, and output quality regression. One Boolean cannot express "ship to 20% of users, but only on the cheap model, and only if median latency stays under 2 seconds."

You need three flags, not one. Treat them as a small policy, not a switch.

The three-flag pattern

interface LLMFeatureFlags {
  enabled: boolean;          // master kill switch
  rollout_percent: number;   // 0-100, who sees it
  model_tier: "premium" | "standard" | "fallback";  // cost lever
}

async function summarize(threadId: string, user: User): Promise<string | null> {
  const flags = await flagService.evaluate("thread_summary", user);
  if (!flags.enabled) return null;
  if (hash(user.id) % 100 >= flags.rollout_percent) return null;  // stable per-user bucket

  const model = MODEL_BY_TIER[flags.model_tier];
  try {
    return await withTimeout(
      llm.summarize(threadId, { model }),
      flags.model_tier === "fallback" ? 8000 : 4000,  // slower fallback model gets more headroom
    );
  } catch (err) {
    metrics.increment("llm.summary.error", { model });
    return null;  // graceful degradation
  }
}
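The snippet leans on two helpers it doesn't define, `hash` and `withTimeout`. A minimal sketch of both, assuming a plain FNV-1a string hash for stable bucketing and a `Promise.race`-style timeout (names and implementations are illustrative, not from our actual codebase):

```typescript
// Deterministic string hash (FNV-1a, 32-bit) so a given user lands in
// the same rollout bucket on every request.
function hash(id: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < id.length; i++) {
    h ^= id.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Race the LLM call against a timer, so a hung provider becomes a
// catchable error instead of a Promise that never resolves.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise((resolve, reject) => {
    const t = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    p.then(
      (v) => { clearTimeout(t); resolve(v); },
      (e) => { clearTimeout(t); reject(e); },
    );
  });
}
```

The hash matters more than it looks: `Math.random()` bucketing would flicker the feature on and off for the same user between requests.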

Why it works

The master switch handles incidents. Flip it and the feature disappears in seconds. The rollout percentage handles canaries. Ship to 1%, watch the dashboards, ramp from there. The model tier handles cost and provider failures. When GPT-4 has an outage, you flip premium → fallback and switch to a cheaper, slower, more available model without redeploying anything. Three orthogonal levers cover all four failure modes.
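Concretely, an incident response is just a flag-document edit. A sketch of three states one feature might move through (values illustrative; the interface repeats the one defined above for self-containment):

```typescript
interface LLMFeatureFlags {
  enabled: boolean;
  rollout_percent: number;
  model_tier: "premium" | "standard" | "fallback";
}

// Normal operation: halfway through the ramp, on the good model.
const steadyState: LLMFeatureFlags = { enabled: true, rollout_percent: 50, model_tier: "premium" };

// Provider outage: same users, cheaper and more available model, no redeploy.
const providerDown: LLMFeatureFlags = { enabled: true, rollout_percent: 50, model_tier: "fallback" };

// Quality or cost incident: master kill switch, feature gone in seconds.
const killed: LLMFeatureFlags = { enabled: false, rollout_percent: 50, model_tier: "premium" };
```

Note that `killed` leaves the other two fields untouched: when you flip `enabled` back on, you resume exactly where the rollout left off.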

When to use the full pattern

Any user-facing LLM feature with cost or latency sensitivity, which is most of them. Internal tools can usually skip the percent rollout. Keep the master and tier flags even there. The cost lever pays for itself the first time you avoid a midnight bill spike. I've seen it happen.
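The internal-tool variant can be as small as dropping the rollout field. A sketch, with a made-up type name (`InternalLLMFlags` is not from the code above):

```typescript
// Reduced policy for internal tools: no percent rollout, but the
// kill switch and cost lever stay.
interface InternalLLMFlags {
  enabled: boolean;
  model_tier: "premium" | "standard" | "fallback";
}

function allowed(flags: InternalLLMFlags): boolean {
  // No hashing step: every internal user is in while the switch is on.
  return flags.enabled;
}
```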

When not to

Prototypes and internal demos that you can just turn off entirely. Also don't introduce this pattern reactively after an incident. Add it before the feature touches users, when the cost of designing the policy is cheap.

How a flag decision flows

flowchart TD
    Req[User request] --> M{enabled?}
    M -- No --> Off[Return null<br/>fall back to non-LLM UX]
    M -- Yes --> R{hash<br/>rollout_percent?}
    R -- Out --> Off
    R -- In --> T{model_tier}
    T -- premium --> P[GPT-4 path<br/>4s timeout]
    T -- standard --> S[GPT-4o-mini path<br/>4s timeout]
    T -- fallback --> F[Cheap model<br/>8s timeout]
    P --> RsP[Return]
    S --> RsS[Return]
    F --> RsF[Return]
    P -. error .- Off
    S -. error .- Off
    F -. error .- Off

Conclusion

Wire the kill switch first, before the feature even works. A flag that says "off" everywhere costs you nothing today and saves a deploy in the middle of your next provider incident. There will be a next one. I would bet money on it. I have, technically.