LLM Truncated JSON: The finish_reason Gotcha That Bridges OpenAI and Vertex
An LLM truncated JSON response surfaced a cryptic Unterminated string error in production. The fix wasn't bigger max_tokens — it was a finish_reason guard that handles OpenAI's length and Vertex/Gemini's MAX_TOKENS in the same code path.
A production endpoint started returning JSONDecodeError: Unterminated string starting at: line 47 column 12 (char 1832) to end users. The endpoint generated a German interview-prep guide — five categories, sample questions, suggested talking points — by asking the model for a structured JSON object and parsing it on the way back. Outputs looked fine in development. They looked fine in staging. The error showed up in production, and only in German, and only on the longer guides.
The root cause was not a parser bug. The model was hitting max_tokens mid-response and slicing the JSON in half. The fix took five lines and one parametrised test. Understanding why this slipped past our tests, and why our truncation guard fired on OpenAI but not on Vertex AI, took the rest of the afternoon. This is the writeup.
The bug surfaced inside Wield, our recruiting-pipeline product (the cvflow codebase internally), where the interview-prep service asks an LLM to produce a JSON guide of roughly 6,000 to 8,000 output tokens. The same pattern lurks in any Python service that talks to multiple LLM vendors by pointing openai.AsyncOpenAI at an OpenAI-compatible endpoint such as Vertex or OpenRouter. If you don't centralise truncation detection, the symptom is a long tail of mystery JSON parse errors in your error tracker, all from the same code path, none reproducible on demand.
Why a Healthy LLM Call Returns Broken JSON
The contract for chat completions with structured output is straightforward: ask the model for a JSON object via response_format={"type": "json_object"}, get back something json.loads can handle. What the contract does not promise — and what an awful lot of code silently assumes — is that the model will fit its answer inside the token budget you set.
When the model runs out of tokens, the response still comes back. The HTTP status is 200. The choices[0].message.content field contains a string. The string stops wherever the token budget ran out, which can be mid-string, mid-key, mid-value, or mid-array. json.loads then dies with one of the unhelpful messages the standard library reserves for hostile input: Unterminated string, Expecting value, Expecting property name enclosed in double quotes. The line and column numbers point inside the model's response, which is not in your repo, so the error tells you nothing about how to fix it.
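The failure is easy to reproduce without calling any model. The following is a minimal sketch, assuming a tiny hypothetical payload in place of a real guide: it cuts a valid JSON string partway through a value and shows what json.loads reports.

```python
import json
from typing import Optional

# Hypothetical miniature of a guide payload; real guides are far larger.
payload = json.dumps(
    {"category": "Motivation", "questions": ["Warum wir?", "Was reizt Sie?"]}
)


def parse_error(text: str) -> Optional[str]:
    """Return the json.loads error message for text, or None if it parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return exc.msg
    return None


# Simulate the token budget running out partway through the first question.
truncated = payload[: payload.index("wir")]

print(parse_error(payload))    # None: the complete payload parses
print(parse_error(truncated))  # an "Unterminated string ..." message
```

Nothing in the error message mentions tokens or truncation, which is why the parse error alone is a poor diagnostic.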
The reason this passes tests is that during development you write short prompts that produce short responses. Our German guides hit max_tokens=4096 because German sentences are roughly 10 to 15 percent longer than English ones, the structured prompt asked for five categories instead of three, and the model was helpfully verbose with examples. The English tests never tripped the cap.
The signal you need is sitting one field over. The OpenAI SDK exposes a finish_reason on each choice, documented in the OpenAI API reference. On a clean stop it's "stop". On a tool call it's "tool_calls". On a truncated response it's "length". Every truncated answer carries the diagnostic; nothing in the SDK forces you to look at it before parsing.
The First Fix Wasn't the Real Fix
The obvious move is to raise max_tokens until the guide fits. We did that — bumped from 4096 to 8192 — and the immediate Sentry issue went quiet. Real guides are 6,000 to 8,000 output tokens, so 8,192 is enough headroom for ninety-nine percent of cases. We could have stopped there. We didn't, because the same code path runs across a dozen prompt templates, each with its own max_tokens, and we'd just be waiting for the next one to clip a German edge case.
The right fix is centralisation. The LLM provider registry — the thin layer that wraps openai.AsyncOpenAI and dispatches across providers — is the only place where every completion call passes through. That's where the truncation guard belongs:
    choice = response.choices[0] if response.choices else None
    message = getattr(choice, "message", None) if choice else None
    content = getattr(message, "content", None) if message else None
    finish_reason = getattr(choice, "finish_reason", "unknown") if choice else "no_choices"
    if content is None:
        raise ServiceError(
            service="llm_registry",
            operation="complete",
            message=f"LLM returned empty content ({provider}/{model}, finish_reason={finish_reason})",
        )
    if str(finish_reason or "").lower() in ("length", "max_tokens"):
        raise ServiceError(
            service="llm_registry",
            operation="complete",
            message=(
                f"LLM response truncated at max_tokens={max_tokens} "
                f"({provider}/{model}, finish_reason={finish_reason}). "
                "Raise max_tokens or shorten the prompt."
            ),
        )
This turns the JSONDecodeError 200 metres downstream into a typed ServiceError at the boundary, with the provider, the model, the max_tokens cap, and the actual finish_reason in the message. The caller — InterviewPrepService.generate_questions in our case — surfaces it to the user as "the guide didn't fit, retry with fewer categories" instead of "Unterminated string." Crucially, the partial JSON never reaches json.loads; we fail at the layer that has the context to fix the failure.
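To see the guard and the caller working together, here is a self-contained sketch. It is not the actual registry code: the ServiceError class is a stand-in (its real definition is not shown here), and checked_content, fake_response, user_message, and the model name "example-model" are hypothetical names introduced only for illustration.

```python
from types import SimpleNamespace


class ServiceError(Exception):
    """Stand-in for the ServiceError used above; its real definition is not shown."""

    def __init__(self, service: str, operation: str, message: str) -> None:
        super().__init__(message)
        self.service = service
        self.operation = operation


def checked_content(response, *, provider: str, model: str, max_tokens: int) -> str:
    """Return the completion text, or raise before any JSON parsing happens."""
    choice = response.choices[0] if response.choices else None
    message = getattr(choice, "message", None) if choice else None
    content = getattr(message, "content", None) if message else None
    finish_reason = getattr(choice, "finish_reason", "unknown") if choice else "no_choices"

    if content is None:
        raise ServiceError(
            "llm_registry", "complete",
            f"LLM returned empty content ({provider}/{model}, finish_reason={finish_reason})",
        )
    if str(finish_reason or "").lower() in ("length", "max_tokens"):
        raise ServiceError(
            "llm_registry", "complete",
            f"LLM response truncated at max_tokens={max_tokens} "
            f"({provider}/{model}, finish_reason={finish_reason})",
        )
    return content


def fake_response(content, finish_reason):
    """Hypothetical test double shaped like a chat completion response."""
    choice = SimpleNamespace(
        message=SimpleNamespace(content=content), finish_reason=finish_reason
    )
    return SimpleNamespace(choices=[choice])


def user_message(response) -> str:
    """Caller-side handling: map the typed error to something actionable."""
    try:
        checked_content(response, provider="vertex", model="example-model", max_tokens=8192)
    except ServiceError:
        return "The guide didn't fit, retry with fewer categories."
    return "ok"


print(user_message(fake_response('{"categories": []}', "stop")))       # ok
print(user_message(fake_response('{"categories": [', "MAX_TOKENS")))  # retry hint
```

The caller never sees the partial string, so there is no code path on which json.loads can receive it.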
The same instinct shows up in our FastAPI rate limit headers post: when a middleware layer silently swaps behaviour based on the values it finds in kwargs, the bug always lands two abstraction levels deeper than the cause. Catching the signal at the call site, with the call site's context, is cheaper than reverse-engineering from a stack trace.
The Cross-Provider Catch That Almost Made the Guard Dead Code
We shipped the guard. Sentry stayed quiet for a day. Then the same Unterminated string error came back, from the same endpoint, on the same German guides.
The guard we first shipped used a strict check, finish_reason == "length" (the snippet above already shows the corrected comparison). That string is the OpenAI Chat Completions literal. Our production path doesn't run on OpenAI. We dropped the direct OpenAI provider months ago and consolidated on Vertex AI for Gemini. The Vertex endpoint speaks the OpenAI-compatible protocol via Google's shim, which is great for ninety-five percent of fields but leaks the native enum on finish_reason. Gemini's enum, documented in the Vertex AI Gemini API reference, uses MAX_TOKENS, not length. The compat shim does not translate it. Our guard matched a string that the production path never emits.
This is the failure mode that makes cross-vendor LLM code dangerous. The compatibility layer is partial: it translates most fields into OpenAI shapes, leaves the rest as their native values, and does not tell you which is which. If you trust the wrapper end-to-end, you write code that runs against a synthetic interface that no real vendor produces. The unit tests pass because the mock library returns whatever literal you typed.
The real check is case-insensitive against both literals:
    if str(finish_reason or "").lower() in ("length", "max_tokens"):
        ...
"MAX_TOKENS" from Vertex, "max_tokens" from a forward-compatible OpenRouter route, "Length" from a future SDK that decides to title-case its enums, "length" from OpenAI itself — all collapse to the same branch. The regression test is parametrised over all four spellings, which is the only way to lock in cross-provider coverage without four separate fixtures:
    @pytest.mark.parametrize(
        "finish_reason",
        ["length", "MAX_TOKENS", "max_tokens", "Length"],
    )
    async def test_complete_raises_when_response_truncated_at_max_tokens(finish_reason):
        ...
Test parametrisation is the smallest possible insurance against the next vendor's quirky enum casing. If you only test the literal that your current provider emits, you are writing a guard that the next migration will silently disable.
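What the four parameters buy can be shown without a test runner. This is a small stdlib-only sketch, not the project's test suite; strict_guard and normalised_guard are hypothetical names standing in for the first and second versions of the check.

```python
# The four spellings from the parametrised test.
SPELLINGS = ["length", "MAX_TOKENS", "max_tokens", "Length"]


def strict_guard(finish_reason) -> bool:
    """First version: matches only the OpenAI literal."""
    return finish_reason == "length"


def normalised_guard(finish_reason) -> bool:
    """Second version: case-insensitive match against both literals."""
    return str(finish_reason or "").lower() in ("length", "max_tokens")


missed_by_strict = [s for s in SPELLINGS if not strict_guard(s)]
missed_by_normalised = [s for s in SPELLINGS if not normalised_guard(s)]

print(missed_by_strict)      # ['MAX_TOKENS', 'max_tokens', 'Length']
print(missed_by_normalised)  # []
```

A test that exercises only "length" passes against both versions, so it cannot distinguish the guard that works in production from the one that does not.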
Three Things to Steal for Your Own LLM Wrapper
The headline lesson is one line: check finish_reason before you parse. The reasons it generalises are worth the longer read.
Centralise the truncation guard at the provider boundary. Every completion call goes through one wrapper. Putting the guard at the per-caller layer means twenty implementations, twenty test files, and one of them is always out of date. Putting it at the wrapper means one regression test guards every prompt template you will ever ship. The same principle drives the schema-level tool filtering we wrote about in AI agent approval gates in Next.js: the policy belongs at the boundary, not at each call site.
Treat the OpenAI-compat shim as a leaky abstraction. Vertex, OpenRouter, Azure OpenAI, and most "OpenAI-compatible" endpoints translate the convenient fields and leak the rest. finish_reason is one of the leakiest. So is usage.prompt_tokens versus usage.input_tokens. Audit every field your code branches on. The cheap fix is to normalise at the boundary — lowercase the enum, alias the token-count fields — so the downstream code never has to know which vendor it's talking to.
Parametrise the test, not the literal. Cross-provider regressions are nearly always casing or naming differences. A pytest.mark.parametrize over ["length", "MAX_TOKENS", "max_tokens", "Length"] is a few lines of test code and covers the vendor variations we have seen plus the ones we can reasonably anticipate. Failing tests are easier to fix than mystery production errors, and the test fixtures double as documentation for the next engineer wondering why the comparison is case-insensitive.
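The boundary normalisation from the second point can be sketched as two small helpers. This is an illustration under stated assumptions, not our registry code: normalise_finish_reason and normalise_usage are hypothetical names, the canonical value "truncated" is an invented convention, and the completion_tokens/output_tokens pairing is assumed by analogy with the prompt_tokens/input_tokens pair mentioned above.

```python
from types import SimpleNamespace

# Known truncation literals, lowercased. Extend if a provider emits another.
_TRUNCATED = {"length", "max_tokens"}


def normalise_finish_reason(raw) -> str:
    """Collapse vendor spellings into one lowercase vocabulary."""
    value = str(raw or "").lower()
    if value in _TRUNCATED:
        return "truncated"
    return value or "unknown"


def normalise_usage(usage) -> dict:
    """Read token counts under either naming scheme; missing values become 0."""

    def first(*names):
        for name in names:
            value = getattr(usage, name, None)
            if value is not None:
                return value
        return 0

    return {
        "input_tokens": first("prompt_tokens", "input_tokens"),
        "output_tokens": first("completion_tokens", "output_tokens"),
    }


openai_style = SimpleNamespace(prompt_tokens=120, completion_tokens=45)
other_style = SimpleNamespace(input_tokens=120, output_tokens=45)

print(normalise_finish_reason("MAX_TOKENS"))  # truncated
print(normalise_usage(openai_style) == normalise_usage(other_style))  # True
```

Downstream code then branches on one vocabulary, and adding a provider means touching only these helpers.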
The same pattern that broke our German interview guide is the pattern that will break any structured-output workload that crosses vendor boundaries. If you are designing the data path for an LLM-backed feature and want a second pair of eyes on the boundary code, the wrapper layer, and the truncation handling before it lands in production — book a free AI Potenzial-Check — or read how we think about scaling AI agents from pilot to production for the broader architectural context.