Why AI Agent Pilots Stall in Switzerland — and the Four Decisions That Determine Whether They Scale
Fewer than 1 in 4 organisations experimenting with AI agents have scaled one to production. The gap isn't technical. It's four architectural decisions made in the pilot phase that are expensive to undo later.
Fewer than 1 in 4 organisations experimenting with AI agents have scaled one to production. This isn't a technology problem. The models work. The APIs are stable. The gap is organisational: four decisions made in the pilot phase that feel low-stakes at the time and become expensive to undo once the pilot ends.
This post describes those four decisions, why they're easy to get wrong in a pilot, and what making them correctly looks like in practice. The data points are from the Swiss market. The patterns apply anywhere.
The pilot problem
67% of Swiss SMEs with fewer than 250 employees plan to integrate at least one AI tool into their processes by the end of 2026. Only 18% have a structured plan for how to do it.
That gap — 67% intent, 18% structure — is where pilots live. A pilot is what you build when you have intent but not structure. It's isolated from production data, exempted from compliance review, scoped to a single workflow, and measured by whether the output looks plausible. It answers the question: "can this work?" It doesn't answer: "how do we run this?"
88% of early AI adopters report positive ROI on at least one generative AI use case. But scaling from one working use case to an agent that runs reliably in production requires answering four questions the pilot never asked.
Decision 1: Where does the agent read from?
In a pilot, the agent reads from a curated dataset. A spreadsheet export. A cleaned sample. A subset of the database with no PII. The output looks good because the input was designed to produce good output.
In production, the agent reads from operational data: live records with inconsistent formatting, missing fields, legacy entries from 2012 that don't match your current schema, and customer names that contain characters your text parser mangles.
The decision to make in the pilot phase: what is the agent's data access model? Not "what data do we give it for the demo" — what is the access pattern that will run in production?
This means answering:
- Does the agent query a database directly, or does it go through an abstraction layer that normalises and validates inputs?
- What happens when a required field is null?
- Which records are out of scope — and does the agent know they're out of scope before it retrieves them, or after?
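One way to make these three questions concrete is a thin access layer that sits between the agent and the database. The sketch below is illustrative only: the field names, the `IN_SCOPE_COUNTRIES` rule, and the `read_for_agent` function are assumptions made for the example, not part of any particular product.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical field names and scope rule, for illustration only.
REQUIRED_FIELDS = ("customer_id", "name", "country")
IN_SCOPE_COUNTRIES = {"CH"}


@dataclass
class AccessResult:
    status: str                    # "ok", "rejected" or "out_of_scope"
    record: Optional[dict] = None  # only set when status is "ok"
    reason: str = ""


def normalise(raw: dict) -> dict:
    """Trim whitespace on string values and turn empty strings into None."""
    cleaned = {}
    for key, value in raw.items():
        if isinstance(value, str):
            value = value.strip() or None
        cleaned[key] = value
    return cleaned


def read_for_agent(raw: dict) -> AccessResult:
    """Single entry point the agent uses instead of querying the database."""
    record = normalise(raw)

    # Scope is checked before any record content is handed to the agent.
    country = record.get("country")
    if country is not None and country not in IN_SCOPE_COUNTRIES:
        return AccessResult("out_of_scope", reason=f"country={country}")

    # A null in a required field is an explicit, named outcome, not a crash.
    missing = [f for f in REQUIRED_FIELDS if record.get(f) is None]
    if missing:
        return AccessResult("rejected", reason="missing: " + ", ".join(missing))

    return AccessResult("ok", record=record)
```

The point of the design is that the agent only ever sees records with status "ok". Nulls and out-of-scope records become countable outcomes that someone can review.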
Teams that design the access model during the pilot can move to production in weeks. Teams that design it after the pilot finishes are rebuilding the agent from the data layer up.
Decision 2: Who is accountable for the agent's actions?
In a pilot, accountability is informal. The agent does something wrong; you fix the prompt. No audit trail required. No one outside the project team sees the output.
In production, every action the agent takes is a business action. If it updates a record, creates an invoice, sends a message, or archives a document, someone is responsible for that action under Swiss law. Under the revised nDSG, deploying automated processing of personal data without adequate oversight risks fines of up to CHF 250,000, which the law directs at the responsible individuals rather than at the company. Only 34% of Swiss companies have defined clear rules for which data employees, let alone agents, are permitted to use.
The decision to make: who approves what, and at which threshold?
This isn't about adding a confirmation dialog to everything. An agent that requires human approval for every action isn't an agent; it's a slow form. The right question is: what is the set of actions where autonomous execution is acceptable, and what is the set where human approval is required before execution? We've written in detail about how to implement this technically (approval gates in an agent loop), but the technical implementation only makes sense once you've answered the policy question.
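Once the policy question is answered, the answer can be written down as data rather than scattered across prompts. The following is a minimal sketch, assuming a hypothetical `POLICY` table with made-up action names and a made-up CHF 500 limit. The real boundary is a business decision, not something this code can supply.

```python
from enum import Enum


class Decision(Enum):
    AUTO = "auto"
    NEEDS_APPROVAL = "needs_approval"


# Hypothetical policy table: action name -> highest amount (CHF) the agent
# may handle on its own. None means the action always needs a human.
POLICY = {
    "archive_document": float("inf"),
    "update_record": float("inf"),
    "create_invoice": 500.0,
    "send_message": None,
}


def decide(action: str, amount_chf: float = 0.0) -> Decision:
    """Return whether an action may run autonomously under POLICY."""
    limit = POLICY.get(action)  # actions not listed also yield None
    if limit is None or amount_chf > limit:
        return Decision.NEEDS_APPROVAL
    return Decision.AUTO
```

Note the default: an action that is missing from the table requires approval. New capabilities then start on the human side of the boundary until someone decides otherwise.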
Define the accountability boundary during the pilot. It determines your entire human-in-the-loop architecture.
Decision 3: What is the failure mode?
In a pilot, failure is visible. The agent produces bad output; you see it; you debug it. The pilot environment is small enough that nothing goes too wrong before someone notices.
In production, failure can be invisible for days. The agent processes 200 records overnight. 12 of them have an edge case your pilot never tested. The error is silent — the agent completes successfully, writes a result, and moves on. By the time someone checks the output, the damage has compounded.
The decision to make: what does this agent do when it's uncertain, and how does the system know when that's happening?
Concretely:
- Does the agent have a tool for expressing uncertainty — something that produces a human-reviewable output instead of a committed result?
- Is there an audit log that records not just what the agent did, but what data it read and what intermediate steps it took?
- What's the rollback path? If the agent's output turns out to be wrong, can you undo the last N actions? Can you undo them for a specific record without touching the others?
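The three questions above can be sketched together in a few dozen lines. This is an illustration under stated assumptions, not a reference implementation. The `AgentRun` class, the in-memory `store`, and the `confidence` score passed in by the caller are all hypothetical. How an agent estimates its own confidence is a separate problem this sketch does not address.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AuditEntry:
    record_id: str
    inputs_read: dict        # what the agent saw
    steps: list              # intermediate steps it reported
    before: Optional[dict]   # state prior to the write, kept for rollback
    after: dict
    rolled_back: bool = False


@dataclass
class AgentRun:
    store: dict                                   # record_id -> current state
    audit_log: list = field(default_factory=list)
    review_queue: list = field(default_factory=list)

    def commit(self, record_id, new_state, inputs_read, steps, confidence,
               min_confidence=0.8):
        """Write the result, or park it for human review if confidence is low."""
        if confidence < min_confidence:
            self.review_queue.append((record_id, new_state, steps))
            return "flagged"
        previous = self.store.get(record_id)
        before = dict(previous) if previous is not None else None
        self.store[record_id] = dict(new_state)
        self.audit_log.append(AuditEntry(
            record_id, dict(inputs_read), list(steps), before, dict(new_state)))
        return "committed"

    def rollback(self, record_id):
        """Undo the latest committed write for one record, leaving others alone."""
        for entry in reversed(self.audit_log):
            if entry.record_id == record_id and not entry.rolled_back:
                if entry.before is None:
                    self.store.pop(record_id, None)
                else:
                    self.store[record_id] = dict(entry.before)
                entry.rolled_back = True  # the log entry itself is kept
                return True
        return False
```

Rolling back marks the audit entry instead of deleting it, so the log still shows that the action happened and was reversed.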
Teams that design for visible failure during the pilot build agents they trust in production. Teams that only test the happy path spend the first six months of production doing incident response.
For AI agents handling sensitive data in Swiss SMEs, the failure mode question intersects directly with data protection obligations. An agent that silently misprocesses personal data is a liability before it's a technical problem.
Decision 4: What does "working" mean in production?
In a pilot, "working" means the output looks right to the person reviewing it. That's a sufficient criterion for a proof of concept. It's not a sufficient criterion for a production system.
In production, you need a measurable definition. Not "the output looks good" — a specific metric that tells you whether the agent is performing correctly, degrading over time, or failing on a class of inputs you haven't seen before.
What that metric looks like depends on the use case. For a document processing agent: precision and recall on extracted fields, measured weekly against a ground-truth sample. For a customer correspondence agent: rate of human overrides as a fraction of total drafts produced. For a data entry agent: error rate per record type, broken down by field.
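Two of these metrics are simple enough to compute in a few lines. The sketch below assumes extracted fields and ground truth arrive as plain dictionaries with hashable values such as strings. The function names and the sample field names in the usage note are made up for the example.

```python
def field_precision_recall(extracted: dict, truth: dict) -> tuple:
    """Precision and recall over (field, value) pairs for one document."""
    predicted = {(k, v) for k, v in extracted.items() if v is not None}
    expected = {(k, v) for k, v in truth.items() if v is not None}
    correct = len(predicted & expected)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


def override_rate(drafts_total: int, drafts_overridden: int) -> float:
    """Share of agent drafts that a human changed or rejected."""
    return drafts_overridden / drafts_total if drafts_total else 0.0
```

For example, if the agent extracts `iban`, `amount` and `date` but gets the date wrong, and the ground truth also contains a `vat` field it missed, two of its three extracted values are correct (precision 2/3) and it found two of four expected values (recall 1/2).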
34% of Swiss companies now use AI to automate specific work steps, up from 23% in 2024. The companies behind that increase are measuring something. The companies that tried and stopped weren't.
Define your success metric during the pilot. Run it against pilot output to establish a baseline. Then track it in production. If the metric degrades, you have a signal. If it improves, you have a number to show.
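The baseline comparison can be as simple as the check below. It is a sketch under two assumptions: the metric is one where higher is better (such as recall), and a fixed drop of `max_drop` is a reasonable trigger. Both the function name and the 0.05 default are hypothetical. The right threshold depends on the use case.

```python
from statistics import mean


def needs_review(baseline: list, recent: list, max_drop: float = 0.05) -> bool:
    """True when the recent average falls more than max_drop below baseline."""
    if not baseline or not recent:
        raise ValueError("both baseline and recent need at least one value")
    return mean(baseline) - mean(recent) > max_drop
```

Here `baseline` would hold the weekly values measured on pilot output and `recent` the latest production values. A `True` result is a prompt for a human to look, not proof that something is broken.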
What scaling actually requires
To move an AI agent from pilot to production, you need four things in place before you flip the switch:
- A data access model designed for operational data, not curated samples — with defined behaviour for nulls, schema mismatches, and out-of-scope records.
- An accountability framework that defines the approval boundary: which actions the agent takes autonomously and which require human confirmation before execution.
- A failure design — an uncertainty signal, an audit log, and a rollback path — so that when the agent processes something it shouldn't, you find out quickly.
- A production metric established during the pilot as a baseline, tracked automatically in production, with an alert threshold that triggers review.
None of these are complex. All of them take time to define correctly, and the time to define them is during the pilot — not after it.
The difference between a Swiss company that scales AI agents and one that cycles through pilots indefinitely is usually one of these four decisions, made late or not at all.
If you're in the middle of an AI pilot and want to pressure-test it against these four questions — book a free AI Potenzial-Check. Bring the pilot. We'll tell you what breaks when you run it in production.