25 March 2026
Conducting the orchestra
How the Laravel AI SDK turns multi-agent workflows into something PHP developers already understand.
The first version of our AI feature was a single API call. One model, one prompt, one response. It worked well enough for demos. It fell apart the moment real users touched it, because real problems are not single-step problems. A customer support query that needs classification, domain-specific research, a drafted response, and a quality check before it reaches anyone is not one task. It is four tasks with dependencies, and pretending a single model call can handle all of them reliably is the kind of shortcut that ships fast and rots faster.
The version that actually worked uses three different providers (Anthropic, OpenAI, Gemini) and five agents coordinated by application code. Not because I wanted architectural complexity, but because the problem demanded specialization. And the tool that made this manageable from a Laravel codebase, without drowning in HTTP clients and provider-specific SDKs, is the Laravel AI SDK.
The pipeline that taught me the most
The Laravel AI SDK is built on top of Prism, the community package by TJ Miller that provides a fluent interface for calling language models. The SDK adds agent classes, tools, structured output, and the kind of Laravel-native integration that makes coordinating API calls to three different companies feel like writing a service class. Agents are PHP classes generated with Artisan. Switching a provider is a one-line change. The orchestration layer never knows or cares which model sits behind an agent contract.
But the abstraction only matters if you build something real with it, so here is what the core pipeline in our SaaS actually looks like.
A request comes in. The first agent is a classifier, running on a small, cheap model. Its job is triage: read the input, return a structured JSON object with a category and a complexity score, and get out of the way. This agent costs almost nothing per call and responds in under a second. Early in development, I had this running on a large model because I assumed classification needed intelligence. It did not. A small model matched the large model’s accuracy on our classification task while costing a fraction of the price. That was the first lesson: in a multi-agent pipeline, waste compounds at every step. A dollar saved on a classification that runs thousands of times per day is not trivial.
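Because every downstream decision rides on the classifier's output, it pays to validate that structured JSON strictly before anything routes on it. A minimal plain-PHP sketch; the `Classification` value object and the taxonomy are illustrative assumptions, not the SDK's API:

```php
<?php
// Sketch: strict validation of the classifier's structured JSON output.
// `Classification` and the taxonomy below are illustrative, not the SDK's API.

final class Classification
{
    public function __construct(
        public readonly string $category,
        public readonly int $complexity, // 1 (trivial) .. 5 (hard)
    ) {}
}

function parseClassification(string $json, array $taxonomy): Classification
{
    // Throw on malformed JSON instead of silently getting null back.
    $data = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

    if (!in_array($data['category'] ?? null, $taxonomy, true)) {
        throw new UnexpectedValueException('Unknown category');
    }

    $complexity = (int) ($data['complexity'] ?? 0);
    if ($complexity < 1 || $complexity > 5) {
        throw new UnexpectedValueException("Complexity out of range: $complexity");
    }

    return new Classification($data['category'], $complexity);
}

// A raw response as the small model might return it:
$result = parseClassification(
    '{"category": "billing", "complexity": 2}',
    ['billing', 'technical', 'account', 'other'],
);
```

Failing loudly here is the point: a small model that drifts off the taxonomy should break the pipeline at the cheapest stage, not poison the expensive ones downstream.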
The classifier’s output feeds into a match expression that routes to a specialist agent. Complex requests go to a reasoning-heavy model. Straightforward ones go to something faster. This is the strategy pattern, the same routing logic you would use in a notification system or a payment gateway. The only difference is that the workers behind the routes are language models instead of PHP classes. The SDK’s Pipeline::send()->through()->thenReturn() handles the chaining. If you have ever piped an uploaded file through validation, scanning, and processing, you have already built this architecture.
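The routing step can be sketched in plain PHP. The specialist closures below stand in for real agent classes, and the thresholds are assumptions, not the SDK's API:

```php
<?php
// Sketch: strategy-pattern routing on the classifier's output.
// The closures stand in for agent classes; names and thresholds are illustrative.

$specialists = [
    'reasoning' => fn (string $input) => "[deep-model] $input",
    'fast'      => fn (string $input) => "[fast-model] $input",
];

function route(string $category, int $complexity, array $specialists): callable
{
    return match (true) {
        $complexity >= 4          => $specialists['reasoning'],
        $category === 'technical' => $specialists['reasoning'],
        default                   => $specialists['fast'],
    };
}

$agent = route('billing', 2, $specialists);
echo $agent('Refund request for order #1042');
// prints "[fast-model] Refund request for order #1042"
```

Swapping a model behind one of these routes touches nothing else, which is exactly the property the strategy pattern is there to buy.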
The specialist agent produces a draft. That draft then goes to an evaluator agent, a separate model with different instructions, whose only job is to score the output against a quality schema and flag specific issues. If the score falls below a threshold, the draft goes back to the specialist with the evaluator’s feedback attached. This loop runs up to three times before the system accepts the best available result and moves on.
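The draft-evaluate-revise loop reduces to a bounded retry that keeps the best-scoring draft seen so far. A sketch with deterministic stand-in closures; the threshold, pass cap, and result shapes are assumptions:

```php
<?php
// Sketch: draft → evaluate → revise, capped at three passes. The closures
// stand in for the specialist and evaluator agents; shapes are illustrative.

function refine(
    callable $specialist,
    callable $evaluator,
    string $input,
    float $threshold = 0.8,
    int $maxPasses = 3,
): string {
    $bestDraft = '';
    $bestScore = -INF;
    $feedback  = null;

    for ($pass = 1; $pass <= $maxPasses; $pass++) {
        $draft = $specialist($input, $feedback);
        ['score' => $score, 'issues' => $feedback] = $evaluator($draft);

        if ($score > $bestScore) {
            [$bestScore, $bestDraft] = [$score, $draft];
        }
        if ($score >= $threshold) {
            break; // good enough: stop paying for more passes
        }
    }

    return $bestDraft; // best available result, even below threshold
}

// Deterministic stubs to show the flow:
$calls = 0;
$specialist = function (string $input, ?string $feedback) use (&$calls): string {
    $calls++;
    return "draft v$calls";
};
$evaluator = fn (string $draft): array =>
    ['score' => $draft === 'draft v2' ? 0.9 : 0.5, 'issues' => 'too vague'];

$final = refine($specialist, $evaluator, 'customer query');
```

Tracking the best draft rather than the last one matters: if the loop exhausts its passes, you want the highest-scoring attempt, not whatever the final revision happened to produce.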
This is where things got interesting and where things broke.

Where the abstraction leaks
The evaluator loop looked elegant in development. In production, it surfaced a problem I had not anticipated: the evaluator and the specialist disagreed about what “good” meant. The evaluator would flag a response as too verbose. The specialist would rewrite it shorter. The evaluator would then flag it as lacking detail. Two agents, both following their instructions correctly, caught in a loop that oscillated instead of converging.
The fix was not in the prompts. It was in the architecture. I added a feedback accumulator that tracked previous evaluation results across iterations, so the specialist could see the full history of complaints rather than only the most recent one. That context prevented the oscillation. The lesson was that prompt engineering is table stakes in a multi-agent system. The hard problems are in the interfaces between agents: what data flows from one to the next, what context gets preserved, and what gets lost.
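The accumulator itself is a small value holder. A sketch of the idea; the class name and shapes are mine, not the SDK's:

```php
<?php
// Sketch: accumulating evaluator feedback across iterations so the
// specialist sees the full history of complaints, not only the latest one.

final class FeedbackAccumulator
{
    /** @var list<array{pass: int, score: float, issues: string}> */
    private array $history = [];

    public function record(int $pass, float $score, string $issues): void
    {
        $this->history[] = compact('pass', 'score', 'issues');
    }

    /** Render the whole history as context for the next specialist call. */
    public function asPromptContext(): string
    {
        if ($this->history === []) {
            return '';
        }
        $lines = array_map(
            fn (array $h): string => "Pass {$h['pass']} (score {$h['score']}): {$h['issues']}",
            $this->history,
        );
        return "Previous evaluations:\n" . implode("\n", $lines);
    }
}

$feedback = new FeedbackAccumulator();
$feedback->record(1, 0.4, 'too verbose');
$feedback->record(2, 0.5, 'lacking detail');
$context = $feedback->asPromptContext();
```

With both complaints visible at once, the specialist can aim between "too verbose" and "lacking detail" instead of overcorrecting against whichever one it saw last.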
Debugging is the other place where the orchestration model strains. When a single model produces a bad response, you look at the prompt and the output. When a five-agent pipeline produces a bad response, the cause could be in any stage. The classifier might have miscategorized the input. The router might have sent it to the wrong specialist. The specialist might have produced a reasonable draft that the evaluator scored incorrectly. Or the evaluator might have been right and the specialist failed to incorporate its feedback. Tracing a failure through a pipeline is closer to debugging a distributed system than debugging a function call. I ended up logging the full input and output of every agent at every stage, which made the pipeline transparent but also made me realize how much observability infrastructure multi-agent systems demand. If you are not prepared to build that tooling, you are not prepared to run these systems in production.
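The logging I landed on amounts to a decorator around every agent call. In the real app this feeds Laravel's logger; the plain-PHP sketch below just shows the shape, with stub agents standing in for real ones:

```php
<?php
// Sketch: wrap each agent call so its full input and output land in a
// per-request trace, making failures attributable to a specific stage.

function traced(string $stage, callable $agent, array &$trace): callable
{
    return function (mixed $input) use ($stage, $agent, &$trace): mixed {
        $output  = $agent($input);
        $trace[] = [
            'stage'  => $stage,
            'input'  => $input,
            'output' => $output,
            'at'     => microtime(true),
        ];
        return $output;
    };
}

$trace = [];
$classifier = traced('classifier', fn (string $in): string => strtoupper($in), $trace);
$specialist = traced('specialist', fn (string $in): string => "$in!", $trace);

$result = $specialist($classifier('hello'));
```

After a bad response, the trace answers the first diagnostic question for free: which stage received good input and emitted bad output.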
There is also a cost dimension that surprised me. The evaluator loop, when it runs all three iterations, triples the token cost of that pipeline stage. For most requests the first draft passes. But for the 15% of requests that trigger the full loop, the cost per request jumps significantly. I had to add a circuit breaker that skips the evaluator for low-complexity requests where the quality variance is small. This is not a pattern the SDK teaches you. It is a pattern the invoice teaches you.
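The breaker is one guard clause in front of the loop. A sketch; the threshold value is an assumption you would tune per pipeline:

```php
<?php
// Sketch: skip the evaluator loop for low-complexity requests where
// quality variance is small. The threshold is illustrative.

function finalize(string $draft, int $complexity, callable $evaluatorLoop,
                  int $skipBelow = 2): string
{
    if ($complexity < $skipBelow) {
        return $draft; // ship as-is: the evaluator can't earn its cost here
    }
    return $evaluatorLoop($draft);
}

$loopRuns = 0;
$loop = function (string $draft) use (&$loopRuns): string {
    $loopRuns++;
    return "$draft (reviewed)";
};

$cheap = finalize('simple answer', 1, $loop);  // evaluator skipped
$hard  = finalize('complex answer', 4, $loop); // evaluator runs
```

The classifier's complexity score is already in hand at this point, which is what makes the guard essentially free.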
The argument for smaller models in bigger roles
Most writing about AI in production defaults to the assumption that you want the most powerful model you can afford. In a multi-agent system, the opposite is often true.
The classifier in our pipeline is the most important agent in the system. Every downstream decision depends on its output. If it miscategorizes a request, the wrong specialist picks it up, the draft misses the mark, and the evaluator either catches it (expensive) or does not (worse). Despite being the most critical agent, it runs on the cheapest model in the stack. This was counterintuitive until I benchmarked it. For structured classification against a known taxonomy, the small model matched the large model’s F1 score to within two percentage points. The large model was not better at this task. It was just more expensive at it.
The same principle applied to the evaluator. Its job is narrow: score a response against a rubric and return structured JSON. It does not need to generate creative text or reason through ambiguous problems. It needs to be consistent and fast. A mid-tier model with strong schema adherence turned out to be more reliable for this role than a frontier model that occasionally hallucinated extra fields or returned scores outside the defined range.
The expensive models earn their cost in the specialist role, where the task is genuinely hard and the quality ceiling matters. But even there, the SDK’s failover support means I can define a preferred model with automatic fallback to alternatives. If Anthropic’s API is slow or unavailable, the agent falls back to OpenAI transparently. The orchestration logic never changes.
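Failover reduces to try-in-order. The SDK handles this declaratively; the plain-PHP sketch below only shows the shape, with stub providers standing in for real clients:

```php
<?php
// Sketch: provider failover — try the preferred provider, fall back to
// alternatives on failure, surface all errors if everything fails.

function callWithFailover(array $providers, string $prompt): string
{
    $errors = [];
    foreach ($providers as $name => $call) {
        try {
            return $call($prompt);
        } catch (RuntimeException $e) {
            $errors[$name] = $e->getMessage(); // remember why, keep going
        }
    }
    throw new RuntimeException('All providers failed: ' . json_encode($errors));
}

// Stub providers: the preferred one is "down".
$providers = [
    'anthropic' => fn (string $p): string => throw new RuntimeException('timeout'),
    'openai'    => fn (string $p): string => "openai: $p",
];

$answer = callWithFailover($providers, 'Summarize this ticket');
```

Collecting every provider's error before giving up matters in production: "all providers failed" with three reasons is debuggable, while a rethrown first exception hides the other two attempts.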
This composition (cheap triage, mid-tier evaluation, expensive reasoning) is not something you arrive at by picking the best model. You arrive at it by treating model selection as a resource allocation problem, the same way you would decide which queries deserve a cache layer and which can hit the database directly.
What conducting actually feels like
The orchestra metaphor has limits, and it is worth being honest about where it stops being useful. A real conductor works with musicians who understand music. Language models do not understand your domain, your users, or the agents sitting next to them in the pipeline. They follow instructions with varying degrees of reliability, and when they fail, they fail silently. A musician who plays a wrong note knows it. A model that produces a confident, well-structured, completely wrong classification does not.
Conducting a multi-agent system is less like leading an ensemble and more like designing a factory floor. You are deciding which machine handles which operation, what quality checks happen between stations, and what happens when a machine produces a defective part. The elegance is in the routing, the tolerances, and the feedback loops, not in any individual station’s performance.
What makes this work in Laravel specifically is that the framework already speaks this language. Pipelines, concurrent execution with Concurrency::run(), service containers, middleware, structured validation. The AI SDK does not introduce a new paradigm. It gives you agent classes that slot into the paradigm you have been using for years. Agents are testable. Tools are injectable. Middleware wraps agent calls the same way it wraps HTTP requests. The gap between “I can build a Laravel app” and “I can orchestrate a multi-agent system” is mostly the realization that you already have the architectural vocabulary for it.
But vocabulary is not fluency. Knowing that Pipeline can chain agents does not tell you how to handle feedback loops between them. Knowing that Concurrency::run() can parallelize agents does not tell you how to synthesize contradictory results from parallel evaluations. The patterns are a starting point. The production system is what you build when the patterns collide with real users, real costs, and real failure modes.
The models are instruments, but the application is the score. What I took away from building this system is that multi-agent orchestration is not an AI problem. It is a systems design problem that happens to use AI as a resource. The SDK makes the mechanics clean. The hard part, the part no framework solves for you, is deciding which agent hears what, when to trust its output, and what to do when it gets things wrong. That is the conductor’s real job, and it is more engineering than it is magic.