Different models have different strengths. Some are generally agreed upon (Claude Sonnet for front-end), but there’s a lot of overlap and subjectivity. Some models are simply overkill for everyday tasks. You don’t need o3 to reason for ten seconds if you just want the internal temperature of a medium-rare steak.

Model providers usually give descriptions of each model’s strengths, but there is a ton of overlap, and they just aren’t that descriptive. For example:

GPT-4o: Fast, intelligent, flexible GPT model

GPT-4.1 mini: Balanced for intelligence, speed, and cost

In a perfect world you’d use the “best” model for each job, but that’s easier said than done. You can lean on benchmark data and wire up if/else logic, but benchmarks often don’t align with user preferences.

In this article, we’ll survey the leading LLM-routing paradigms: task-based, performance-based, and rule-driven. We’ll also take a closer look at Arch-Router, a recent 1.5B-parameter generative approach. Let’s dive in!

There are so many LLMs today from the top labs: OpenAI, Anthropic, Grok, Gemini, and there are only more coming. It seems like there's a new one launching every week, and it's becoming harder and harder to know when to use which model and what for. What are the models actually good at? Benchmarks really only tell you part of the story, and they're notoriously gamed by these companies.

And so it becomes really hard, whether you're an individual or working on a team, to answer questions like: hey, I'm just writing, can I use o3? Should I use 4o, or Anthropic? I'm writing code, what should I use for that? A lot of times I feel like people are overkilling it by using o3 for everything.

And so there's this really cool recent paper called Arch-Router that looks to solve this problem of routing the right query to the right model. It's a really interesting way to do things, and it's based on user preferences. So we're going to jump right in.

First, let's talk about the different ways you can use LLM routers. I'm not talking about services like OpenRouter that just proxy requests to different models (it's a great service); this is really about aligning LLMs to use cases.

And so one way to do that is like an intent embedding-based type of router where you create embeddings for each user message and you run like a semantic similarity vector search against a fixed set of topics. So you see a user message, you do a vector search that says, "Hey, this is a billing one, this is math, this is writing," and then you make your decision based on that to send to a certain model.

So in this case, we're really talking about something like a generative interface. ChatGPT isn't going to integrate something like this, but that's the use case to think of: when I'm sending this query, how do I route it to the best model to get the best output? You could have this in your application as well, depending on what you're building.

This is pretty easy to get up and running, but it requires retraining anytime you're trying to redefine the intents or add new models. And so it's a little bit hard to keep up with.

Another is a cost/performance-based router, where you use benchmark or cost data to train a router that decides, basically: can a cheaper model handle this, or do we need a stronger, more expensive one? That decision could be based on benchmarks, or on cost, or a blend of the two.

This doesn't really account for subjective criteria. A lot of the benchmarks are great, but they don't always align with human preferences, as we see from LMArena and WebDev Arena.

Rule-driven routers: hard-coded if/else logic that maps queries to models in some way that's not a vector search. This is fast because you're just doing it in code, so you get ultra-low latency and transparency, but it's hard to maintain as it scales, with if/else statements scattered throughout the codebase.

The last one, and what Arch-Router does, is basically a preference-aligned router. You write policies in plain English that say, “Hey, for these types of tasks, use this type of model,” and then you use their Arch-Router model to route those tasks. There's a call in between you and the LLM, to Arch-Router, that will basically tell you which type of request this is and then direct it to the right model based on the rules you set.

We're going to look at it more closely, but in short: it's human-interpretable, you can adapt it just by adding a couple lines of text, and it’s really easy to get up and running.

This is a model in and of itself. It's on Hugging Face. It's 1.5 billion parameters. It's a fine-tuned version of Qwen 2.5, and it routes user queries to certain models based on whatever your policies are. No benchmarks, no if/else rules. Each policy is a simple identifier and description tuple. For example, "legal_review" would be the identifier and the description might be "analyze a contract clause," etc.

Then there's a separate lookup table that maps each identifier to a model. So for legal stuff, do GPT-4. This description just helps the model better understand the actual task.

As I mentioned, it's a Qwen model, and it's able to get to pretty high performance. On the benchmark the researchers ran for these classification tasks, the base model was at 20.7%. They fine-tuned it and got it up to 93%, which is basically on par with or better than all the top models at the time.

Something that might jump to mind for you (it definitely did for me) is how much latency this adds. On average, it's about 50 milliseconds. Interestingly, they say Gemini Flash would take 510 milliseconds and Claude around 1,400 milliseconds. So it's super fast; 50 milliseconds is basically unnoticeable.

At its core, here’s how it works: you define your policies (identifier and description), and then you have a lookup table. So when Arch-Router says this is a "code_gen" task, it knows to then route that request to Claude Sonnet or whatever you’ve mapped it to.

Something they note is that this can be used throughout the conversation. You can imagine that when you're chatting back and forth, you're running this every time. You can pass the message history if you want. This means you could switch models mid-conversation automatically. That may not seem relevant, but you can imagine a situation where you go from generating data to analyzing data to writing code about that data. You might want different models for those. So that's another benefit: Arch-Router can change on the fly.

The router prompt has the policies, the conversation history, the new user query, and some extra instructions which you can see below:

You are a helpful assistant designed to find the best route. Here's some information. Here's the conversation. Here are the instructions.

If the latest intent from the user is irrelevant or the user intent is fulfilled, respond with the "other" route. Analyze the route descriptions. Find the best match. Respond with just the route name, because we're going to take that and pass it into our lookup table.

That's it. It takes the user prompt, outputs one of the policy identifiers, and that goes to the lookup table, which sends the request to the relevant LLM based on that identifier.

Since the policy lives in the prompt itself, it’s really easy to update these—especially as new models come out or if your preferences change. If there's any drift or something like that, you just tweak the prompt.

Really fun stuff. It’d be interesting to hack together a project that tests this across conversations and see how it switches. You could imagine setting up an automated system to do this as well. If you had an open-source version of ChatGPT running, you could make sure that anytime a teammate interacts with AI, they're doing it the right way.

We see a lot of times users have a bad experience with AI—maybe because they're using the wrong model or it's overkill for what they’re trying to get done. It can be confusing, right? All the names, the different models—it makes people’s eyes glaze over. It does for me and I’m pretty deep in this world. That’s what Arc Router could help solve as well.

So, a lot of interesting stuff here. I hope you enjoyed this one. We’ll have a bunch of links below to the projects, resources, everything along those lines. If you want to get started with it, try it below. See you.

Different types of LLM routers

Generally speaking, there are four different types of LLM routers.

1. Intent/Embedding-based routers

Create embeddings for each user message and run a semantic similarity vector search against a fixed set of topics (e.g. “billing,” “SQL,” “math”). The closest intent determines which model handles the request.

  • Examples:
    • OrchestraLLM retrieves the k most similar dialogue examples by embedding similarity, then routes by majority vote among expert models
    • Custom in-house pipelines
  • Pros & Cons:
    • ✅ Fast and easy to prototype.
    • ❌ Brittle to topic drift and multi-turn context, and requires retraining whenever you add or redefine intents.
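
To make this concrete, here’s a minimal sketch of an embedding-based intent router (not from any specific paper). It assumes the sentence-transformers library; the intent descriptions and model mapping are purely illustrative:

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# A fixed set of intents, each described by a short reference phrase.
INTENTS = {
    "billing": "questions about invoices, charges, refunds, and payments",
    "sql": "writing or debugging SQL queries against a database",
    "math": "arithmetic, algebra, or other mathematical reasoning",
}

# Which model handles each intent (illustrative choices, not recommendations).
INTENT_TO_MODEL = {
    "billing": "gpt-4o-mini",
    "sql": "claude-sonnet",
    "math": "o3-mini",
}

intent_names = list(INTENTS)
intent_vectors = encoder.encode(list(INTENTS.values()), convert_to_tensor=True)

def route(user_message: str) -> str:
    """Return the model mapped to the intent most similar to the message."""
    query_vector = encoder.encode(user_message, convert_to_tensor=True)
    scores = util.cos_sim(query_vector, intent_vectors)[0]
    return INTENT_TO_MODEL[intent_names[int(scores.argmax())]]

print(route("Why was I charged twice this month?"))  # likely "gpt-4o-mini"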

2. Cost/Performance-based routers

Use benchmark or cost–accuracy data to train a router that decides if a cheaper model works for a query or if it should escalate to a stronger, more expensive one. These routers tend to focus on cutting costs by not using a large model for every task.

  • Examples
  • Pros & Cons:
    • ✅ Optimizes spend versus quality in controlled tasks.
    • ❌ Ignores subjective criteria (tone, style, brand)
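
For illustration, here’s a toy sketch of the cost/performance idea. Real routers of this type learn the escalation decision from benchmark or preference data; the keyword heuristic, threshold, and model names below are stand-ins to show the shape of the decision, not an actual implementation:

# Cheap model handles routine queries; escalate only when a difficulty score
# crosses a threshold. A trained cost/performance router would learn this score
# from data; this keyword heuristic is a placeholder.
CHEAP_MODEL = "gpt-4o-mini"   # illustrative low-cost default
STRONG_MODEL = "o3"           # illustrative expensive escalation target

HARD_SIGNALS = ("prove", "derive", "multi-step", "optimize", "edge case")

def difficulty_score(query: str) -> float:
    """Crude proxy for how hard a query is (0.0 easy, 1.0 hard)."""
    length_factor = min(len(query.split()) / 200, 1.0)
    keyword_factor = sum(word in query.lower() for word in HARD_SIGNALS) / len(HARD_SIGNALS)
    return 0.5 * length_factor + 0.5 * keyword_factor

def route(query: str, escalation_threshold: float = 0.2) -> str:
    return STRONG_MODEL if difficulty_score(query) >= escalation_threshold else CHEAP_MODEL

print(route("What's the capital of France?"))                 # -> gpt-4o-mini
print(route("Prove the algorithm handles every edge case."))  # -> o3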

3. Rule-driven Routers

Hard-coded if/else logic that maps queries to models in some way.

  • Example
    • Custom implementation
  • Pros & Cons
    • ✅ Ultra-low latency and full transparency
    • ❌ Maintenance nightmare at scale as use cases and models expand
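
A rule-driven router can be as simple as a function full of branches. This sketch is illustrative; the patterns and model names are made up:

import re

def route(query: str) -> str:
    """Map a query to a model with plain if/else rules (illustrative names)."""
    q = query.lower()
    if re.search(r"\b(code|function|stack trace|compile|bug)\b", q):
        return "claude-sonnet"   # code-related queries
    if re.search(r"\b(summarize|tl;dr|shorten)\b", q):
        return "gpt-4o-mini"     # cheap summarization
    if re.search(r"\b(contract|clause|liability)\b", q):
        return "gpt-4o"          # legal-ish review
    return "gpt-4o"              # default fallback

print(route("Summarize this meeting transcript"))  # -> gpt-4o-mini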

4. Preference-aligned routers

This one will be the major topic of today, as this is what Arch-Router does. Users write route policies in plain English, paired with model choices. A small LLM ingests these policies and the user message(s) and returns the policy that best matches each query.

  • Examples:
    • Arch-Router
  • Pros & Cons
    • ✅ Human-interpretable, adapts immediately to new policies without retraining, and handles multi-turn drift gracefully.
    • ⚠️ Requires a lightweight generative model and well-crafted policy descriptions to work effectively.

Introducing Arch-Router

Arch-Router is a lightweight, 1.5B-parameter model that routes user queries to user-defined models by following plain-English route policies, rather than benchmarks or if/else rules. Each policy is a simple (identifier, description) tuple, for example ("legal_review", "Analyze a contract clause…"). A separate lookup table maps each identifier to its chosen model (e.g. legal_review → GPT-4o-mini). More examples below.

Under the hood, Arch-Router is a fine-tuned Qwen 2.5 (1.5B) model. After training on a mix of clean and noisy policy–dialogue data, its routing accuracy jumps from about 20.7% off-the-shelf to over 93%.

[Table: routing accuracy results across models]

Since Arch-Router is so small and fine-tuned for the task, it only adds ~50ms of latency, on average. The next-fastest commercial router (Gemini-2.0-flash-lite) takes about 510 ± 82 ms, and Claude-sonnet-3.7 takes 1,450 ± 385 ms.

[Table: routing latency across models]

How Arch-Router works

At its core, Arch-Router routes queries to a given LLM by following human-written policies. Here’s how it works, step-by-step:

Define policies

A really simple document that lays out a set of (identifier, description) tuples and the related models:

C = {
  ("code_gen", "Generate code snippets or boilerplate."),
  ("summarize", "Produce a concise summary of this text."),
  ("hotel_search", "Find and recommend hotels in a given city."),
  ("default", "Fallback for any other queries.")
}
T(code_gen)     = Claude-sonnet-3.7
T(summarize)    = GPT-4o
T(hotel_search) = Gemma-3
T(default)      = Qwen2.5-4B

Compose the router prompt

  • The policies
  • The conversation history
  • The new user query + some extra instructions (see prompt below)

You are a helpful assistant designed to find the best suited route.
You are provided with route description within <routes></routes> XML tags:

<routes>
\n{routes}\n
</routes>

<conversation>
\n{conversation}\n
</conversation>

Your task is to decide which route is best suit with user intent on the conversation in <conversation></conversation> XML tags.
Follow the instruction:
1. If the latest intent from user is irrelevant or user intent is full filled, respond with other route {"route": "other"}.
2. Analyze the route descriptions and find the best match route for user latest intent.
3. Respond only with the route name that best matches the user’s request, using the exact name in the <routes> block.

Based on your analysis, provide your response in the following JSON format if you decide to match any route:

{"route": "route_name"}

Generate and dispatch

Arch-Router ingests the user’s prompt and outputs one of the policy identifiers (e.g. code_gen). That identifier is handed off to a mapping function which looks up the LLM you mapped to that given policy, and the request is sent.

Since the policy lives in the prompt itself, it is really easy to add, remove, and edit routes and models.
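
Putting the pieces together, here’s a hedged end-to-end sketch of what dispatch could look like. It assumes the Hugging Face checkpoint (katanemo/Arch-Router-1.5B) is served behind an OpenAI-compatible endpoint (for example, via vLLM) at a local URL, and it uses a condensed version of the router prompt shown above; the policy names, model choices, and endpoint are all illustrative:

import json
from openai import OpenAI

# Hypothetical local endpoint serving the router model (e.g. vLLM).
router_client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# (identifier, description) policies, mirroring the example above.
ROUTES = {
    "code_gen": "Generate code snippets or boilerplate.",
    "summarize": "Produce a concise summary of this text.",
    "hotel_search": "Find and recommend hotels in a given city.",
    "default": "Fallback for any other queries.",
}

# Lookup table: policy identifier -> model that should handle the request.
MODEL_MAP = {
    "code_gen": "claude-sonnet-3.7",
    "summarize": "gpt-4o",
    "hotel_search": "gemma-3",
    "default": "qwen2.5",
}

# Condensed version of the router prompt shown earlier (not the verbatim one).
ROUTER_PROMPT = """You are a helpful assistant designed to find the best suited route.
You are provided with route description within <routes></routes> XML tags:
<routes>
{routes}
</routes>

<conversation>
{conversation}
</conversation>

Decide which route best matches the user's latest intent and respond only with
JSON in the form {{"route": "route_name"}}. If nothing matches, respond with
{{"route": "other"}}."""

def pick_model(conversation: str) -> str:
    prompt = ROUTER_PROMPT.format(
        routes=json.dumps([{"name": n, "description": d} for n, d in ROUTES.items()]),
        conversation=conversation,
    )
    response = router_client.chat.completions.create(
        model="katanemo/Arch-Router-1.5B",  # assumed Hugging Face model ID
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    route = json.loads(response.choices[0].message.content)["route"]
    return MODEL_MAP.get(route, MODEL_MAP["default"])

# If the router answers {"route": "code_gen"}, this returns "claude-sonnet-3.7".
print(pick_model("user: Write a Python function that deduplicates a list"))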

Conclusion

The problem of using the right model for the right job isn’t going anywhere, and will probably only get more confusing in the future. While it isn’t possible to implement Arch-Router directly into Claude or ChatGPT, maybe in the future there will be easier ways to set up smarter routing for whole organizations.

Arch-Router seems to be the best router I’ve seen, given how flexible it is, how aligned it is with actual human preferences, and how minimal the latency hit is.

Dan Cleary
Founder