Your Prompt Router Is a Q-Function With Hardcoded Weights
Five places reinforcement learning quietly eats standard AI setups.
A few days ago I wrote about how reinforcement learning feels bigger than just games. Then I looked at the code for an AI routing engine I built, and I felt genuinely embarrassed. I had written a chunk of code that assigned fixed weights to ranking signals: PageRank gets 0.15, keyword match gets 0.30, and so on.
This code was the heart of how the system decided which AI persona to show a user. Where did those numbers come from? My head. On a Tuesday. While drinking coffee. That code is deciding what every single user sees.
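To make that concrete, here is a minimal sketch of what a hand-weighted scorer of this kind can look like, next to a small update step that nudges the weights from click feedback. This is not the actual routing engine: only the 0.15 and 0.30 values come from the post, while the `recency` feature, the function names, and the learning rate are assumptions made for illustration. The update is a plain logistic-regression gradient step, chosen here because it is one of the simplest ways to learn linear weights online.

```python
import math

# Hand-picked weights. Only the first two values come from the post;
# "recency" and its weight are hypothetical placeholders.
WEIGHTS = {"pagerank": 0.15, "keyword_match": 0.30, "recency": 0.10}


def score(features, weights=WEIGHTS):
    """Linear score: sum of weight * feature value over known features."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


def update(weights, features, clicked, lr=0.1):
    """One logistic-regression SGD step: a click is label 1, a skip is 0.

    Returns a new weights dict and leaves the input untouched.
    """
    predicted = 1.0 / (1.0 + math.exp(-score(features, weights)))
    error = (1.0 if clicked else 0.0) - predicted
    return {
        name: w + lr * error * features.get(name, 0.0)
        for name, w in weights.items()
    }
```

Calling `update(WEIGHTS, features, clicked=True)` after each impression would slowly move the weights toward whatever actually predicts clicks, instead of leaving them at the guessed values.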
It is basically a Q-function, the scoring function at the core of reinforcement learning, but with numbers I just guessed. The moment you notice that, the whole conversation changes. You aren't adding a massive new AI system. You are just upgrading a function you already wrote.
The standard pattern
Most smart assistants today follow a fixed path. They figure out what you want, route it, pick a tool, and generate a response. Every step is hard-wired. New behavior means writing new code.
Figuring out what a user means is mostly a solved problem now. The really interesting decisions happen between those steps. Which tool should it call first? Should it answer right away or ask a follow-up? Those are policy problems. And reinforcement learning is built for policy problems.
Five places you can use this right now
Learn the ranking weights from real usage. Replace that list of guessed numbers with a simple learning model. Look at whether the user actually clicked the top result. It is very little code and learns directly from live traffic.
Personality selector. My system has different voices, like a teacher or a code reviewer. You can use the same simple model here. The reward is whether the user kept chatting or if they had to rephrase their question right away. The model can learn that frustrated users want short answers without anyone programming that specific rule.
Response style. Should the assistant be brief or walk the user through it? That instinct can be learned by rewarding the system when problems are solved in fewer turns.
Make the workflow adapt. Track what users actually do after a given step and update the flow based on that. The workflow improves quietly in the background.
Tool router. When the bot has three possible tools it could use, which one does it try first? Reward it when it finds the right answer in fewer tries.
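Several of the items above (persona, response style, tool choice) share one shape: pick one option out of a few, observe a reward, adjust. A minimal sketch of that shape is an epsilon-greedy bandit keyed by a coarse context. The class name, the context labels, and the persona names below are hypothetical, and epsilon-greedy is just one simple choice of learner, not necessarily what the author used.

```python
import random
from collections import defaultdict


class EpsilonGreedyBandit:
    """Tracks the mean reward of each (context, action) pair."""

    def __init__(self, actions, epsilon=0.1, seed=None):
        self.actions = list(actions)
        self.epsilon = epsilon            # fraction of choices spent exploring
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)    # (context, action) -> times tried
        self.values = defaultdict(float)  # (context, action) -> mean reward

    def choose(self, context):
        """Explore with probability epsilon, otherwise pick the best so far."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.values[(context, a)])

    def learn(self, context, action, reward):
        """Incremental running-mean update for the chosen pair."""
        key = (context, action)
        self.counts[key] += 1
        self.values[key] += (reward - self.values[key]) / self.counts[key]


# Hypothetical usage: personas as actions, user mood as context.
bandit = EpsilonGreedyBandit(["teacher", "code_reviewer", "concise"], epsilon=0.1)
persona = bandit.choose("frustrated")
bandit.learn("frustrated", persona, reward=1.0)  # e.g. the user kept chatting
```

The same class would work for the tool router by swapping the action list for tool names and rewarding answers found in fewer tries.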
Notice what is not on this list. No crazy deep learning. No supercomputers. No research team. You only need one thing you probably already have: a log of what the user did next. That is the whole trick.
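As a rough illustration of pulling a reward out of such a log, the sketch below pairs each system decision with the user event that immediately followed it. The log schema (`type`, `state`, `action`, `name`), the event names, and the reward values are all assumptions for the example; a real log will look different and the rewards would need tuning.

```python
# Hypothetical reward values for the user's next action.
REWARDS = {
    "clicked_top_result": 1.0,
    "continued_chat": 0.5,
    "rephrased": -0.5,
    "abandoned": -1.0,
}


def reward_from_next_event(event_name):
    """Map what the user did next to a scalar reward (0.0 if unknown)."""
    return REWARDS.get(event_name, 0.0)


def to_training_tuples(log):
    """Turn a time-ordered log into (state, action, reward) tuples.

    A decision only yields a tuple when the very next entry is a user event.
    """
    tuples = []
    for entry, following in zip(log, log[1:]):
        if entry["type"] == "decision" and following["type"] == "user_event":
            reward = reward_from_next_event(following["name"])
            tuples.append((entry["state"], entry["action"], reward))
    return tuples


# Hypothetical log excerpt.
log = [
    {"type": "decision", "state": "frustrated", "action": "concise"},
    {"type": "user_event", "name": "continued_chat"},
    {"type": "decision", "state": "neutral", "action": "teacher"},
    {"type": "user_event", "name": "rephrased"},
]
training_data = to_training_tuples(log)
```

Each resulting tuple is exactly the state, action, and reward that a simple learner needs as input.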
The state is what is happening in the chat. The action is the choice the system makes. The reward is hiding in the next thing the user does in your log file.
The point that surprised me most
I started out wondering how to add reinforcement learning to my system. I came out realizing the system was already built for it. The states and actions were already there. The reward signals like usage counts and success rates were already in the data. I just wasn't using them to teach the system anything. It is the same realization I had watching Pac-Man. The system was always making decisions. The only question is whether those decisions get better over time, or if they stay frozen on the numbers you guessed on a Tuesday.
You don't need a research team. You just need to save the reward signal.