Making an AI Play Pac-Man (Because a Web App Was Too Mainstream)
Picture this: it is 2020, final semester of engineering. Half my batch is churning out library management web apps, and the other half is copy-pasting chat applications, the safe moves you make when you just want to secure the grade and graduate.
And then there is me. I decided to build a Reinforcement Learning (RL) agent all alone, completely oblivious to whether my laptop or my sanity could actually handle it.
I had been obsessing over OpenAI for a while before this. The one-legged hopper videos. The cheetah running in the physics simulator. Reading those early papers, you could practically feel them figuring things out in public. Something about that felt deeply relatable. The uncertainty, the endless iteration, and that magical moment when something just works and you do not fully know why yet.
So, I pitched RL for my final year project. Specifically, Deep Q-Networks (DQN). Specifically, Ms. Pac-Man.
Google Colab kept freezing on me, so I eventually moved to Paperspace. I used a YouTube TensorFlow tutorial as my holy grail and spent weeks just staring at the code before I dared to change a single line. Not out of discipline, mind you, but because I was terrified of breaking something I did not entirely understand yet.
But by the end of it, I understood way more than I expected. This is me documenting exactly how that happened.
First off, Why Reinforcement Learning?
Most of us wrap our heads around Supervised Learning first. You hand the model an input (X) and an answer (Y), and it learns the mapping. Simple enough.
But what if you do not have Y? What if nobody on earth can tell you the right answer, and you just have to figure it out by messing around and finding out?
That is Reinforcement Learning.
Supervised: Given X and Y, learn to map X to Y.
Unsupervised: Given only X, find some hidden pattern.
Reinforcement: Given X and a score Z, figure out Y yourself.
For a game like Ms. Pac-Man, nobody is sitting there labeling, "hey, in this exact pixel configuration, go left." The agent has to learn by playing. It dies, it scores a point, it dies again, and very slowly figures out what keeps it alive. Honestly, that felt way more compelling to me than another classification project.
What the Agent Actually Sees
The game runs at 60 FPS in full RGB. That is an overwhelming amount of data to process. So, the first step is making it digestible.
We convert each frame to grayscale. We go from three color channels down to one, which saves memory and slashes complexity. But a single grayscale image has one fatal flaw: it has no concept of time or movement. If a ghost is barreling toward you or running away, a single frozen frame looks exactly the same.
To fix this, we stack 3 consecutive frames together. By looking at the difference between them, the agent suddenly understands motion. This stacked chunk is the state and it is the actual input that gets fed into the neural network.
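The preprocessing above can be sketched in a few lines of NumPy. The helper names here are mine, and real Atari pipelines usually also crop and downscale the frame, which I skip to keep the idea visible:

```python
import numpy as np
from collections import deque

def to_grayscale(frame_rgb):
    """Collapse an RGB Atari frame (210x160x3) to one channel
    using the standard luminance weights."""
    return np.dot(frame_rgb[..., :3], [0.299, 0.587, 0.114]).astype(np.uint8)

class FrameStacker:
    """Keep the last `depth` grayscale frames and stack them along
    the channel axis, so the network can infer motion from the
    differences between frames."""
    def __init__(self, depth=3):
        self.frames = deque(maxlen=depth)

    def reset(self, first_frame):
        # At episode start, fill the stack with copies of frame one.
        gray = to_grayscale(first_frame)
        for _ in range(self.frames.maxlen):
            self.frames.append(gray)
        return self.state()

    def step(self, frame):
        self.frames.append(to_grayscale(frame))
        return self.state()

    def state(self):
        # Shape (210, 160, depth): this is what the CNN actually sees.
        return np.stack(self.frames, axis=-1)
```

The deque with `maxlen` does the bookkeeping for free: pushing a new frame silently drops the oldest one.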
Game frame (RGB 210x160x3) -> Grayscale -> Stack 3 frames -> CNN (16, 32, 64 filters) -> Flatten -> Dense layers -> Q-values
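Here is roughly what that pipeline looks like as a Keras model. The filter counts (16, 32, 64) match the text; the kernel sizes, strides, and dense width are my assumptions, not the exact tutorial values:

```python
import tensorflow as tf

def build_q_network(n_actions=9, stack_depth=3):
    """A DQN head matching the pipeline above: three conv layers
    (16 -> 32 -> 64 filters), flatten, dense, one Q-value per action.
    Kernel sizes and strides are illustrative assumptions."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(210, 160, stack_depth)),
        tf.keras.layers.Rescaling(1.0 / 255),  # raw pixels -> [0, 1]
        tf.keras.layers.Conv2D(16, 8, strides=4, activation="relu"),
        tf.keras.layers.Conv2D(32, 4, strides=2, activation="relu"),
        tf.keras.layers.Conv2D(64, 3, strides=1, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(n_actions),  # linear output: one Q-value per action
    ])
```

Note the last layer has no activation: Q-values are unbounded estimates of future score, not probabilities.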
The Reward Problem: The Law of Immortality
Here is a fun little quirk nobody warns you about when you start with RL.
If you just tell the agent to add up all its rewards forever, the sum has no reason to stop growing:

R = r1 + r2 + r3 + ... (for an agent that never dies, this runs off to infinity)
I started calling this the Law of Immortality. The agent figures out that simply staying alive and collecting dots forever yields infinite points. So, it never takes risks. It just cowers in a corner, trying to survive at all costs, even when hiding is objectively the wrong way to win the game.
The fix? Discounted rewards. We have to teach the agent that future rewards are not worth as much as immediate ones. We multiply future rewards by a discount factor, gamma, which is just slightly less than 1.
I ended up tuning my gamma to 0.9775. At that rate, a reward 3 steps in the future is worth about 0.934, and 6 steps out, it drops to 0.872. The agent still cares about the future, but the near future matters more. It starts taking calculated risks instead of just hiding. And why 0.9775 specifically? Because the tutorial default did not work, I messed around until it did, and I am honest enough to admit it.
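You can sanity-check that arithmetic in a couple of lines. The 1 / (1 - gamma) "effective horizon" interpretation is a standard rule of thumb, not something from the original tutorial:

```python
# With gamma = 0.9775, a reward k steps in the future is scaled by gamma**k.
gamma = 0.9775
print(round(gamma**3, 3))  # 0.934: a reward 3 steps out
print(round(gamma**6, 3))  # 0.872: a reward 6 steps out

# The same geometric decay keeps the total discounted return finite:
# the sum of gamma**t over all t >= 0 is 1 / (1 - gamma).
horizon = 1 / (1 - gamma)
print(round(horizon, 1))  # 44.4: roughly how many steps ahead the agent "cares"
```

That finite sum is exactly what kills the Law of Immortality: cowering in a corner forever is no longer worth infinite points.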
Q-Values: The Agent's Internal Monologue
At every single frame, the agent looks at the board and asks itself: If I take this specific action right now, how much total reward am I going to get eventually? These estimates are Q-values.
Ms. Pac-Man has 9 possible actions: NOOP, UP, RIGHT, LEFT, DOWN, and the diagonals. At one point during training, the agent's internal math looked like this:
NOOP: 3.929
UP: 3.892
RIGHT: 4.136
LEFT: 4.221
DOWN: 3.191
UPRIGHT: 4.634
UPLEFT: 7.977 (Action Taken)
DOWNRIGHT: 3.414
It chose UPLEFT because thousands of brutal, ghost-related deaths had taught it that going UPLEFT here leads to the biggest long-term payout.
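That decision rule is just an argmax over the logged snapshot (eight of the nine actions appear in it):

```python
import numpy as np

# The Q-value snapshot from the text: one estimate per action.
actions = ["NOOP", "UP", "RIGHT", "LEFT", "DOWN",
           "UPRIGHT", "UPLEFT", "DOWNRIGHT"]
q_values = np.array([3.929, 3.892, 4.136, 4.221, 3.191,
                     4.634, 7.977, 3.414])

# Greedy policy: take the action with the highest estimated return.
best = actions[int(np.argmax(q_values))]
print(best)  # UPLEFT
```

During training you would not always act greedily (DQN mixes in random exploration), but at evaluation time this one line is the whole policy.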
Three frames later, in roughly the same situation, the Q-value for UPLEFT dropped to 6.595. That is the discount factor doing its job. It is the exact same direction, but the agent is slightly less confident because we are a few steps further removed from the original decision.
This is the Bellman equation at work:

Q(s, a) = r + gamma * max over a' of Q(s', a')

In words: the value of taking action a in state s is the reward you collect right now, plus the discounted value of the best action available in the next state.
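A minimal sketch of that one-step target in code. The function name and the example reward value are mine:

```python
import numpy as np

def bellman_target(reward, next_q_values, gamma=0.9775, done=False):
    """One-step TD target for DQN: immediate reward plus the
    discounted value of the best action in the next state.
    If the episode just ended, there is no future to discount."""
    if done:
        return reward
    return reward + gamma * float(np.max(next_q_values))

# Example: eat something worth 10 points, then land in a state
# whose best action is estimated at 7.977.
target = bellman_target(10.0, np.array([3.1, 7.977, 4.2]))
```

During training, the network's current guess Q(s, a) is nudged toward this target, which is how thousands of ghost-related deaths slowly turn into sensible Q-values.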
Looking Inside the Brain: What the CNN Learned
My absolute favorite part of this whole project was visualizing the CNN filters. You spend hours training this black box, and then you finally peek inside to see what it actually cares about.
Layer 1 (16 filters): Super basic. It picked up the maze walls and hard edges. A lot of the filters were just black, meaning the game visuals did not trigger them at all.
Layer 2 (32 filters): This is where it got cool. You could clearly see movement traces. The network was actively learning to track the ghosts and Ms. Pac-Man.
Layer 3 (64 filters): Mostly dark and highly sparse. This is a great sign because it means the deeper layers were only reacting to very specific, complex combinations of features.
It proved the network was not just blindly memorizing pixels. It was actually learning the spatial awareness needed to play.
The Grind: 3500 Episodes Later
I let it run for 3500 episodes. The mean Q-value started near zero and just kept steadily climbing, crossing 1.75 by the end.
That upward curve is the exact dopamine hit you are looking for in RL. It proves the agent's estimates of the future are getting sharper. Did it play perfectly? No. But it played deliberately. It dodged ghosts, hunted power pellets, and made logical choices. For a single-GPU student project, I was thrilled.
I heavily modified the original Breakout tutorial to make this work:
Action Space: Expanded from 3 actions to 9.
Reward Shaping: Completely overhauled the reward structure to account for eating ghosts vs. dots vs. power pellet timers.
Hyperparameters: Tuned the discount factor from 0.97 to 0.9775.
Vision: Swapped the standard 4-frame stack for a 3-frame stack with custom motion tracing.
None of this was random. It was a loop of testing, failing, figuring out why it failed, and tweaking the code.
Why I Keep Coming Back to This
Honestly, I just find Reinforcement Learning beautiful.
Not beautiful in the way clean code or a minimalist UI is. It is beautiful because of the philosophy behind it. An agent drops into a world knowing absolutely nothing, flails around randomly, gets tiny signals back from the universe, and slowly, painfully slowly, learns how to survive.
It is exactly how we learn, is it not? You try something, life hands you a consequence, and you update your internal policy. Reward, penalty, repeat.
Even the discount factor feels human. We are incredibly impatient creatures who prefer a good thing right now over a great thing next week. The Bellman equation is just dynamic programming with human impatience mathematically baked in.
Sometimes I wonder if we are all just running some messy biological version of this. Optimizing a hidden reward function we do not fully understand, driven by a discount factor that makes us prioritize today over tomorrow.
Maybe that is a bit too philosophical for a technical write-up. But those were the thoughts keeping me sane while my Colab instances crashed and my agent died in the first 3 seconds of every single run.
RL changed how I view learning. It is not just absorbing facts. It is updating a policy based on friction. Every mistake is a reward signal. Every win is your Q-value going up.
Final year project, 2020. DQN on Ms. Pac-Man. Trained on Paperspace. It was fun, and I enjoyed it the most.
Bonus: Hallucinating with DeepDream. This project sparked a further interest, and we went on to implement Google's DeepDream paper (with help from YouTube).