Games: The Missing Modality for AGI
The curve that felt inexorable is bending. For years, each frontier model arrived with a bigger leap than the last; now the leaps feel smaller. GPT-5, great in its own right, didn’t land with the shockwave GPT-4 did. The Bitter Lesson tells us compute is the answer, but what do you spend that compute on once you’ve exhausted all the data?
The last few years in AI were extraordinary. RLHF was the unlock that turned stochastic parrots into helpful, preference-aligned assistants. Companies like Scale AI and Surge AI supplied the human data that fueled this growth. Even if progress were to stop here, there’s immense scope for applications across industries that would greatly improve global quality of life. But we’re not there yet. There are proven methods to take us from today’s capabilities to AGI.
A growing, almost retrospective consensus is forming: to approach general intelligence, models must go beyond static RLHF and practice in dynamic tasks that mirror real-world decision-making. Hence the shift toward Reinforcement Fine-Tuning (RFT) and pure RL after pre-training. DeepSeek‑R1, one of the first language models to be post-trained purely with RL (no supervised data), saw a 56-percentage-point improvement on the AIME‑24 benchmark. OpenAI’s o3 model, trained to reason through RL, outperformed both GPT-4.1 and GPT-4.5 (released around the same time) on benchmarks as well as in the court of public opinion. RFT is qualitatively different from RLHF: instead of optimizing for human approval, an agent learns by acting and receiving rewards from the environment. Over time, that loop shapes agency, strategy, tool use, and meta-skills that static supervision rarely imparts.
The intuition is simple: if we want long-horizon planning, negotiation and collaboration with other agents, and goal-pursuit under open-ended rules and scarcity, models need experience, not just labeled data.
Games as Interactive Environments
Current models infamously lack abilities, such as theory of mind and coordination, that are crucial for a truly general intelligent system. Reward-optimized games as interactive RL environments are the best classrooms for these cognitive skills: games compress complex patterns into learnable challenges with clear rewards and curricula. Chess and Go, for example, provide well-defined arenas for long-term planning; an RL agent playing them gets a single win/lose reward at the end of the game, forcing it to plan many moves ahead. Similarly, multi-player cooperative games explicitly test and hone the ability to reason about others’ beliefs and intentions. Hanabi is a prominent example in multi-agent RL: success requires modeling teammates’ hidden knowledge and communicating through hints. RL in such settings forces an agent to infer what its partners know or intend, building a rudimentary theory of mind.
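To make that reward structure concrete, here is a minimal sketch of chess wrapped as an RL environment. It assumes the open-source python-chess package; the `ChessEnv` class and its interface are illustrative, not any particular framework’s API. The only reward is the terminal win/lose signal, which is exactly what forces long-horizon planning:

```python
import random
import chess  # assumption: the python-chess package is installed

class ChessEnv:
    """Illustrative wrapper: a full game whose only reward is the final result."""

    def reset(self):
        self.board = chess.Board()
        return self.board.fen()                    # observation: full board state

    def step(self, move_uci: str):
        self.board.push(chess.Move.from_uci(move_uci))
        done = self.board.is_game_over()
        if not done:
            return self.board.fen(), 0.0, done     # sparse reward: nothing mid-game
        result = self.board.result()               # "1-0", "0-1", or "1/2-1/2"
        reward = {"1-0": 1.0, "0-1": -1.0}.get(result, 0.0)
        return self.board.fen(), reward, done      # scored from White's perspective

# A random-policy rollout: every decision must ultimately be justified by a
# single +1 / -1 / 0 signal delivered dozens of moves later.
env, done = ChessEnv(), False
obs = env.reset()
while not done:
    move = random.choice(list(env.board.legal_moves)).uci()
    obs, reward, done = env.step(move)
print("terminal reward:", reward)
```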
Multi-agent games can also teach coordination and emergent tool use. OpenAI’s hide-and-seek experiment is striking: teams of agents in a physics world learned, via self-play, to barricade with boxes, counter with ramps, and iterate through six distinct strategy phases, behaviors the game’s creators never designed. This kind of co-adaptive environment drives agents to outthink each other, hinting at competition and cooperation, the building blocks of social intelligence. In business terms, similar simulations could train AI assistants to coordinate across complex projects, with multiple agents managing different parts and learning when to assist or delegate.
Crucially, games aren’t just training grounds; they’re becoming the benchmarks for “true” model intelligence. Static benchmarks (AIME, LiveBench, etc.) are useful performance snapshots but poor measures of general competence, and they are increasingly overfit. Games test for a general understanding of the broader world and score performance objectively, with auditable logs and replays and no human intervention. They also scale automatically with the capabilities of the systems being tested, since only the best models are pitted against each other.
Recent benchmarks like TextQuests stress intrinsic long-horizon reasoning by forbidding tools: they drop agents into 25 classic text-adventure games to examine their exploratory competence within a single session. TALES (Microsoft), meanwhile, unifies multiple text-adventure frameworks and highlights a persistent gap: even top models score roughly at or below 12% on games designed for human enjoyment. At the frontier, Kaggle Game Arena (Google) pits leading models against each other in chess to evaluate planning, adaptation, and competitive reasoning. Finally, there’s Among AIs, a new benchmark created by 4Wall AI to measure models’ negotiation, cooperation, and deception skills; it launches in August 2025 as a live arena where frontier agents complete tasks and vote out the impostor. While the existing static evals are starting to saturate, the game-based ones underscore how far agents remain from robust sequential reasoning under partial observability and delayed reward.
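As a sketch of why these game evals need no human grader, the snippet below reuses the illustrative `ChessEnv` above and runs a head-to-head match in which the environment itself scores the outcome and every move is written to an auditable replay. The `Policy` type and the JSON log format are placeholders, not any benchmark’s real API:

```python
import json
import random
from typing import Callable, List

# A "policy" maps a board-state string and the legal moves to a chosen move.
# In a real eval these callables would wrap frontier-model API calls.
Policy = Callable[[str, List[str]], str]

def play_match(env, white: Policy, black: Policy, log_path: str) -> float:
    """Play one game and write an auditable replay. Returns White's score."""
    obs, done, reward, replay = env.reset(), False, 0.0, []
    while not done:
        legal = [m.uci() for m in env.board.legal_moves]
        mover = white if env.board.turn else black   # python-chess: True means White to move
        move = mover(obs, legal)
        replay.append({"state": obs, "move": move})  # every decision is logged
        obs, reward, done = env.step(move)
    with open(log_path, "w") as f:
        json.dump({"replay": replay, "result": reward}, f)
    return reward

# Two stand-in "models" (random players) competing; no human grader involved.
random_player: Policy = lambda obs, legal: random.choice(legal)
print(play_match(ChessEnv(), random_player, random_player, "game_001.json"))
```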
Through game-based RL training and evals, models both acquire and are tested on capabilities like planning, self-correction over long tasks, perspective-taking, and teamwork, none of which static corpora or instruction tuning teach. Simulated environments offer a safe sandbox to impart these lessons and a rigorous scoreboard to tell whether we’re actually getting smarter.
Human Data and Self-Play Environments
Pure self-play RL can yield superhuman strategies, but started tabula rasa it risks “alien” behaviors or reward hacking, where an AI finds loopholes to maximize reward in unintended ways (e.g., modifying tests to “pass” rather than improving the underlying solution). Human examples define a reference path the AI can mimic initially, making subsequent RL far less likely to veer into bizarre solutions. Introducing human gameplay data is critical to align behavior with human norms, bootstrap complex skills, and jumpstart the policy before RL optimization.
In 2019, DeepMind trained an agent called AlphaStar to play StarCraft II. Training it first through imitation learning on human game data let it learn the basic micro- and macro-strategies used by professional StarCraft players; the resulting agent defeated the built-in “Elite” AI in 95% of games. That policy then seeded a multi-agent RL process, after which AlphaStar decisively beat Team Liquid’s Grzegorz “MaNa” Komincz, one of the world’s strongest professional StarCraft players, 5-0. Another landmark multi-agent RL experiment is Meta’s CICERO agent, built to play Diplomacy. CICERO combined a language model with strategic-planning RL: it was trained first on human gameplay data (over 13 million human dialogue messages) to learn human-style interaction, then fine-tuned with self-play RL to plan winning moves. The resulting agent conversed naturally and reached top-10% performance in human Diplomacy leagues. Without human grounding, a pure RL agent would likely have adopted undesirable strategies such as gibberish communication and pathological betrayal, rendering it unusable in games with humans.
In general, human trajectories prevent agents from getting stuck in odd corners of the state space and mitigate reward hacking. Humans in the loop set the alignment anchor before the model is allowed to optimize on its own. Commercially, jumpstarting the policy with human gameplay data cuts training time and compute cost; starting near human level means RL doesn’t waste millions of steps rediscovering basics.
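In schematic terms, the recipe behind AlphaStar and CICERO is a two-stage loop: supervised imitation on human trajectories first, then RL from that warm start. The PyTorch sketch below shows the pattern with toy placeholder data and dynamics; it is not either system’s actual training code, and the dimensions, losses, and “environment” are all stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins, not AlphaStar/CICERO internals: a 16-dim state, 4 discrete
# actions, and synthetic "human" trajectories.
STATE_DIM, N_ACTIONS = 16, 4
policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# ---- Stage 1: behavior cloning on human gameplay data ----------------------
# Imitating human trajectories anchors the policy near sensible play.
human_states = torch.randn(1024, STATE_DIM)              # placeholder human data
human_actions = torch.randint(0, N_ACTIONS, (1024,))
for _ in range(200):
    loss = F.cross_entropy(policy(human_states), human_actions)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

# ---- Stage 2: RL fine-tuning (REINFORCE) from the warm start ---------------
def rollout(horizon: int = 20):
    """Toy episode with placeholder dynamics and a sparse terminal reward."""
    state, log_probs = torch.randn(STATE_DIM), []
    for _ in range(horizon):
        dist = torch.distributions.Categorical(logits=policy(state))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state = torch.randn(STATE_DIM)                    # placeholder transition
    reward = torch.randint(0, 2, ()).float() * 2 - 1      # placeholder win (+1) / loss (-1)
    return torch.stack(log_probs), reward

for _ in range(200):
    log_probs, reward = rollout()
    loss = -(log_probs.sum() * reward)                    # reinforce winning episodes
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```

The key design choice is simply the ordering: the policy never starts RL from scratch, so optimization explores around human play rather than the whole space of degenerate strategies.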
4Wall AI: Playgrounds for AGI
Human data-labeling platforms aren’t a complete solution for pure RL in interactive environments. They’re great at powering RLHF and alignment through static datasets and feedback, but running simulated environments where an AI agent takes sequential actions is a different domain. Traditional data-labeling firms don’t ship multiplayer simulations or web-browsing sandboxes with humans in the loop; scaling that requires game development, real-time systems, and RL engineering: 4Wall AI’s core strengths.
4Wall AI is building specialized language-driven games as interactive RL environments where humans play and AI agents learn. Domain experts, our human users, seed the worlds by playing and sparring with the agents in-game, generating dense trajectories. Labs jumpstart the policy with these game trajectories and improve their models through self-play in the accompanying environments.
Our team developed an AI-native game engine: a multi-agent framework that orchestrates multiple humans and AI agents over shared state in real time. This framework already powers Spot, a personalized virtual world on our consumer platform where 100k+ users create characters with complex backstories and immersive worlds rich in semantic data. The engine lets us scale environment creation by hot-swapping rules, contexts, and reward functions; we can spin up new simulations and evals in days (not quarters) and use real human play to shape rewards. As the platform and our customers’ needs evolve, we can grow beyond the Spot framework into new, unexplored domains.
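Purely as an illustration of what hot-swapping rules, contexts, and reward functions can look like (hypothetical types and names, not 4Wall’s engine or API), an environment spec can be parameterized by pluggable callables so that a new simulation variant is a configuration change rather than a rewrite:

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical types for illustration only; not 4Wall's actual engine API.
State = Dict[str, object]            # shared world state visible to all participants
Action = Dict[str, object]           # one participant's action for the current tick

@dataclass
class EnvSpec:
    context: str                                         # scenario / backstory fed to agents
    rules: Callable[[State, Dict[str, Action]], State]   # how joint actions update the world
    reward: Callable[[State, str], float]                # per-participant score for a state

def run_episode(spec: EnvSpec,
                participants: Dict[str, Callable[[State, str], Action]],
                ticks: int = 10) -> Dict[str, float]:
    """Advance a shared world: each tick, every human or agent acts, the pluggable
    rules update the state, and the pluggable reward scores each participant."""
    state: State = {"context": spec.context, "tick": 0}
    totals = {name: 0.0 for name in participants}
    for _ in range(ticks):
        joint = {name: act(state, name) for name, act in participants.items()}
        state = spec.rules(state, joint)
        for name in participants:
            totals[name] += spec.reward(state, name)
    return totals

# Hot-swap example: same loop, different world, by changing only the spec.
cooperative = EnvSpec(
    context="plan a heist together",
    rules=lambda state, joint: {**state, "tick": state["tick"] + 1},  # placeholder dynamics
    reward=lambda state, name: 1.0,                                   # placeholder shared reward
)
scores = run_episode(cooperative, {"human_1": lambda s, n: {}, "agent_1": lambda s, n: {}})
```

Swapping in a different `reward` callable (say, zero-sum instead of shared) turns the same world from a cooperative exercise into a competitive one, which is what makes spinning up a new eval a matter of days rather than quarters.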
Reinforcement fine-tuning of language models is still nascent, and there’s an industry-wide scarcity of rigorous interactive RL environments and evaluation rubrics. We provide the training substrate frontier labs can’t build alone: a world engine that spins up environments fast and proprietary interactive data from our creator-powered distribution flywheel. Labs bring the policy hooks; we bring the worlds, the players, and the RL data factory.
Want to learn more? Reach out at shreyko@4wall.ai