Premier League Betting Study Shows Leading AI Models Lose Money Consistently
Zero Signal Staff
Published April 13, 2026 at 6:08 AM ET

Ars Technica
Eight leading AI systems from Google, OpenAI, Anthropic, and xAI all lost money when tasked with predicting outcomes and placing bets on the 2023–24 Premier League season, according to a study released this week by AI startup General Reasoning. The test used detailed historical team data and statistics to evaluate how well the models could adapt to real-world conditions over an extended period.
General Reasoning conducted the "KellyBench" study by creating a virtual recreation of the Premier League season and providing each AI system with £100,000 in normalized bankroll. The models were instructed to build prediction systems that would maximize returns while managing risk, then place bets on match outcomes and goal totals across the season. Each AI received three separate attempts and could not access the internet to retrieve live results or updates.
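The study does not publish the models' staking logic, but the benchmark's name references the Kelly criterion, the standard formula for sizing a bet as a fraction of bankroll given an estimated win probability and the offered odds. A minimal sketch of that calculation (the function name and the example probability and odds are illustrative, not taken from the study):

```python
def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Kelly criterion stake as a fraction of bankroll.

    p_win: the bettor's estimated probability that the bet wins.
    decimal_odds: bookmaker decimal odds (total payout per unit staked).
    Returns 0.0 when the bet has no positive expected edge.
    """
    b = decimal_odds - 1.0          # net winnings per unit staked
    q = 1.0 - p_win                 # probability of losing
    f = (b * p_win - q) / b         # Kelly fraction: edge divided by odds
    return max(f, 0.0)              # never stake on a negative-edge bet

# Example: a model estimates a 55% chance of a home win at decimal odds of 2.10
bankroll = 100_000.0
stake = bankroll * kelly_fraction(0.55, 2.10)  # roughly a 14% stake
```

Betting the full Kelly fraction maximizes long-run growth only if the probability estimates are accurate; systematically overconfident estimates lead to oversized stakes and, over a full season, exactly the kind of ruin the study reports.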
Anthropic's Claude Opus 4.6 performed best among the eight systems tested, averaging an 11 percent loss across attempts and nearly breaking even on one try with a final bankroll of £89,035. Google's Gemini 3.1 Pro showed the highest variance, turning a 34 percent profit on one attempt while going bankrupt on another. xAI's Grok 4.20 performed worst, going bankrupt on all three attempts and finishing with £0.
The study's authors concluded that "every frontier model we evaluated lost money over the season and many experienced ruin," with the AI systems "systematically underperforming humans" in this scenario. Ross Taylor, General Reasoning's chief executive and one of the study's authors, said the results highlight a gap between AI's performance on static benchmarks and its ability to handle real-world complexity over time. "There is so much hype about AI automation, but there's not a lot of measurement of putting AI into a long time horizon setting," Taylor said.
The paper has not yet undergone peer review. Taylor, a former Meta AI researcher, emphasized that while AI has made significant advances in software engineering tasks, many real-world activities with longer time horizons remain areas where AI performs poorly. He noted that most existing AI benchmarks operate in "static environments" that bear little resemblance to the unpredictability of actual events.
Context
AI performance evaluation has traditionally relied on benchmarks designed to test narrow capabilities in controlled settings. The KellyBench study departs from this approach by simulating a full season of outcomes where conditions change continuously, requiring models to adapt their strategies as new player data and match results emerge. This mirrors the extended decision-making horizons that define fields like finance, where AI adoption has accelerated despite limited evidence of consistent outperformance.
The study arrives as Silicon Valley has celebrated AI's recent breakthroughs in software engineering and code generation, where systems like OpenAI's o1 and Claude have demonstrated the ability to complete complex programming tasks with minimal human input. General Reasoning's findings suggest these capabilities do not necessarily transfer to domains requiring sustained prediction and adaptation, such as sports betting or financial forecasting.
What's Next
The study raises questions about which real-world tasks are suitable for AI automation versus those where human judgment remains more reliable. General Reasoning's work may prompt other researchers to develop benchmarks that test AI performance over longer time horizons and changing conditions, rather than relying on static evaluations. Taylor's findings could influence how enterprises evaluate AI tools before deploying them in finance, logistics, and other fields where decisions accumulate over months or years rather than minutes.
