Premier League Betting Test Exposes Limits of Advanced AI Systems
Zero Signal Staff
Published April 13, 2026 at 12:24 AM ET

Eight leading AI models lost money attempting to predict Premier League soccer match outcomes during the 2023-24 season in a study released this week by General Reasoning. The test, which provided the systems with historical team data and instructed them to maximize returns while managing risk, found that every model underperformed human bettors, with xAI's Grok 4.20 losing its entire £100,000 starting bankroll across all three attempts.
General Reasoning, a London-based AI start-up, conducted the "KellyBench" study by creating a virtual recreation of the Premier League season and tasking eight AI systems with building profitable betting models. The AI agents received detailed historical statistics about teams and previous matches, then placed bets on match outcomes and goal totals as the season progressed. Each model started with a normalized £100,000 bankroll and received three attempts to turn a profit.
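The benchmark's name points to the Kelly criterion, the classic formula for sizing bets so that returns compound without risking ruin — the "maximize returns while managing risk" instruction the models were given. As an illustrative sketch only (the odds, probabilities, and function names below are invented for this example, not taken from the paper):

```python
def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Kelly stake as a fraction of bankroll: f = (b*p - q) / b,
    where b is the net odds (decimal_odds - 1) and q = 1 - p_win.
    A non-positive result means the bet has no edge: stake nothing."""
    b = decimal_odds - 1.0
    q = 1.0 - p_win
    f = (b * p_win - q) / b
    return max(f, 0.0)

bankroll = 100_000.0  # each model's normalized starting bankroll

# Hypothetical example: a model believes a home win is 55% likely,
# offered at decimal odds of 2.10 (implied probability ~47.6%).
f = kelly_fraction(0.55, 2.10)   # ~0.141 of the bankroll
stake = bankroll * f             # ~£14,091
```

A model that overestimates its edge stakes too much and can lose its bankroll quickly, which is one plausible mechanism behind the bankruptcies reported in the study.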
Anthropic's Claude Opus 4.6 performed best among the systems tested, averaging an 11 percent loss across its three attempts but nearly breaking even on one try with a final bankroll of £89,035. OpenAI's GPT-5.4 averaged a 13.6 percent loss. Google's Gemini 3.1 Pro achieved a 33.7 percent profit on its best attempt but went bankrupt on another, resulting in a mean loss of 43.3 percent. The worst performers included xAI's Grok, which lost 100 percent of its bankroll on every attempt, and Acree Trinity, which also went bankrupt entirely.
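The per-attempt figures above can be cross-checked with simple arithmetic. This is a back-of-the-envelope sketch, not data from the paper; the third Gemini attempt's return is implied by the reported mean, not separately reported:

```python
# Returns expressed as fractions of the £100,000 starting bankroll.
# Gemini 3.1 Pro: +33.7% on its best run, -100% (bankrupt) on another,
# and a reported mean loss of 43.3% across three attempts.
# The remaining attempt's return r must satisfy:
#   (0.337 - 1.0 + r) / 3 = -0.433
r = 3 * -0.433 - 0.337 + 1.0   # implied return of the third attempt

# r works out to roughly -0.636, i.e. about a 63.6% loss.
```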
The study's authors concluded that "every frontier model we evaluated lost money over the season and many experienced ruin," with AI "systematically underperforming humans" in the scenario. Ross Taylor, General Reasoning's chief executive and a former Meta AI researcher, said the results highlight a gap between AI's demonstrated capabilities in narrow tasks like software engineering and its struggles with real-world problems that unfold over extended time periods.
Taylor noted that existing AI benchmarks often test systems in "static environments" that do not reflect the complexity and uncertainty of actual decision-making scenarios. He stated that while software engineering is economically important, "there are lots of other activities with longer time horizons that are important to look at." The paper has not yet undergone peer review.
Context
The study provides perspective on the gap between AI's recent advances in specific domains and its broader limitations. In 2024 and 2025, major AI systems demonstrated significant progress in code generation and software engineering tasks, leading to widespread speculation in Silicon Valley about AI's potential to automate white-collar work. Soccer prediction, however, presents a different challenge: it requires analyzing incomplete information, adapting to unexpected events like player injuries or transfers, and managing uncertainty over a 38-match season.
Sports betting has historically been difficult even for professional analysts and quantitative models. The Premier League's competitive balance and the number of variables affecting match outcomes—team form, weather, referee decisions, and tactical adjustments—create a prediction problem that resists static analytical approaches. The fact that Claude Opus 4.6 limited its losses to 11 percent while Grok lost everything suggests that model architecture and training approach influence performance on real-world forecasting tasks.
What's Next
The KellyBench study will likely inform how researchers evaluate AI systems going forward, particularly in assessing whether high performance on standardized benchmarks translates to competence in dynamic, real-world environments. General Reasoning plans to publish the full peer-reviewed paper, which could shift how technology companies and investors assess AI's readiness for deployment in domains like financial trading, supply chain management, and strategic planning where long-term decision-making under uncertainty is essential. The results may also temper expectations around AI automation in roles requiring sustained judgment over extended periods.