TL;DR

A recent test pitted the open-source foundation model Kronos against a traditional geometric Brownian motion model on 5-minute Bitcoin predictions. On out-of-sample trades, Kronos's Brier score was statistically indistinguishable from the Brownian baseline's, so it showed no clear edge. The result calls into question how much complex models add in short-term crypto forecasting.

Recent testing of Kronos, an open-source foundation model trained on candles from 45 global exchanges, shows it does not outperform a traditional Brownian motion model in predicting 5-minute Bitcoin price movements. The finding challenges the assumption that modern learned models automatically forecast better at short trading horizons.

The test compared Kronos-small, a 24.7-million-parameter model, against a geometric Brownian motion baseline on one question: will Bitcoin close the five-minute window above its open price? It used 497 paired historical trades and scored both models on Brier score, log-loss, and hypothetical profit.

The results showed that Kronos's predictive accuracy was statistically indistinguishable from the Brownian baseline. On the 249 out-of-sample trades, the Brier scores were 0.188 for Brownian and 0.189 for Kronos, a gap inside the noise band of repeated runs. On the full 497-trade sample, the market-implied probabilities from Polymarket's order book scored between the two models (Brier 0.211, against 0.193 for Brownian and 0.213 for Kronos).

Despite expectations that a learned model trained on millions of candles would outperform a 100-year-old mathematical assumption, the study found no evidence of Kronos providing a trading edge over the simple Brownian model for this specific short-term horizon. As a result, the authors concluded that deploying Kronos into a live trading bot would not currently improve performance.

Polybot Week 3 — Kronos vs Brownian — Thorsten Meyer AI
Research Series · Foundation Model vs Classical Baseline · 2026-05-17

Foundation model vs Brownian motion. Kronos on five-minute BTC.

A modern learned model just lost to math from 1900. On 497 paired trades. Stage 2 is not happening.
Polybot's fair-value strategy uses a 1900s geometric Brownian model to price 5-minute BTC outcomes. The natural follow-up after two weeks of negative parametric results: would a modern learned model trained on millions of real candles do better? The credible candidate: Kronos, an open-source MIT-licensed foundation model with 25,000+ GitHub stars, an AAAI 2026 paper, four sizes from 4M to 499M parameters, and training on candles from 45 global exchanges.
Test design: 497 paired (FILL→SETTLE) trades, Brownian baseline reconstructed line-for-line, Kronos-small (24.7M params) sampled with 16 forecast paths, scored on Brier + log-loss + hypothetical P&L, chronologically split for out-of-sample discipline. On 249 out-of-sample trades: Brownian 0.188 Brier vs Kronos 0.189 Brier. Gap 0.0011. Statistically indistinguishable. Stage 2 is not happening.
But the paradox is more interesting than the verdict: when used as a directional signal Kronos fires 28% less often and wins 60.7% vs Brownian's 49.1%. That makes it a slightly better trader on hypothetical P&L, even while systematically over-confident in the tails (predicts 2.4% chance → actual 20.4% win; predicts 84% → actual 69.6%). The negative result is the answer. The methodology is what gets published.
This is not financial advice. Nothing in this article should be used to inform real trading decisions. The bot trades simulated money. If you build something like it and run it with real funds, the most likely outcome — by a wide margin — is that you lose those funds. That holds whether you use a Brownian model, a 100-million-parameter foundation model, or any other forecaster.
497 · Paired (FILL→SETTLE) trades · all BTC · 5-min Up/Down markets
0.0011 · Out-of-sample Brier-score gap · 249 trades · statistically indistinguishable
~2× · Kronos log-loss vs Brownian · signature of confident wrong predictions
+$538 / +$465 · Hypothetical Kronos vs Brownian P&L · the paradox · 60.7% vs 49.1% win rates
POLYBOT WEEK 3 · KRONOS-SMALL · 24.7M PARAMS · BROWNIAN BASELINE · 497 PAIRED TRADES · BTC · POLYMARKET 5-MIN UP/DOWN · BRIER 0.193 / 0.211 / 0.213 · LOG-LOSS 0.567 / 0.604 / 1.080 · OUT-OF-SAMPLE 0.188 vs 0.189 · GAP 0.0011 · INDISTINGUISHABLE · STAGE 2 NOT HAPPENING · KRONOS BETTER TRADER · WORSE FORECASTER · 60.7% vs 49.1% WIN RATE · TAILS: 2.4% → 20.4% · 84% → 69.6% · POLYBOT MIT · KRONOS MIT · AAAI 2026 PAPER · 25K+ STARS · 11 MIN MAC M-SERIES · MPS BACKEND · 1,300 LINES OF PYTHON · RESEARCH_PIPELINE.MD PUBLIC · SAME GAUNTLET · DIFFERENT MODEL
FIG. 01 — THE TEST PIPELINE
Five steps · for every paired (FILL → SETTLE) trade in the running session
~1,300 lines of Python · 11 minutes on Mac M-series with PyTorch MPS · methodology public, specific numbers local
1. Reconstruct OHLCV context of the 60 minutes leading up to fire-time. Pull from the bot's local Binance recording where available; fall back to Binance's public klines API otherwise. Cache to parquet so re-runs cost nothing.
2. Recompute the Brownian baseline in Python: a line-for-line port of the bot's own fairValuePUp(spot, openPrice, secondsLeftFrac, windowVol) formula. Matches scipy.stats.norm.cdf to three decimal places.
3. Read off the market-implied probability from the FILL price: what Polymarket's order book thought the side was worth at the moment of fire. The market's view as a reference point.
4. Run Kronos-small (24.7M parameters) on the OHLCV context · sample 16 forecast paths to the window's end · count the fraction in which the underlying closes above the open price. That fraction is Kronos's predicted p(Up).
5. Record (p_brownian, p_market, p_kronos, actual_outcome, P&L). Score on Brier + log-loss + hypothetical P&L. Sort chronologically · split into first/second half · report on both halves separately.
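Steps 2 and 4 each reduce to a small calculation, sketched below. This is not the bot's code: the body of `brownian_p_up` assumes zero drift on the log-price and assumes `window_vol` is the log-return standard deviation over the full window, neither of which the article states. `path_fraction_p_up` and its `final_closes` argument are hypothetical names for the last close of each sampled Kronos path; the Kronos call itself is not shown.

```python
import math


def brownian_p_up(spot, open_price, seconds_left_frac, window_vol):
    """P(close > open) under zero-drift Brownian motion on log-price.

    Assumption: window_vol is the log-return standard deviation over the
    whole window, so the remaining volatility scales with sqrt(time left).
    """
    if seconds_left_frac <= 0 or window_vol <= 0:
        # No time or no volatility left: the outcome is already decided.
        return 1.0 if spot > open_price else 0.0
    sigma_left = window_vol * math.sqrt(seconds_left_frac)
    z = math.log(spot / open_price) / sigma_left
    # Standard normal CDF written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def path_fraction_p_up(final_closes, open_price):
    """Fraction of sampled forecast paths that end above the open price."""
    ups = sum(1 for close in final_closes if close > open_price)
    return ups / len(final_closes)
```

With the spot exactly at the open, `brownian_p_up(100, 100, 0.5, 0.001)` returns 0.5. With 16 sampled paths the path fraction can only move in steps of 1/16, so it can land on exactly 0 or 1.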
The discipline that matters: if a model wins on the first half but ties or loses on the second, that’s the curve-fit-in-slow-motion pattern the previous two articles named, and it doesn’t count as edge. The whole pipeline is reproducible from docs/RESEARCH_PIPELINE.md. Any future candidate model gets a sibling directory in research//, reuses the same Brownian baseline, the same trade-log loader, the same OHLCV fetcher, the same metrics, the same out-of-sample split. Same gauntlet, different model, same discipline.
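The chronological split described above can be sketched in a few lines. The record layout (`fill_time`, `outcome`, and the probability keys) is hypothetical, and the midpoint rule is an assumption: the article says only that trades are sorted and split into first and second halves.

```python
def chronological_halves(trades, time_key="fill_time"):
    """Sort trades by time and split into (first_half, second_half)."""
    ordered = sorted(trades, key=lambda t: t[time_key])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]


def brier_by_half(trades, prob_keys, outcome_key="outcome"):
    """Brier score of each probability column, reported per half."""
    report = {}
    halves = chronological_halves(trades)
    for name, half in zip(("first", "second"), halves):
        report[name] = {
            key: sum((t[key] - t[outcome_key]) ** 2 for t in half) / len(half)
            for key in prob_keys
        }
    return report
```

With 497 trades, floor division puts 248 in the first half and 249 in the second, which is consistent with the 249-trade test half reported here. Nothing is shuffled: the second half is strictly later in time than the first.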
FIG. 02 — FULL-SAMPLE SCORING · 497 PAIRED TRADES
Three models · two probability-scoring metrics
Brier score and log-loss · the standard scoring rules for probability forecasts · lower is better
Model | Brier ↓ | Log-loss ↓
Brownian (geometric Brownian motion, the 1900s baseline) | 0.193 | 0.567
Market-implied (Polymarket order book at FILL, reference) | 0.211 | 0.604
Kronos (24.7M-param foundation model, 16 sampled forecast paths) | 0.213 | 1.080
Kronos’s log-loss is roughly twice Brownian’s — the signature of a model that makes confident, wrong predictions in the tails. Polymarket’s order book sits between the two, reasonably calibrated, slightly worse than the bot’s Brownian and slightly better than the foundation model. The 100-year-old math beat the 24.7M-parameter foundation model on both probability-scoring metrics.
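For reference, both scoring rules are short enough to write out. The sketch below uses the standard definitions; the `eps` clipping in `log_loss` is an assumption added so that a forecast of exactly 0 or 1 does not produce an infinite penalty, and the article does not say whether or how the pipeline clips.

```python
import math


def brier_score(probs, outcomes):
    """Mean squared error between forecast probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def log_loss(probs, outcomes, eps=1e-6):
    """Mean negative log-likelihood of the 0/1 outcomes.

    eps is an assumed clipping constant, not a value from the article.
    """
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)
```

The Brier penalty for a single forecast is capped at 1, while the log-loss penalty grows without bound as a wrong forecast approaches certainty. That is why a handful of confident misses can double log-loss while barely moving the Brier score.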
FIG. 03 — OUT-OF-SAMPLE VERDICT · 249-TRADE TEST HALF
Chronologically-separated · never seen by tuning
The verdict the test was designed to deliver · noise band of repeated runs with different sampling seeds
Brownian · 249-trade test half: 0.188 Brier score (out-of-sample) · lower is better
Kronos · 249-trade test half: 0.189 Brier score (out-of-sample) · lower is better
The gap: 0.0011 · statistically indistinguishable · inside the noise band
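One way to turn "noise band of repeated runs with different sampling seeds" into a check is sketched below. The article does not say how its band was computed, so everything here is an assumption: the band is taken as the sample standard deviation of the Brier score across seeded runs, and `n_sigma=2.0` is an arbitrary threshold. `prob_runs` is a hypothetical list holding one probability list per seed.

```python
from statistics import mean, stdev


def brier_score(probs, outcomes):
    """Mean squared error between forecast probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def seed_noise_band(prob_runs, outcomes):
    """Mean and spread of the Brier score across differently seeded runs."""
    scores = [brier_score(run, outcomes) for run in prob_runs]
    return mean(scores), stdev(scores)  # stdev needs at least two runs


def within_noise(baseline_score, prob_runs, outcomes, n_sigma=2.0):
    """True if the baseline sits inside the sampled model's noise band."""
    centre, spread = seed_noise_band(prob_runs, outcomes)
    return abs(centre - baseline_score) <= n_sigma * spread
```

The Brownian baseline is deterministic, so only the sampled model contributes seed-to-seed variation in this sketch.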
Kronos does not beat Brownian on a held-out chronologically-separated sample. So Stage 2 is not happening.
“Stage 2” was the planned next step: wiring Kronos into Polybot as a live strategy if Stage 1 produced a clear signal. This data does not earn the case for it. For 5-minute BTC at the horizons the bot trades, the open Kronos-small checkpoint does not beat the Brownian baseline. Stop. The next candidate model (Chronos · TimesFM · Lag-Llama · a Kronos finetune on 5-min crypto · something else) goes through the same gauntlet. Most will fail it. That is the gauntlet doing its job.
FIG. 04 — THE PARADOX · BETTER TRADER vs WORSE FORECASTER
By operational standards Kronos wins · by probabilistic standards Kronos loses
The hypothetical-P&L counterfactual replays the same data through “what if Polybot fired on each model’s probability”
Operational view · Kronos as the better trader
Kronos fires less · wins more · nets slightly more.
Hypothetical fires (Kronos): 201
Brownian fires (reference): 279
Win rate (Kronos): 60.7%
Win rate (Brownian): 49.1%
Hypothetical net P&L (Kronos): +$538
Hypothetical net P&L (Brownian): +$465
Fires ~28% less often and wins more reliably when it does. If you use Kronos as a directional signal in a broader system that does its own sizing — closer to how TradingAgents uses analyst outputs — the directional accuracy might still be useful.
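The hypothetical-P&L counterfactual can be sketched as a replay loop. The article does not publish Polybot's firing rule, sizing, or fees, so the rule below is an assumption for illustration only: fire when the model's probability for a side exceeds that side's fill price by `min_edge`, buy a fixed number of shares, and settle each share at 1 on a win and 0 on a loss. The record keys (`price`, `won`, and the probability key) are hypothetical.

```python
def replay(trades, prob_key, min_edge=0.05, shares=1.0):
    """Replay recorded trades under an assumed 'fire on edge' rule."""
    fires = 0
    wins = 0
    pnl = 0.0
    for trade in trades:
        edge = trade[prob_key] - trade["price"]
        if edge < min_edge:
            continue  # the model sees too little edge, so no fire
        fires += 1
        if trade["won"]:
            wins += 1
            pnl += shares * (1.0 - trade["price"])  # share settles at 1
        else:
            pnl -= shares * trade["price"]  # share settles at 0
    win_rate = wins / fires if fires else 0.0
    return {"fires": fires, "win_rate": win_rate, "pnl": pnl}
```

Running the same trade log through `replay` once per model gives the fires, win rate, and net P&L comparison shown in the cards above. Under this rule a model can fire less often and still net more, because only the trades it selects count.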
Probabilistic view · Kronos as the worse forecaster
Systematically over-confident in the tails.
Kronos predicts 2.4% · trades actually win 20.4%
Kronos predicts 84% · trades actually win 69.6%
Log-loss vs Brownian: ~2× worse
Brier (full sample): 0.213 vs 0.193
If you are building a fully-probabilistic system where the probability feeds an expected-value calculation against the market’s implied price — which is what Polybot does — calibration is everything, and Kronos’s calibration is bad enough to disqualify it. It thinks it knows more than it does at both ends.
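A calibration check of this kind is usually a reliability table: group forecasts into probability bins and compare the mean forecast in each bin with the realised win rate. The sketch below assumes equal-width bins, and `n_bins=5` is an arbitrary choice; the article does not say how its tail figures were binned.

```python
def reliability_table(probs, outcomes, n_bins=5):
    """Mean predicted probability vs realised win rate per probability bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[idx].append((p, y))
    rows = []
    for idx, items in enumerate(bins):
        if not items:
            continue  # skip bins with no forecasts
        rows.append({
            "bin": idx,
            "n": len(items),
            "predicted": sum(p for p, _ in items) / len(items),
            "actual": sum(y for _, y in items) / len(items),
        })
    return rows
```

For a well-calibrated forecaster, `predicted` and `actual` are close in every row. The tail figures quoted above (2.4% predicted against 20.4% realised, 84% against 69.6%) are the kind of gap such a table exposes.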
Both interpretations are honest. Neither earns the model a place in Polybot. One of them might earn it a place, later, in TradingAgents — as a 5th analyst voice that votes on direction without being trusted for calibrated odds. That experiment is not what this week tested; it is a separate hypothesis for a separate week.
FIG. 05 — WEEK FOUR · THREE POSSIBLE THREADS
Each is a separate article · the pattern across them is the same
Honest measurement · out-of-sample discipline · no rescue narratives when something doesn’t work
1. A second-tier candidate model · Amazon's Chronos (generalisation test). Same general shape as Kronos · different training corpus · also open-source. Running it through the exact same gauntlet would say whether the negative result is specific to Kronos or generalises to learned models in this regime.
2. Kronos with a finetune on 5-min crypto data (architecture vs distribution). The Kronos repo ships a finetuning pipeline. Taking the open Kronos-base checkpoint, finetuning on the bot's own recorded BTC tick history, re-testing. Isolates “is the pretrained distribution wrong for crypto?” from “is the architecture wrong for this horizon?”
3. A live-trading update on Polybot (status reset). The fleet has been running paper trades continuously across these three weeks. A fresh aggregate-P&L view, with the same calibration-style analysis applied to live performance rather than historical replay, is overdue.
The contract is “same gauntlet, different model, same discipline.” Specific numbers stay local. Methodology is public on the repo’s docs/RESEARCH_PIPELINE.md. Publishing reproducible parameter recipes for strategies that might be marginally profitable encourages people to copy them with real money, and the prior on real-money outcomes when copying retail strategies is “they lose.” Publishing the methodology lets the next person test their own model honestly without inheriting any of mine.

Implications for Short-Term Crypto Forecasting

This study challenges the assumption that advanced machine learning models inherently outperform traditional stochastic models in short-term crypto prediction. The results suggest that, at least for 5-minute Bitcoin trades, simple models like Brownian motion remain competitive, raising questions about the added value of complex models in fast-paced markets.

For traders and developers, this indicates that investing in sophisticated models may not always yield better results, especially when market conditions are highly noisy and unpredictable at such granular timeframes. It also underscores the importance of rigorous out-of-sample testing before deploying models in live trading environments.

Testing Modern Models Against Traditional Baselines

Historically, financial modeling has relied on assumptions like geometric Brownian motion to estimate price movements, dating back to early 20th-century mathematics. Recent advances have produced large-scale foundation models trained on vast datasets, promising improved forecasts. However, empirical validation in real trading scenarios remains limited.

This test builds on the two earlier weeks of the same series, which returned negative results for parametric approaches and named the curve-fit-in-slow-motion pattern that out-of-sample testing is meant to catch. The current test uses a transparent, open-source methodology and a chronologically separated out-of-sample half to assess whether Kronos can outperform the Brownian baseline in a realistic trading setting.

The study also reflects ongoing debates about the practical utility of AI in high-frequency trading and whether complex models can generalize beyond in-sample data, especially in markets characterized by rapid, unpredictable fluctuations.

“Kronos does not outperform the Brownian baseline on out-of-sample data for 5-minute Bitcoin forecasts. The results are statistically indistinguishable, calling into question the added value of modern learned models at this horizon.”

— Thorsten Meyer, researcher

Unclear Impact of Model Complexity on Short-Term Predictions

It remains uncertain whether different configurations of Kronos, larger model sizes, or alternative training data could yield better performance. Additionally, the results are specific to 5-minute Bitcoin trades and may not generalize to other assets or timeframes. Further research is needed to determine if learned models can outperform traditional stochastic models in different market conditions or longer horizons.

Next Steps in Model Evaluation and Market Testing

Three threads are candidates for Week Four: running Amazon's Chronos through the same gauntlet to see whether the negative result generalises beyond Kronos, finetuning Kronos on 5-minute crypto data to separate a training-distribution problem from an architecture problem, and a live-trading update on Polybot's paper-trading fleet. Traders and developers should read the current findings as a reminder to validate models rigorously before deployment.

Key Questions

Does this mean machine learning models are useless for crypto trading?

Not necessarily. The study shows that, for 5-minute Bitcoin predictions, Kronos does not outperform a simple Brownian model. However, models may perform better over different timeframes, assets, or with different configurations. Rigorous testing is essential before relying on any model for trading.

Why did Kronos not outperform the Brownian baseline?

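On the out-of-sample half, the two models' Brier scores were statistically indistinguishable (0.188 vs 0.189). On the full sample Kronos scored worse on both Brier and log-loss, with a log-loss roughly twice Brownian's, because it was over-confident in the tails. The test does not establish why; one plausible reading is that at this horizon market noise leaves little for a complex model to exploit.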

Could a different version of Kronos perform better?

It is possible that larger or differently trained versions of Kronos might yield improvements. Further testing with varied configurations and datasets is needed to explore this possibility.

What does this mean for traders using AI models?

Traders should be cautious and rely on thorough empirical validation. Complex models are not guaranteed to outperform simple baselines, especially over very short timeframes where market noise dominates.

Source: ThorstenMeyerAI.com
