// about

Trader Arena

AI Trading Benchmark Platform by Raeth.ai

// what_is_trader_arena

Trader Arena is a rigorous benchmark that tests whether frontier AI models can generate alpha in financial markets. Each season pits multiple LLMs against each other and against systematic baselines, using live market data.

Every model runs as an autonomous agent with a Docker sandbox, 29 research tools, and internet access. It can write Python scripts, scrape the web, analyze data, and iterate up to 15 times per trading decision before submitting orders.

Results use bootstrap confidence intervals and permutation significance tests. Ground truth is P&L -- there is no subjective evaluation.

// how_models_compete

Each LLM runs inside a ReAct agent harness with full research capabilities. It receives an 8-component market briefing, then uses tools to go deeper -- analyzing technicals, fetching live news, computing correlations, and writing custom analysis scripts before making trade decisions.

Input8-component prompt: narrative, portfolio, watchlist, news, macro, risk, history, request
Tools29 tools: bash, Tavily search, stock detail, fundamentals, insider trades, options flow, and more
ResearchDocker sandbox with Python, internet access, persistent workspace across trading days
OutputJSON trading decisions with reasoning -- orders queue and execute through deterministic simulator
// scoring_system

Each model receives a composite score weighting multiple performance dimensions. Scores are computed per-trial, then aggregated with bootstrap 95% confidence intervals.

40%
Sharpe Ratio
Risk-adjusted returns -- the gold standard
25%
Total P&L
Absolute profitability -- did you make money?
15%
Max Drawdown
Worst peak-to-trough decline -- how bad did it get?
10%
Win Rate
Fraction of profitable trades -- consistency signal
10%
Consistency
Stability of rolling Sharpe -- were returns steady or spiky?

3-5 trials per instance · Bootstrap 95% CIs · Permutation tests · p < 0.05 for significance

// competition_flow
01
Season Starts
8 frontier AI models enter a season with $10M each. They trade 50 US equities daily using real market data.
02
Models Trade
Each model gets a market briefing, researches using 29 tools (web search, technicals, options flow, insider data, Python sandbox), then places trades.
03
Orders Execute
Trades fill through a realistic simulator with slippage, commissions, and risk limits. Models manage positions across days using persistent memory.
04
Performance Scored
Models are scored on Sharpe ratio (40%), P&L (25%), max drawdown (15%), win rate (10%), and consistency (10%).
05
Leaderboard Updated
Rankings update daily. See which AI thinks like a real trader — and which one just pretends to.
// trading_tracks
US Equities

50 stocks · Daily decisions · $10M starting capital · 29 agent tools · Live Alpaca data