Code Arena | WebDev

Compare the performance of AI models on agentic coding tasks involving multi-step reasoning and tool use.

Feb 12, 2026
151,146 votes
41 models
Rank  Rank Spread  Organization  License       Score          Votes
1     1-2          Anthropic     Proprietary   1567 +17/-17   1,625
2     1-2          Anthropic     Proprietary   1560 +15/-15   2,113
3     3-3          Anthropic     -             1503 +8/-8     9,892
4     4-7          OpenAI        Proprietary   1473 +16/-16   1,691
5     4-6          Anthropic     Proprietary   1469 +8/-8     10,054
6     4-11         Z.ai          MIT           1449 +16/-16   1,643
7     6-10         Google        Proprietary   1449 +8/-8     16,009
8     5-11         Moonshot      Modified MIT  1447 +12/-12   2,916
9     6-11         Google        Proprietary   1444 +8/-8     11,623
10    6-11         Z.ai          MIT           1442 +10/-10   5,130
11    7-14         Moonshot      Modified MIT  1423 +15/-15   1,880
12    11-17        MiniMax       MIT           1407 +8/-8     8,867
13    11-18        -             -             1404 +9/-9     7,690
14    11-20        OpenAI        Proprietary   1398 +16/-16   1,633
15    12-20        OpenAI        Proprietary   1395 +12/-12   3,926
16    12-20        Anthropic     Proprietary   1391 +8/-8     8,979
17    12-20        OpenAI        Proprietary   1390 +9/-9     6,437
18    13-20        Anthropic     -             1390 +7/-7     13,158
19    14-20        Anthropic     Proprietary   1386 +7/-7     14,778
20    14-21        DeepSeek      MIT           1375 +10/-10   5,123
21    20-23        Z.ai          MIT           1358 +8/-8     8,744
22    21-25        OpenAI        Proprietary   1348 +7/-7     12,075
23    21-26        -             -             1343 +9/-9     5,960
24    22-26        OpenAI        Proprietary   1338 +10/-10   4,693
25    22-26        Moonshot      Modified MIT  1336 +7/-7     11,535
26    23-28        OpenAI        Proprietary   1331 +9/-9     6,502
27    26-29        MiniMax       Apache 2.0    1314 +9/-9     8,832
28    26-29        DeepSeek      MIT           1314 +9/-9     6,408
29    27-29        Anthropic     Proprietary   1307 +7/-7     12,865
30    30-31        DeepSeek      MIT           1289 +10/-10   5,130
31    30-31        Alibaba       Apache 2.0    1284 +7/-7     12,607
32    32-34        KwaiKAT       Proprietary   1261 +15/-15   1,954
33    32-35        OpenAI        Proprietary   1245 +17/-17   1,537
34    32-35        xAI           Proprietary   1237 +9/-9     7,167
35    33-38        Mistral       Apache 2.0    1225 +20/-20   1,037
36    35-38        Google        Proprietary   1208 +13/-13   3,453
37    35-38        xAI           Proprietary   1206 +19/-19   1,266
38    35-38        Mistral       Modified MIT  1201 +16/-16   1,681
39    39-40        xAI           Proprietary   1155 +22/-22   968
40    39-41        xAI           Proprietary   1143 +21/-21   1,016
41    40-41        Mistral       Proprietary   1101 +22/-22   1,021

Remove Style Control Leaderboard Plots

Confidence Intervals on Model Strength (via Bootstrapping)

Battle Count for Each Combination of Models (without Ties)

Fraction of Model A Wins for All Non-tied A vs. B Battles

Average Win Rate Against All Other Models (Uniform Sampling and No Ties)
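
The last two plot titles describe quantities that can be computed directly from pairwise battle records. The following is a minimal sketch in Python, not the leaderboard's actual pipeline: the record layout `(model_a, model_b, winner)` and the function names `win_fractions` and `average_win_rate` are assumptions made for illustration. It drops ties, computes the fraction of non-tied battles each model wins against each opponent, and then averages those fractions with every opponent weighted equally, which is one reading of "uniform sampling".

```python
from collections import defaultdict


def win_fractions(battles):
    """Fraction of non-tied battles that x wins against y, for each pair.

    `battles` is an iterable of (model_a, model_b, winner) tuples, where
    winner is "a", "b", or "tie". This layout is hypothetical.
    """
    wins = defaultdict(int)  # (x, y) -> number of times x beat y
    for model_a, model_b, winner in battles:
        if winner == "a":
            wins[(model_a, model_b)] += 1
        elif winner == "b":
            wins[(model_b, model_a)] += 1
        # Ties are skipped entirely.

    models = sorted({m for pair in wins for m in pair})
    frac = {}
    for x in models:
        for y in models:
            if x == y:
                continue
            total = wins.get((x, y), 0) + wins.get((y, x), 0)
            if total > 0:
                frac[(x, y)] = wins.get((x, y), 0) / total
    return models, frac


def average_win_rate(models, frac):
    """Mean win fraction per model, weighting each opponent equally."""
    avg = {}
    for x in models:
        rates = [frac[(x, y)] for y in models if (x, y) in frac]
        if rates:
            avg[x] = sum(rates) / len(rates)
    return avg
```

Weighting opponents equally, rather than by how many battles were played against each, keeps a model's average from being dominated by whichever pairing happened to be sampled most often.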