~ one prompt × fifteen models

Model performance

One prompt sent to fifteen models simultaneously. All data is from OpenRouter on 3 June 2026.

| Model | Write code | Explain code |
|---|---|---|
| gemini-3.1-pro | 0.6s, 0.70¢ | 3.1s, 1.02¢ |
| claude-opus-4.8 | 1.0s, 0.43¢ | 5.3s, 1.06¢ |
| kimi-k2.6 | 55.6s, 1.01¢ | 21.1s, 0.33¢ |
| glm-5.1 | 24.8s, 0.50¢ | 12.2s, 0.37¢ |
| qwen3.7-plus | 16.7s, 0.16¢ | 23.2s, 0.22¢ |
| minimax-m3 | 2.2s, 0.03¢ | 3.8s, 0.04¢ |
| mercury-2 | 0.4s, 0.03¢ | 0.3s, 0.02¢ |
| mistral-small-4 | 0.4s, 0.009¢ | 0.7s, 0.01¢ |
| deepseek-v4-flash | 4.3s, 0.01¢ | 2.0s, 0.006¢ |
| gpt-5.5 | 2.6s, BYOK | 3.7s, BYOK |

Task 1: write Stata code

Prompt

I have the following .dta file.

Variable      Storage   Display    Value
    name         type    format    label      Variable label
id              long    %12.0g                Firm identifier
name            str10   %10s                  Firm name
revenue         double  %10.0g                Annual revenue
year            int     %8.0g                 Fiscal year
      

Write a Stata .do script that computes the average revenue by year. Just return the .do file, no commentary.

Result columns: output tokens, wall-clock time, tokens/sec, total cost.

anthropic/claude-sonnet-4.5

use "auto.dta", clear
collapse (mean) revenue, by(year)
list

56 0.7s 84 t/s 0.19¢

Concise. Correct. No unnecessary commands.

anthropic/claude-opus-4.8

* Compute average revenue by year
use "auto.dta", clear

collapse (mean) avg_revenue = revenue, by(year)

list year avg_revenue, clean noobs

67 1.0s 70 t/s 0.43¢

Most expensive model per token ($5/$25 per M). Clean output with a renamed variable and clean noobs formatting. Added a comment despite the "no commentary" instruction.

google/gemini-3.1-pro

use "auto.dta", clear

// Compute the average revenue by year
collapse (mean) revenue, by(year)

// Display the results
list

524 0.6s 924 t/s 0.70¢

Fastest frontier model (924 t/s). Correct but added comments. Most tokens spent on reasoning. Used // comments (valid Stata).

x-ai/grok-4.20

use "auto.dta", clear
tabstat revenue, by(year) statistics(mean)

2,226 16.3s 137 t/s 0.60¢

Used tabstat instead of collapse — displays averages without destroying the dataset. Valid alternative. 2,226 tokens mostly reasoning.
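The difference between the two approaches can be sketched as follows. This is a minimal illustration, assuming a dataset with revenue and year is already in memory; the variable name avg_revenue is just the one several models chose.

```stata
* Option 1: tabstat prints the yearly means and leaves the data untouched.
tabstat revenue, by(year) statistics(mean)

* Option 2: collapse replaces the data in memory with one row per year,
* so wrap it in preserve/restore to get the original observations back.
preserve
collapse (mean) avg_revenue = revenue, by(year)
list year avg_revenue, noobs
restore
```

tabstat is enough when the averages only need to be read off the screen; collapse is the better fit when they are needed as a dataset for later steps.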

openai/gpt-5.5

clear all
use "auto.dta", clear
collapse (mean) avg_revenue=revenue, by(year)
list year avg_revenue, noobs

148 2.6s 57 t/s 0.00¢

Renamed output variable. Added noobs. Cost shows 0 — BYOK (own OpenAI key).

google/gemini-3.5-flash

* Load the dataset
use "auto.dta", clear
* Compute the average revenue by year
collapse (mean) revenue, by(year)
* List the results
list year revenue

954 2.5s 381 t/s 0.91¢

Correct but verbose. Added comments despite "no commentary" instruction. 954 tokens for 6 lines — most of it reasoning.

minimax/minimax-m3

clear all
use "auto.dta"
collapse (mean) avg_revenue=revenue, by(year)
list

123 2.2s 57 t/s 0.03¢

Clean. use lacks the , clear option, which is harmless here because clear all runs first.

qwen/qwen3.7-plus

clear
use "auto.dta", clear
collapse (mean) avg_revenue = revenue, by(year)
list, clean

929 16.7s 56 t/s 0.16¢

Correct. 929 tokens because of long reasoning chain. Redundant clear before use ..., clear.

inception/mercury-2

clear all
set more off
use "auto.dta", clear
preserve
collapse (mean) revenue, by(year)
list year revenue
save "avg_rev_by_year.dta", replace
restore

289 0.4s 710 t/s 0.03¢

Lowest wall-clock time (0.4s, tied with mistral-small-4), from a diffusion LLM. Added preserve/restore and saved output. Overzealous but not wrong.

deepseek/deepseek-v4-flash

clear all
set more off
use auto.dta
collapse (mean) revenue, by(year)
list

284 4.3s 65 t/s 0.01¢

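One of the cheapest responses (0.01¢), though mistral-small-4 and gemma-4-26b-a4b-it cost less. Filename left unquoted, which Stata accepts when the path has no spaces. set more off unnecessary in modern Stata.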

mistralai/mistral-small-4

* Compute average revenue by year

preserve

collapse (mean) avg_revenue=revenue, by(year)

save "avg_revenue_by_year.dta", replace

restore

display as result "Average revenue by year saved to 'avg_revenue_by_year.dta'"

exit

64 0.4s 156 t/s 0.009¢

Fastest cheap model (0.4s, $0.15/M input). Used preserve/restore, saved output to .dta, added display confirmation. No use — assumes data in memory. Added exit. Comment despite "no commentary."

google/gemma-4-26b-a4b-it

use "auto.dta", clear
statsby avg_revenue = r(mean), by(year): summarize revenue
list

58 1.5s 38 t/s 0.005¢

Used statsby instead of collapse. Valid but unusual. Cheapest model at $0.06/M input.

moonshotai/kimi-k2.6

collapse (mean) revenue, by(year)

2,438 55.6s 44 t/s 1.01¢

One line of output after 2,400 tokens of reasoning. Slowest and most expensive. The reasoning was thorough but the output omits use.
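A script that assumes data is already in memory could guard against the opposite case. The sketch below is hypothetical and not taken from any model's output; it reuses the "auto.dta" filename that the other responses used, which should be replaced with the real path.

```stata
* Hypothetical guard: load the file only if the expected variables
* are not already in memory.
capture confirm variable revenue year
if _rc {
    use "auto.dta", clear
}
collapse (mean) revenue, by(year)
list
```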

z-ai/glm-5.1

use "auto.dta", clear
collapse (mean) revenue, by(year)
list

1,072 24.8s 43 t/s 0.50¢

Correct output. Long reasoning chain inflated token count and cost.

liquid/lfm-2.5-1.2b-instruct:free

Error: Invalid request body

— 0.9s — free

The request to the 1.2B-parameter model failed with an API error before any code was generated. You get what you pay for.

Reverse task: explain this code

Mostly the same models, but now reading code instead of writing it. A comprehension task tests different capabilities.

Prompt

What does this Stata code do? Explain in at most 100 words.

* Benchmark: collapse on 10M rows
* Tests: CSV read, filter, generate, collapse with by()
use "test/data/large.csv", clear
keep if year >= 2020
generate profit = revenue - cost
collapse (mean) avg_revenue = revenue avg_profit = profit ///
    (sum) total_revenue = revenue (count) n = id, by(sector year)
sort sector year
      

Result columns: output tokens, wall-clock time, tokens/sec, total cost.

anthropic/claude-opus-4.8

Loads data, filters to year ≥ 2020, creates profit variable, aggregates by sector and year (mean revenue/profit, sum revenue, count), then sorts.

Note: use expects a .dta file; reading an actual .csv would require import delimited.

344 5.3s 65 t/s 1.06¢

The only model that caught the bug: use cannot read .csv files. This is the kind of domain knowledge that justifies a frontier model.
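For reference, the fix Opus points to might look like the sketch below. It assumes test/data/large.csv has a header row naming id, sector, year, revenue, and cost, which the prompt implies but does not confirm.

```stata
* Benchmark with the load step corrected: import delimited reads
* delimited text files, whereas use only reads Stata .dta files.
* Assumes the CSV header row supplies the variable names used below.
import delimited "test/data/large.csv", clear
keep if year >= 2020
generate profit = revenue - cost
collapse (mean) avg_revenue = revenue avg_profit = profit ///
    (sum) total_revenue = revenue (count) n = id, by(sector year)
sort sector year
```

Everything after the first command is unchanged from the prompt; only the load step differs.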

google/gemini-3.1-pro

Loads a large dataset. Filters to year ≥ 2020. Calculates profit. Aggregates by sector and year — computing average revenue/profit, total revenue, and count. Sorts the result.

803 3.1s 258 t/s 1.02¢

Correct, verbose. Did not catch the .csv bug. Most tokens spent on reasoning.

openai/gpt-5.5

Loads large.csv, keeps year ≥ 2020, creates profit as revenue − cost, aggregates by sector and year. Computes mean revenue/profit, total revenue, counts nonmissing id. Sorts result. Comments indicate it benchmarks these operations on 10M rows.

126 3.7s 34 t/s 0.00¢

Clean, concise, correct. Mentioned the benchmark context from the comments. Did not catch .csv bug. BYOK.

mistralai/mistral-small-4

Loads CSV with 10M rows. Keeps year ≥ 2020. Generates profit = revenue − cost. Collapses by sector and year: mean revenue/profit, sum revenue, count of observations. Sorts by sector and year. Output: aggregated dataset with summary stats per sector-year.

159 0.7s 228 t/s 0.01¢

Fast, cheap, correct. Mentioned "output is aggregated dataset" — good framing. 0.7s wall clock.

inception/mercury-2

Load data → filter year ≥ 2020 → create variable → collapse to one record per sector-year (mean, sum, count) → sort. Discards existing dataset.

248 0.3s 800 t/s 0.02¢

Fastest again: 800 t/s, 0.3s wall clock. Verbose formatted output for a short explanation.

deepseek/deepseek-v4-flash

Reads large.csv, filters to year ≥ 2020, creates profit as revenue − cost, collapses data by sector and year — computing mean revenue/profit, total revenue, and observation count per group. Sorts results.

89 2.0s 45 t/s 0.006¢

Shortest and cheapest. 89 tokens, 0.006¢. Perfectly adequate explanation.

xiaomi/mimo-v2.5

Benchmarks performance on 10M rows. Loads CSV, filters year ≥ 2020, computes profit. Aggregates by sector and year: mean revenue/profit, total revenue, count of observations. Sorts result.

332 3.6s 93 t/s 0.01¢

Clean numbered list. Xiaomi’s model at $0.14/M input — cheaper than most frontier models.

stepfun/step-3.7-flash

Loads large.csv, keeps observations with year ≥ 2020, creates profit as revenue minus cost. Collapses data by sector and year, computing mean revenue and profit, total revenue, and count of ids. Sorts by sector and year. Tests performance of filtering, variable generation, and grouped aggregation on ~10 million rows.

957 6.8s 140 t/s 0.12¢

957 tokens for a 50-word output, second only to Qwen's 1,283. Good explanation though: it mentioned the benchmark context.

qwen/qwen3.7-plus

Processes large CSV in five steps: loads data, filters year ≥ 2020, generates profit variable, aggregates by sector and year via collapse (mean, sum, count), sorts result. (68 words)

1,283 23.2s 55 t/s 0.22¢

1,283 tokens for a 68-word answer. Included "(Word count: 68 words)" in its output. Spent 1,200 tokens reasoning about word count.

moonshotai/kimi-k2.6

Benchmarks processing a 10-million-row CSV. Loads data, keeps rows from 2020 onward, calculates profit as revenue minus cost, then aggregates by sector and year — computing average revenue and profit, total revenue, and row count. Sorts by sector and year.

891 21.1s 42 t/s 0.33¢

45-word answer after 891 tokens of reasoning. Slow (21s) but correct.

z-ai/glm-5.1

Benchmarks performance on 10M rows. Reads CSV, filters year ≥ 2020, generates profit variable, collapses by sector and year (mean revenue/profit, total revenue, ID counts), sorts results.

755 12.2s 62 t/s 0.37¢

Correct. Long reasoning chain. Used LaTeX math notation in output ($\ge$ 2020).

minimax/minimax-m3

Benchmarks data processing on 10M-row CSV. use loads data. keep filters to recent years. generate creates derived column. collapse aggregates per sector-year: mean revenue/profit, total revenue, count of IDs. sort orders result.

282 3.8s 75 t/s 0.04¢

Good structure. Referenced each Stata command by name. Clean and cheap.

x-ai/grok-code-fast-1

Error: Model deprecated. xAI recommends Grok 4.3.

— 0.4s — —

Deprecated model. Models come and go — another reason not to lock in.

What this shows

Every model that returned an answer produced a usable one on both tasks, writing and reading Stata code. From Opus 4.8 to DeepSeek V4 Flash, the core competence is commoditized.
Only Opus 4.8 caught the bug: the prompt uses use to load a .csv file, which is wrong in Stata (it should be import delimited). Domain expertise still differentiates frontier models.
Cost varies by roughly 180x for the same answer: DeepSeek V4 Flash cost 0.006¢ for the explanation task, Opus 4.8 cost 1.06¢. For routine comprehension, the cheap model's answer is indistinguishable.
Reasoning tokens often waste money: Qwen spent about 1,200 tokens reasoning about word count to produce a 68-word answer. Step 3.7 Flash used 957 tokens for 50 words. Sonnet 4.5 spent 56 tokens and got the writing task right. More thinking is not better thinking.
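The cost spread and a throughput figure can be checked directly from the numbers reported above. This is a quick sanity check using Stata's display as a calculator, not part of the original benchmark.

```stata
* Explanation task: Opus 4.8 cost divided by DeepSeek V4 Flash cost (both in cents).
display 1.06 / 0.006

* Writing task: Grok 4.20 output tokens divided by wall-clock seconds.
display 2226 / 16.3
```

The first ratio comes out near 177, and the second near the 137 t/s reported for Grok. Other rows match less exactly, presumably because the displayed times are rounded.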