~ one prompt × fifteen models

Model performance

One prompt, fifteen models, simultaneous. All data from OpenRouter on 3 June 2026.

Model
Write code
Explain code
gemini-3.1-pro
0.6s0.70¢
3.1s1.02¢
claude-opus-4.8
1.0s0.43¢
5.3s1.06¢
kimi-k2.6
55.6s1.01¢
21.1s0.33¢
glm-5.1
24.8s0.50¢
12.2s0.37¢
qwen3.7-plus
16.7s0.16¢
23.2s0.22¢
minimax-m3
2.2s0.03¢
3.8s0.04¢
mercury-2
0.4s0.03¢
0.3s0.02¢
mistral-small-4
0.4s0.009¢
0.7s0.01¢
deepseek-v4-flash
4.3s0.01¢
2.0s0.006¢
gpt-5.5
2.6sBYOK
3.7sBYOK

Task 1: write Stata code

Prompt

I have the following .dta file.

Variable Storage Display Value name type format label Variable label id long %12.0g Firm identifier name str10 %10s Firm name revenue double %10.0g Annual revenue year int %8.0g Fiscal year

Write a Stata .do script that computes the average revenue by year. Just return the .do file, no commentary.

Output tokens
Wall-clock time
Tokens/sec
Total cost
anthropic/claude-sonnet-4.5
use "auto.dta", clear collapse (mean) revenue, by(year) list
56 0.7s 84 t/s 0.19¢
Concise. Correct. No unnecessary commands.
anthropic/claude-opus-4.8
* Compute average revenue by year use "auto.dta", clear collapse (mean) avg_revenue = revenue, by(year) list year avg_revenue, clean noobs
67 1.0s 70 t/s 0.43¢
Most expensive model per token ($5/$25 per M). Clean output, renamed variable, clean noobs formatting. Comment despite "no commentary."
google/gemini-3.1-pro
use "auto.dta", clear // Compute the average revenue by year collapse (mean) revenue, by(year) // Display the results list
524 0.6s 924 t/s 0.70¢
Fastest frontier model (924 t/s). Correct but added comments. Most tokens spent on reasoning. Used // comments (valid Stata).
x-ai/grok-4.20
use "auto.dta", clear tabstat revenue, by(year) statistics(mean)
2,226 16.3s 137 t/s 0.60¢
Used tabstat instead of collapse — displays averages without destroying the dataset. Valid alternative. 2,226 tokens mostly reasoning.
openai/gpt-5.5
clear all use "auto.dta", clear collapse (mean) avg_revenue=revenue, by(year) list year avg_revenue, noobs
148 2.6s 57 t/s 0.00¢
Renamed output variable. Added noobs. Cost shows 0 — BYOK (own OpenAI key).
google/gemini-3.5-flash
* Load the dataset use "auto.dta", clear * Compute the average revenue by year collapse (mean) revenue, by(year) * List the results list year revenue
954 2.5s 381 t/s 0.91¢
Correct but verbose. Added comments despite "no commentary" instruction. 954 tokens for 6 lines — most of it reasoning.
minimax/minimax-m3
clear all use "auto.dta" collapse (mean) avg_revenue=revenue, by(year) list
123 2.2s 57 t/s 0.03¢
Clean. Missing , clear option on use.
qwen/qwen3.7-plus
clear use "auto.dta", clear collapse (mean) avg_revenue = revenue, by(year) list, clean
929 16.7s 56 t/s 0.16¢
Correct. 929 tokens because of long reasoning chain. Redundant clear before use ..., clear.
inception/mercury-2
clear all set more off use "auto.dta", clear preserve collapse (mean) revenue, by(year) list year revenue save "avg_rev_by_year.dta", replace restore
289 0.4s 710 t/s 0.03¢
Fastest by far (diffusion LLM). Added preserve/restore and saved output. Overzealous but not wrong.
deepseek/deepseek-v4-flash
clear all set more off use auto.dta collapse (mean) revenue, by(year) list
284 4.3s 65 t/s 0.01¢
Cheapest response. Missing quotes around filename. set more off unnecessary in modern Stata.
mistralai/mistral-small-4
* Compute average revenue by year preserve collapse (mean) avg_revenue=revenue, by(year) save "avg_revenue_by_year.dta", replace restore display as result "Average revenue by year saved to 'avg_revenue_by_year.dta'" exit
64 0.4s 156 t/s 0.009¢
Fastest cheap model (0.4s, $0.15/M input). Used preserve/restore, saved output to .dta, added display confirmation. No use — assumes data in memory. Added exit. Comment despite "no commentary."
google/gemma-4-26b-a4b-it
use "auto.dta", clear statsby avg_revenue = r(mean), by(year): summarize revenue list
58 1.5s 38 t/s 0.005¢
Used statsby instead of collapse. Valid but unusual. Cheapest model at $0.06/M input.
moonshotai/kimi-k2.6
collapse (mean) revenue, by(year)
2,438 55.6s 44 t/s 1.01¢
One line of output after 2,400 tokens of reasoning. Slowest and most expensive. The reasoning was thorough but the output omits use.
z-ai/glm-5.1
use "auto.dta", clear collapse (mean) revenue, by(year) list
1,072 24.8s 43 t/s 0.50¢
Correct output. Long reasoning chain inflated token count and cost.
liquid/lfm-2.5-1.2b-instruct:free
Error: Invalid request body
0.9s free
1.2B parameter model refused the request. You get what you pay for.

Reverse task: explain this code

The same models, but now reading code instead of writing it. A comprehension task tests different capabilities.

Prompt

What does this Stata code do? Explain in at most 100 words.

* Benchmark: collapse on 10M rows * Tests: CSV read, filter, generate, collapse with by() use "test/data/large.csv", clear keep if year >= 2020 generate profit = revenue - cost collapse (mean) avg_revenue = revenue avg_profit = profit /// (sum) total_revenue = revenue (count) n = id, by(sector year) sort sector year
Output tokens
Wall-clock time
Tokens/sec
Total cost
anthropic/claude-opus-4.8
Loads data, filters to year ≥ 2020, creates profit variable, aggregates by sector and year (mean revenue/profit, sum revenue, count), then sorts. Note: use expects a .dta file; reading an actual .csv would require import delimited.
344 5.3s 65 t/s 1.06¢
The only model that caught the bug: use cannot read .csv files. This is the kind of domain knowledge that justifies a frontier model.
google/gemini-3.1-pro
Loads a large dataset. Filters to year ≥ 2020. Calculates profit. Aggregates by sector and year — computing average revenue/profit, total revenue, and count. Sorts the result.
803 3.1s 258 t/s 1.02¢
Correct, verbose. Did not catch the .csv bug. Most tokens spent on reasoning.
openai/gpt-5.5
Loads large.csv, keeps year ≥ 2020, creates profit as revenue − cost, aggregates by sector and year. Computes mean revenue/profit, total revenue, counts nonmissing id. Sorts result. Comments indicate it benchmarks these operations on 10M rows.
126 3.7s 34 t/s 0.00¢
Clean, concise, correct. Mentioned the benchmark context from the comments. Did not catch .csv bug. BYOK.
mistralai/mistral-small-4
Loads CSV with 10M rows. Keeps year ≥ 2020. Generates profit = revenue − cost. Collapses by sector and year: mean revenue/profit, sum revenue, count of observations. Sorts by sector and year. Output: aggregated dataset with summary stats per sector-year.
159 0.7s 228 t/s 0.01¢
Fast, cheap, correct. Mentioned "output is aggregated dataset" — good framing. 0.7s wall clock.
inception/mercury-2
Load data → filter year ≥ 2020 → create variable → collapse to one record per sector-year (mean, sum, count) → sort. Discards existing dataset.
248 0.3s 800 t/s 0.02¢
Fastest again: 800 t/s, 0.3s wall clock. Verbose formatted output for a short explanation.
deepseek/deepseek-v4-flash
Reads large.csv, filters to year ≥ 2020, creates profit as revenue − cost, collapses data by sector and year — computing mean revenue/profit, total revenue, and observation count per group. Sorts results.
89 2.0s 45 t/s 0.006¢
Shortest and cheapest. 89 tokens, 0.006¢. Perfectly adequate explanation.
xiaomi/mimo-v2.5
Benchmarks performance on 10M rows. Loads CSV, filters year ≥ 2020, computes profit. Aggregates by sector and year: mean revenue/profit, total revenue, count of observations. Sorts result.
332 3.6s 93 t/s 0.01¢
Clean numbered list. Xiaomi’s model at $0.14/M input — cheaper than most frontier models.
stepfun/step-3.7-flash
Loads large.csv, keeps observations with year ≥ 2020, creates profit as revenue minus cost. Collapses data by sector and year, computing mean revenue and profit, total revenue, and count of ids. Sorts by sector and year. Tests performance of filtering, variable generation, and grouped aggregation on ~10 million rows.
957 6.8s 140 t/s 0.12¢
Most reasoning tokens (957) for a 50-word output. Good explanation though — mentioned the benchmark context.
qwen/qwen3.7-plus
Processes large CSV in five steps: loads data, filters year ≥ 2020, generates profit variable, aggregates by sector and year via collapse (mean, sum, count), sorts result. (68 words)
1,283 23.2s 55 t/s 0.22¢
1,283 tokens for a 68-word answer. Included "(Word count: 68 words)" in its output. Spent 1,200 tokens reasoning about word count.
moonshotai/kimi-k2.6
Benchmarks processing a 10-million-row CSV. Loads data, keeps rows from 2020 onward, calculates profit as revenue minus cost, then aggregates by sector and year — computing average revenue and profit, total revenue, and row count. Sorts by sector and year.
891 21.1s 42 t/s 0.33¢
45-word answer after 891 tokens of reasoning. Slow (21s) but correct.
z-ai/glm-5.1
Benchmarks performance on 10M rows. Reads CSV, filters year ≥ 2020, generates profit variable, collapses by sector and year (mean revenue/profit, total revenue, ID counts), sorts results.
755 12.2s 62 t/s 0.37¢
Correct. Long reasoning chain. Used LaTeX math notation in output ($\ge$ 2020).
minimax/minimax-m3
Benchmarks data processing on 10M-row CSV. use loads data. keep filters to recent years. generate creates derived column. collapse aggregates per sector-year: mean revenue/profit, total revenue, count of IDs. sort orders result.
282 3.8s 75 t/s 0.04¢
Good structure. Referenced each Stata command by name. Clean and cheap.
x-ai/grok-code-fast-1
Error: Model deprecated. xAI recommends Grok 4.3.
0.4s
Deprecated model. Models come and go — another reason not to lock in.

What this shows

  1. Every model got both tasks right — writing and reading Stata code. From Opus 4.8 to DeepSeek V4 Flash, the core competence is commoditized.
  2. Only Opus caught the bug — the prompt uses use to load a .csv file, which is wrong in Stata (should be import delimited). Only Opus 4.8 flagged this. Domain expertise still differentiates frontier models.
  3. Cost varies 200x for the same answer — DeepSeek V4 Flash cost 0.006¢ for the explanation task. Opus cost 1.06¢. For routine comprehension, the cheap model is indistinguishable.
  4. Reasoning tokens often waste money — Qwen spent 1,200 tokens reasoning about word count to produce a 68-word answer. Step 3.7 Flash used 957 tokens for 50 words. Sonnet 4.5 spent 56 tokens and got the writing task right. More thinking is not better thinking.