| Model | Write code (time, cost) | Explain code (time, cost) |
|---|---|---|
| gemini-3.1-pro | 0.6s, 0.70¢ | 3.1s, 1.02¢ |
| claude-opus-4.8 | 1.0s, 0.43¢ | 5.3s, 1.06¢ |
| kimi-k2.6 | 55.6s, 1.01¢ | 21.1s, 0.33¢ |
| glm-5.1 | 24.8s, 0.50¢ | 12.2s, 0.37¢ |
| qwen3.7-plus | 16.7s, 0.16¢ | 23.2s, 0.22¢ |
| minimax-m3 | 2.2s, 0.03¢ | 3.8s, 0.04¢ |
| mercury-2 | 0.4s, 0.03¢ | 0.3s, 0.02¢ |
| mistral-small-4 | 0.4s, 0.009¢ | 0.7s, 0.01¢ |
| deepseek-v4-flash | 4.3s, 0.01¢ | 2.0s, 0.006¢ |
| gpt-5.5 | 2.6s, BYOK | 3.7s, BYOK |
Task 1: write Stata code
Prompt
I have the following .dta file.
Variable Storage Display Value
name type format label Variable label
id long %12.0g Firm identifier
name str10 %10s Firm name
revenue double %10.0g Annual revenue
year int %8.0g Fiscal year
Write a Stata .do script that computes the average revenue by year. Just return the .do file, no commentary.
Reported for each model: output tokens · wall-clock time · tokens/sec · total cost
anthropic/claude-sonnet-4.5
use "auto.dta", clear
collapse (mean) revenue, by(year)
list
56 tokens · 0.7s · 84 t/s · 0.19¢
Concise. Correct. No unnecessary commands.
anthropic/claude-opus-4.8
* Compute average revenue by year
use "auto.dta", clear
collapse (mean) avg_revenue = revenue, by(year)
list year avg_revenue, clean noobs
67 tokens · 1.0s · 70 t/s · 0.43¢
Most expensive model per token ($5/$25 per M). Clean output: renamed the variable and used the clean and noobs options on list. Added a comment despite the "no commentary" instruction.
google/gemini-3.1-pro
use "auto.dta", clear
// Compute the average revenue by year
collapse (mean) revenue, by(year)
// Display the results
list
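524 tokens · 0.6s · 924 t/s · 0.70¢
Highest throughput among the frontier models (924 t/s). Correct, but added comments despite the "no commentary" instruction; the // comment syntax is valid in Stata do-files. Most of the 524 tokens went to reasoning.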
x-ai/grok-4.20
use "auto.dta", clear
tabstat revenue, by(year) statistics(mean)
2,226 tokens · 16.3s · 137 t/s · 0.60¢
Used tabstat instead of collapse, which displays the averages without destroying the dataset in memory. A valid alternative. Most of the 2,226 tokens went to reasoning.
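The difference between the two approaches can be sketched on a toy dataset. The four rows below are made up for illustration; only the variable names revenue and year come from the prompt.

```stata
* Toy data (hypothetical values; only the variable names come from the prompt)
clear
input int year double revenue
2020 100
2020 200
2021 300
2021 500
end

* tabstat displays the means and leaves the data in memory untouched
tabstat revenue, by(year) statistics(mean)
* The original four observations are still there
display _N

* collapse replaces the data in memory with one row per year
collapse (mean) avg_revenue = revenue, by(year)
list, clean noobs
* Only two observations remain, one per year
display _N
```

If the original rows are needed after a collapse, wrapping it in preserve and restore (as two of the models below did) is the usual way to get them back.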
openai/gpt-5.5
clear all
use "auto.dta", clear
collapse (mean) avg_revenue=revenue, by(year)
list year avg_revenue, noobs
148 tokens · 2.6s · 57 t/s · 0.00¢
Renamed the output variable and added noobs. Cost shows 0 because the request used BYOK (bring your own key; here, an OpenAI key).
google/gemini-3.5-flash
* Load the dataset
use "auto.dta", clear
* Compute the average revenue by year
collapse (mean) revenue, by(year)
* List the results
list year revenue
954 tokens · 2.5s · 381 t/s · 0.91¢
Correct but verbose. Added comments despite "no commentary" instruction. 954 tokens for 6 lines — most of it reasoning.
minimax/minimax-m3
clear all
use "auto.dta"
collapse (mean) avg_revenue=revenue, by(year)
list
123 tokens · 2.2s · 57 t/s · 0.03¢
Clean. Omits the , clear option on use, which is harmless here because clear all has already emptied memory.
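When the , clear option on use actually matters can be shown with a small sketch. It assumes a do-file run from start to finish; the data values and the tempfile name saved_copy are hypothetical.

```stata
* Hypothetical toy data, saved to a temporary file
clear
input int year double revenue
2020 100
2021 300
end
tempfile saved_copy
save `saved_copy'

* Modify the data in memory so that it counts as unsaved
replace revenue = revenue * 2

* Without ", clear", use refuses to discard unsaved changes;
* capture stores the nonzero return code in _rc instead of stopping
capture use `saved_copy'
display _rc

* With ", clear", the unsaved changes are dropped and the file loads
use `saved_copy', clear
list, clean noobs
```

In a script that starts with clear all, or that never modifies data before use, the option changes nothing, which is why its absence is a style point rather than a bug.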
qwen/qwen3.7-plus
clear
use "auto.dta", clear
collapse (mean) avg_revenue = revenue, by(year)
list, clean
929 tokens · 16.7s · 56 t/s · 0.16¢
Correct. 929 tokens because of a long reasoning chain. The clear before use ..., clear is redundant.
inception/mercury-2
clear all
set more off
use "auto.dta", clear
preserve
collapse (mean) revenue, by(year)
list year revenue
save "avg_rev_by_year.dta", replace
restore
289 tokens · 0.4s · 710 t/s · 0.03¢
Fastest wall-clock time (0.4s, tied with mistral-small-4), from a diffusion LLM. Added preserve/restore and saved the output. Overzealous but not wrong.
deepseek/deepseek-v4-flash
clear all
set more off
use auto.dta
collapse (mean) revenue, by(year)
list
284 tokens · 4.3s · 65 t/s · 0.01¢
One of the cheapest responses. No quotes around the filename, which Stata accepts when the path contains no spaces. set more off is unnecessary in modern Stata.
mistralai/mistral-small-4
* Compute average revenue by year
preserve
collapse (mean) avg_revenue=revenue, by(year)
save "avg_revenue_by_year.dta", replace
restore
display as result "Average revenue by year saved to 'avg_revenue_by_year.dta'"
exit
64 tokens · 0.4s · 156 t/s · 0.009¢
Fastest cheap model (0.4s, $0.15/M input). Used preserve/restore, saved the output to a .dta file, and added a display confirmation. No use command, so it assumes the data is already in memory. Added exit. Added a comment despite the "no commentary" instruction.
google/gemma-4-26b-a4b-it
use "auto.dta", clear
statsby avg_revenue = r(mean), by(year): summarize revenue
list
58 tokens · 1.5s · 38 t/s · 0.005¢
Used statsby instead of collapse. Valid but unusual. Cheapest model at $0.06/M input.
moonshotai/kimi-k2.6
collapse (mean) revenue, by(year)
2,438 tokens · 55.6s · 44 t/s · 1.01¢
One line of output after roughly 2,400 tokens of reasoning. Slowest and most expensive. The reasoning was thorough, but the output omits use.
z-ai/glm-5.1
use "auto.dta", clear
collapse (mean) revenue, by(year)
list
1,072 tokens · 24.8s · 43 t/s · 0.50¢
Correct output. Long reasoning chain inflated token count and cost.
liquid/lfm-2.5-1.2b-instruct:free
Error: Invalid request body
No tokens reported · 0.9s · free
The 1.2B-parameter model returned an API error instead of an answer. You get what you pay for.
Reverse task: explain this code
Largely the same models, now reading code instead of writing it. A comprehension task tests different capabilities.
Prompt
What does this Stata code do? Explain in at most 100 words.
* Benchmark: collapse on 10M rows
* Tests: CSV read, filter, generate, collapse with by()
use "test/data/large.csv", clear
keep if year >= 2020
generate profit = revenue - cost
collapse (mean) avg_revenue = revenue avg_profit = profit ///
(sum) total_revenue = revenue (count) n = id, by(sector year)
sort sector year
Reported for each model: output tokens · wall-clock time · tokens/sec · total cost
anthropic/claude-opus-4.8
Loads data, filters to year ≥ 2020, creates profit variable, aggregates by sector and year (mean revenue/profit, sum revenue, count), then sorts.
Note: use expects a .dta file; reading an actual .csv would require import delimited.
344 tokens · 5.3s · 65 t/s · 1.06¢
The only model that caught the bug: use cannot read .csv files. This is the kind of domain knowledge that justifies a frontier model.
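A corrected version of the snippet, along the lines Opus suggests, might look like the sketch below. To stay self-contained it first writes a tiny CSV to a temporary file; the data values and the tempfile name csv_path are hypothetical, while the variable names and processing steps come from the prompt.

```stata
* Build a tiny hypothetical CSV so the example runs on its own
clear
input long id str8 sector int year double revenue double cost
1 "retail" 2019 100 60
2 "retail" 2020 200 120
3 "retail" 2020 400 300
4 "tech"   2021 500 200
end
tempfile csv_path
export delimited using `csv_path', replace

* import delimited (not use) reads a delimited text file
import delimited using `csv_path', clear
keep if year >= 2020
generate profit = revenue - cost
collapse (mean) avg_revenue = revenue avg_profit = profit ///
    (sum) total_revenue = revenue (count) n = id, by(sector year)
sort sector year
list, clean noobs
```

On the real benchmark file, the only change to the original code would be the first line: import delimited using "test/data/large.csv", clear.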
google/gemini-3.1-pro
Loads a large dataset. Filters to year ≥ 2020. Calculates profit. Aggregates by sector and year — computing average revenue/profit, total revenue, and count. Sorts the result.
803 tokens · 3.1s · 258 t/s · 1.02¢
Correct, verbose. Did not catch the .csv bug. Most tokens spent on reasoning.
openai/gpt-5.5
Loads large.csv, keeps year ≥ 2020, creates profit as revenue − cost, aggregates by sector and year. Computes mean revenue/profit, total revenue, counts nonmissing id. Sorts result. Comments indicate it benchmarks these operations on 10M rows.
126 tokens · 3.7s · 34 t/s · 0.00¢
Clean, concise, correct. Mentioned the benchmark context from the comments. Did not catch .csv bug. BYOK.
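The "counts nonmissing id" wording is the precise one: collapse's (count) statistic counts nonmissing values of the named variable, which matches the number of rows only when id is never missing. A minimal sketch on made-up data, where the helper variable one is introduced purely for illustration:

```stata
* Hypothetical data: three rows for 2020, one of them with a missing id
clear
input long id int year
1 2020
. 2020
3 2020
end

* Helper variable so that rows can be counted regardless of missing ids
generate byte one = 1

* (count) skips the missing id; summing the helper counts every row
collapse (count) n_id = id (sum) n_rows = one, by(year)
list, clean noobs
```

Here n_id is 2 while n_rows is 3, so "count of observations" and "row count" are accurate descriptions of n only if the id column has no missing values.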
mistralai/mistral-small-4
Loads CSV with 10M rows. Keeps year ≥ 2020. Generates profit = revenue − cost. Collapses by sector and year: mean revenue/profit, sum revenue, count of observations. Sorts by sector and year. Output: aggregated dataset with summary stats per sector-year.
159 tokens · 0.7s · 228 t/s · 0.01¢
Fast, cheap, correct. Mentioned "output is aggregated dataset" — good framing. 0.7s wall clock.
inception/mercury-2
Load data → filter year ≥ 2020 → create variable → collapse to one record per sector-year (mean, sum, count) → sort. Discards existing dataset.
248 tokens · 0.3s · 800 t/s · 0.02¢
Fastest again: 800 t/s, 0.3s wall clock. Verbose formatted output for a short explanation.
deepseek/deepseek-v4-flash
Reads large.csv, filters to year ≥ 2020, creates profit as revenue − cost, collapses data by sector and year — computing mean revenue/profit, total revenue, and observation count per group. Sorts results.
89 tokens · 2.0s · 45 t/s · 0.006¢
Shortest and cheapest. 89 tokens, 0.006¢. Perfectly adequate explanation.
xiaomi/mimo-v2.5
Benchmarks performance on 10M rows. Loads CSV, filters year ≥ 2020, computes profit. Aggregates by sector and year: mean revenue/profit, total revenue, count of observations. Sorts result.
332 tokens · 3.6s · 93 t/s · 0.01¢
Clean numbered list. Xiaomi’s model at $0.14/M input — cheaper than most frontier models.
stepfun/step-3.7-flash
Loads large.csv, keeps observations with year ≥ 2020, creates profit as revenue minus cost. Collapses data by sector and year, computing mean revenue and profit, total revenue, and count of ids. Sorts by sector and year. Tests performance of filtering, variable generation, and grouped aggregation on ~10 million rows.
957 tokens · 6.8s · 140 t/s · 0.12¢
957 tokens for a roughly 50-word output, nearly all of it reasoning. Good explanation, though: it mentioned the benchmark context.
qwen/qwen3.7-plus
Processes large CSV in five steps: loads data, filters year ≥ 2020, generates profit variable, aggregates by sector and year via collapse (mean, sum, count), sorts result. (68 words)
1,283 tokens · 23.2s · 55 t/s · 0.22¢
1,283 tokens for a 68-word answer. Included "(Word count: 68 words)" in its output. Spent 1,200 tokens reasoning about word count.
moonshotai/kimi-k2.6
Benchmarks processing a 10-million-row CSV. Loads data, keeps rows from 2020 onward, calculates profit as revenue minus cost, then aggregates by sector and year — computing average revenue and profit, total revenue, and row count. Sorts by sector and year.
891 tokens · 21.1s · 42 t/s · 0.33¢
45-word answer after 891 tokens of reasoning. Slow (21s) but correct.
z-ai/glm-5.1
Benchmarks performance on 10M rows. Reads CSV, filters year ≥ 2020, generates profit variable, collapses by sector and year (mean revenue/profit, total revenue, ID counts), sorts results.
755 tokens · 12.2s · 62 t/s · 0.37¢
Correct. Long reasoning chain. Used LaTeX math notation in output ($\ge$ 2020).
minimax/minimax-m3
Benchmarks data processing on 10M-row CSV. use loads data. keep filters to recent years. generate creates derived column. collapse aggregates per sector-year: mean revenue/profit, total revenue, count of IDs. sort orders result.
282 tokens · 3.8s · 75 t/s · 0.04¢
Good structure. Referenced each Stata command by name. Clean and cheap.
x-ai/grok-code-fast-1
Error: Model deprecated. xAI recommends Grok 4.3.
No tokens reported · 0.4s · no cost reported
Deprecated model. Models come and go — another reason not to lock in.
What this shows
- Every model that returned an answer got both tasks right: writing and reading Stata code. From Opus 4.8 to DeepSeek V4 Flash, the core competence is commoditized.
- Only Opus caught the bug: the prompt uses use to load a .csv file, which is wrong in Stata (it should be import delimited). Only Opus 4.8 flagged this. Domain expertise still differentiates frontier models.
- Cost varies roughly 180x for the same answer: DeepSeek V4 Flash cost 0.006¢ for the explanation task, and Opus cost 1.06¢. For routine comprehension, the cheap model's answer is indistinguishable.
- Reasoning tokens often waste money: Qwen spent 1,200 tokens reasoning about word count to produce a 68-word answer. Step 3.7 Flash used 957 tokens for 50 words. Sonnet 4.5 spent 56 tokens and got the writing task right. More thinking is not better thinking.
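The cost gap and the throughput figures can be recomputed from the numbers reported above. A quick check in Stata, using only values taken from the results (the scalar names are hypothetical):

```stata
* Explanation-task costs in cents, copied from the results above
scalar opus_cost = 1.06
scalar deepseek_cost = 0.006

* Ratio of the most expensive to the cheapest explanation
scalar cost_ratio = opus_cost / deepseek_cost
display "Opus / DeepSeek cost ratio: " %5.1f cost_ratio

* Qwen on the explanation task: 1,283 output tokens in 23.2 seconds
scalar qwen_tps = 1283 / 23.2
display "Qwen tokens/sec: " %4.1f qwen_tps
```

The ratio works out to about 177, and the Qwen throughput to about 55 tokens per second, in line with the reported 55 t/s.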