Kezdi.jl Documentation
Kezdi.jl is a Julia package that provides a Stata-like interface for data manipulation and analysis. It is designed to be easy to use for Stata users who are transitioning to Julia.[stata]
It imports and reexports CSV, DataFrames, FixedEffectModels, FreqTables, ReadStatTables, Statistics, and StatsBase. These packages are not covered in this documentation, but you can find more information by following the links.
Getting started
Kezdi.jl is currently in beta. It has more than 400 unit tests and high code coverage, but it is not guaranteed to be bug-free. If you encounter any issues, please report them as a GitHub issue.
If you would like to receive updates on the package, please star the repository on GitHub and sign up for email notifications here.
Installation
To install the package, run the following command in Julia's REPL:
using Pkg; Pkg.add("Kezdi")
Every Kezdi.jl command is a macro that begins with @. These commands operate on a global DataFrame that is set using the setdf function. Alternatively, commands can be executed within a @with block that sets the DataFrame for the duration of the block.
Example
julia> using Kezdi

julia> using RDatasets

julia> df = dataset("datasets", "mtcars")
32×12 DataFrame
 Row │ Model              MPG      Cyl    Disp     HP     DRat     WT       QS ⋯
     │ String31           Float64  Int64  Float64  Int64  Float64  Float64  Fl ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Mazda RX4             21.0      6    160.0    110     3.9     2.62      ⋯
   2 │ Mazda RX4 Wag         21.0      6    160.0    110     3.9     2.875
   3 │ Datsun 710            22.8      4    108.0     93     3.85    2.32
   4 │ Hornet 4 Drive        21.4      6    258.0    110     3.08    3.215
   5 │ Hornet Sportabout     18.7      8    360.0    175     3.15    3.44      ⋯
   6 │ Valiant               18.1      6    225.0    105     2.76    3.46
   7 │ Duster 360            14.3      8    360.0    245     3.21    3.57
   8 │ Merc 240D             24.4      4    146.7     62     3.69    3.19
  ⋮  │         ⋮             ⋮       ⋮       ⋮       ⋮       ⋮        ⋮        ⋱
  26 │ Fiat X1-9             27.3      4     79.0     66     4.08    1.935     ⋯
  27 │ Porsche 914-2         26.0      4    120.3     91     4.43    2.14
  28 │ Lotus Europa          30.4      4     95.1    113     3.77    1.513
  29 │ Ford Pantera L        15.8      8    351.0    264     4.22    3.17
  30 │ Ferrari Dino          19.7      6    145.0    175     3.62    2.77      ⋯
  31 │ Maserati Bora         15.0      8    301.0    335     3.54    3.57
  32 │ Volvo 142E            21.4      4    121.0    109     4.11    2.78
                                                   5 columns and 17 rows omitted

julia> setdf(df)
32×12 DataFrame
 ⋮

julia> @rename HP Horsepower
Kezdi.jl> @rename HP Horsepower
32×12 DataFrame
 ⋮

julia> @rename Disp Displacement
Kezdi.jl> @rename Disp Displacement
32×12 DataFrame
 ⋮

julia> @rename WT Weight
Kezdi.jl> @rename WT Weight
32×12 DataFrame
 ⋮

julia> @rename Cyl Cylinders
Kezdi.jl> @rename Cyl Cylinders
32×12 DataFrame
 ⋮

julia> @tabulate Gear
Kezdi.jl> @tabulate Gear
3-element Named Vector{Int64}
Gear  │
──────┼───
3     │ 15
4     │ 12
5     │  5

julia> @keep @if Gear == 4
Kezdi.jl> @keep @if Gear == 4
12×12 DataFrame
 ⋮

julia> @keep MPG Horsepower Weight Displacement Cylinders
Kezdi.jl> @keep MPG Horsepower Weight Displacement Cylinders
12×5 DataFrame
 Row │ MPG      Horsepower  Weight   Displacement  Cylinders
     │ Float64  Int64       Float64  Float64       Int64
─────┼───────────────────────────────────────────────────────
   1 │    21.0         110    2.62          160.0          6
   2 │    21.0         110    2.875         160.0          6
   3 │    22.8          93    2.32          108.0          4
   4 │    24.4          62    3.19          146.7          4
   5 │    22.8          95    3.15          140.8          4
   6 │    19.2         123    3.44          167.6          6
   7 │    17.8         123    3.44          167.6          6
   8 │    32.4          66    2.2            78.7          4
   9 │    30.4          52    1.615          75.7          4
  10 │    33.9          65    1.835          71.1          4
  11 │    27.3          66    1.935          79.0          4
  12 │    21.4         109    2.78          121.0          4

julia> @summarize MPG
Kezdi.jl> @summarize MPG
Summarize MPG:
N = 12
sum_w = 12.0
mean = 24.53333333333333
Var = 27.844242424242417
sd = 5.276764389684498
skewness = 0.6109081273366428
kurtosis = 2.054454265238661
sum = 294.4
min = 17.8
max = 33.9
p1 = 17.8
p5 = 17.94
p10 = 18.78
p25 = 21.0
p50 = 22.8
p75 = 28.85
p90 = 32.849999999999994
p95 = 33.75
p99 = 33.9

julia> @regress log(MPG) log(Horsepower) log(Weight) log(Displacement) fe(Cylinders), robust
Kezdi.jl> @regress log(MPG) log(Horsepower) log(Weight) log(Displacement) fe(Cylinders), robust
                                  FixedEffectModel
====================================================================================
Number of obs:                   12   Converged:                    true
dof (model):                      3   dof (residuals):                 7
R²:                           0.919   R² adjusted:                 0.872
F-statistic:                16.5436   P-value:                     0.001
R² within:                    0.837   Iterations:                      1
====================================================================================
                    Estimate  Std. Error     t-stat  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────────────────
log(Horsepower)    -0.336986   0.0557811   -6.04122    0.0005  -0.468887  -0.205084
log(Weight)          0.17324    0.261239   0.663148    0.5285  -0.444491   0.790971
log(Displacement)  -0.491497    0.239809   -2.04954    0.0796   -1.05855  0.0755604
====================================================================================
Alternatively, you can use the @with block to avoid writing to a global DataFrame:
julia> renamed_df = @with df begin
           @rename HP Horsepower
           @rename Disp Displacement
           @rename WT Weight
           @rename Cyl Cylinders
       end
Kezdi.jl> @rename HP Horsepower
Kezdi.jl> @rename Disp Displacement
Kezdi.jl> @rename WT Weight
Kezdi.jl> @rename Cyl Cylinders
32×12 DataFrame
 ⋮

julia> @with renamed_df begin
           @tabulate Gear
           @keep @if Gear == 4
           @keep MPG Horsepower Weight Displacement Cylinders
           @summarize MPG
           @regress log(MPG) log(Horsepower) log(Weight) log(Displacement) fe(Cylinders), robust
       end
Kezdi.jl> @tabulate Gear
Kezdi.jl> @keep @if Gear == 4
Kezdi.jl> @keep MPG Horsepower Weight Displacement Cylinders
Kezdi.jl> @summarize MPG
Kezdi.jl> @regress log(MPG) log(Horsepower) log(Weight) log(Displacement) fe(Cylinders), robust
                                  FixedEffectModel
====================================================================================
Number of obs:                   12   Converged:                    true
dof (model):                      3   dof (residuals):                 7
R²:                           0.919   R² adjusted:                 0.872
F-statistic:                16.5436   P-value:                     0.001
R² within:                    0.837   Iterations:                      1
====================================================================================
                    Estimate  Std. Error     t-stat  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────────────────
log(Horsepower)    -0.336986   0.0557811   -6.04122    0.0005  -0.468887  -0.205084
log(Weight)          0.17324    0.261239   0.663148    0.5285  -0.444491   0.790971
log(Displacement)  -0.491497    0.239809   -2.04954    0.0796   -1.05855  0.0755604
====================================================================================
Benefits of using Kezdi.jl
Free and open-source
Speed
| Command | Stata | Julia 2nd run | Speedup |
|---|---|---|---|
| @generate | 230ms | 46ms | 5x |
| @replace | 232ms | 32ms | 7x |
| @egen | 5.00s | 0.37s | 13x |
| @collapse | 0.94s | 0.28s | 3x |
| @tabulate | 2.19s | 0.09s | 24x |
| @summarize | 10.56s | 0.35s | 30x |
| @regress | 0.85s | 0.14s | 6x |
See the benchmarking code for Stata and Kezdi.jl.
Use any Julia function
@generate logHP = log(Horsepower)
Easily extendable with user-defined functions
The function can operate on individual elements,
get_make(text) = split(text, " ")[1]
@generate Make = get_make(Model)
or on the entire column:
function geometric_mean(x::Vector)
    n = length(x)
    return exp(sum(log.(x)) / n)
end
@collapse geom_MPG = geometric_mean(MPG), by(Cylinders)
Commands
Setting and inspecting the global DataFrame
Kezdi.setdf — Function
setdf(df::Union{AbstractDataFrame, Nothing})
Set the global data frame.
Kezdi.@use — Macro
@use "filename.dta", [clear]
Read the data from the file filename.dta and set it as the global data frame. If there is already a global data frame, @use will throw an error unless the clear option is provided.
Kezdi.@save — Macro
@save "filename.dta", [replace]
Save the global data frame to the file filename.dta. If the file already exists, the replace option must be provided.
Kezdi.getdf — Function
getdf() -> AbstractDataFrame
Return the global data frame.
Kezdi.@names — Macro
@names
Display the names of the variables in the data frame.
Kezdi.@list — Macro
@list [y1 y2...] [@if condition]
Display the entire data frame or the rows for which the condition is true. If variable names are provided, only the variables in the list are displayed.
Kezdi.@head — Macro
@head [n]
Display the first n rows of the data frame. By default, n is 5.
Kezdi.@tail — Macro
@tail [n]
Display the last n rows of the data frame. By default, n is 5.
Kezdi.@clear — Macro
@clear
Clear the global data frame.
Kezdi.@describe — Macro
@describe [y1] [y2]...
Show the names and data types of the columns of the data frame. If no variable names are given, all columns are shown.
Filtering columns and rows
Kezdi.@keep — Macro
@keep y1 y2 ... [@if condition]
Keep only the variables y1, y2, etc. in df. If condition is provided, only the rows for which the condition is true are kept.
Kezdi.@drop — Macro
@drop y1 y2 ...
or
@drop [@if condition]
Drop the variables y1, y2, etc. from df. If condition is provided, the rows for which the condition is true are dropped.
Modifying the data
Kezdi.@rename — Macro
@rename oldname newname
Rename the variable oldname to newname in the data frame.
Kezdi.@generate — Macro
@generate y = expr [@if condition]
Create a new variable y in df by evaluating expr. If condition is provided, the operation is executed only on rows for which the condition is true. When the condition is false, the variable will be missing.
Kezdi.@replace — Macro
@replace y = expr [@if condition]
Replace the values of y in df with the result of evaluating expr. If condition is provided, the operation is executed only on rows for which the condition is true. When the condition is false, the variable will be left unchanged.
Kezdi.@mvencode — Macro
@mvencode y1 y2 [_all] ... [@if condition], [mv(value)]
Encode missing values in the variables y1, y2, etc. in the data frame. If condition is provided, the operation is executed only on rows for which the condition is true. If mv is provided, missing values are replaced with value. By default, value is missing, which leaves the data frame unchanged. Using _all encodes all variables of the data frame.
Kezdi.@egen — Macro
@egen y1 = expr1 y2 = expr2 ... [@if condition], [by(group1, group2, ...)]
Generate new variables in df by evaluating expressions expr1, expr2, etc. If condition is provided, the operation is executed only on rows for which the condition is true. When the condition is false, the variables will be missing. If by is provided, the operation is executed by group.
Kezdi.@collapse — Macro
@collapse y1 = expr1 y2 = expr2 ... [@if condition], [by(group1, group2, ...)]
Collapse df by evaluating expressions expr1, expr2, etc. If condition is provided, the operation is executed only on rows for which the condition is true. If by is provided, the operation is executed by group.
Kezdi.@sort — Macro
@sort y1 y2 ... , [desc]
Sort the data frame by the variables y1, y2, etc. By default, the variables are sorted in ascending order. If desc is provided, the variables are sorted in descending order.
Kezdi.@order — Macro
@order y1 y2 ... , [desc] [last] [after=var] [before=var] [alphabetical]
Reorder the variables y1, y2, etc. in the data frame. By default, the variables are ordered in the order they are listed. If desc is provided, the variables are ordered in descending order. If last is provided, the variables are moved to the end of the data frame. If after is provided, the variables are moved after the variable var. If before is provided, the variables are moved before the variable var. If alphabetical is provided, the variables are ordered alphabetically.
Kezdi.@reshape — Macro
@reshape long y1 y2 ... i(varlist) j(var)
@reshape wide y1 y2 ... i(varlist) j(var)
Reshape the data frame from wide to long or from long to wide format. The variables y1, y2, etc. are the variables to be reshaped. The options i(varlist) and j(var) name the variables that define the row and column indices in the reshaped data frame.
The option i() may include multiple variables, like i(var1, var2, var3). The option j() must include only one variable.
Kezdi.@append — Macro
@append "filename.dta" / @append df
Append the data from the file filename.dta or from the DataFrame df to the global data frame. Columns that are not common to both are filled with missing values.
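The @egen and @collapse entries above share the by option but differ in the shape of their result. The following is a minimal sketch in plain Julia of the two shapes, assuming the usual Stata semantics (egen keeps one value per original row, collapse keeps one row per group). The vectors and variable names are made up for illustration, and this is not how Kezdi.jl implements the commands.

```julia
using Statistics

# Illustrative data: a grouping variable and a value column.
cyl = [4, 6, 4, 8, 6]
hp  = [93.0, 110.0, 62.0, 175.0, 105.0]

# Mean of hp within each group, keyed by the group value.
group_means = Dict(g => mean(hp[cyl .== g]) for g in unique(cyl))

# "egen"-style result: the group mean repeated on every original row.
egen_style = [group_means[g] for g in cyl]

# "collapse"-style result: one entry per group, sorted by group value.
collapse_style = sort(collect(group_means); by = first)
```

Here egen_style has five entries, one per input row, while collapse_style has three, one per distinct value of cyl.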
Summarizing and analyzing data
Kezdi.@count — Macro
@count [@if condition]
Count the number of rows for which the condition is true. If condition is not provided, the total number of rows is counted.
Kezdi.@tabulate — Macro
@tabulate y1 y2 ... [@if condition]
Create a frequency table for the variables y1, y2, etc. in df. If condition is provided, the operation is executed only on rows for which the condition is true.
Kezdi.@summarize — Macro
@summarize y [@if condition]
Summarize the variable y in df. If condition is provided, the operation is executed only on rows for which the condition is true.
Kezdi.@regress — Macro
@regress y x1 x2 ... [@if condition], [robust] [cluster(var1, var2, ...)]
Estimate a regression model in df with dependent variable y and independent variables x1, x2, etc. If condition is provided, the operation is executed only on rows for which the condition is true. If robust is provided, robust standard errors are calculated. If cluster is provided, clustered standard errors are calculated.
The regression is limited to rows in which every variable has a valid value. Rows containing missing values, infinity, or NaN are automatically excluded.
Use on another DataFrame
Kezdi.With.@with — Macro
@with df begin
    # do something with df
end
The @with macro is a convenience macro that allows you to set the current data frame and perform operations on it in a single block. The first argument is the data frame to set as the current data frame, and the second argument is a block of code to execute. The data frame is set as the current data frame for the duration of the block, and then restored to its previous value after the block is executed.
The macro returns the value of the last expression in the block.
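The set-then-restore behavior described for @with can be pictured with a small sketch in plain Julia. Everything here is hypothetical: CURRENT stands in for the global data frame and with_current is an illustrative helper, not part of Kezdi.jl, whose actual macro is implemented differently.

```julia
# Hypothetical stand-in for the package's global data frame.
const CURRENT = Ref{Any}(nothing)

# Set CURRENT to `value`, run `f`, then restore the previous value,
# even if `f` throws. Returns whatever `f` returns.
function with_current(f, value)
    previous = CURRENT[]
    CURRENT[] = value
    try
        return f()
    finally
        CURRENT[] = previous
    end
end

CURRENT[] = "outer"
result = with_current("inner") do
    uppercase(CURRENT[])   # sees the temporary value
end
```

After the block, result holds the value of the last expression in the block and CURRENT[] is back to "outer".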
Kezdi.With.@with! — Macro
@with! df begin
    # do something with df
end
The @with! macro is a convenience macro that allows you to set the current data frame and perform operations on it in a single block. The first argument is the data frame to set as the current data frame, and the second argument is a block of code to execute. The data frame is set as the current data frame for the duration of the block, and then restored to its previous value after the block is executed.
The macro does not have a return value; it overwrites the data frame directly.
Differences to standard Julia and DataFrames syntax
To maximize convenience for Stata users, Kezdi.jl has a number of differences to standard Julia and DataFrames syntax.
Everything is a macro
While there are a few convenience functions, most Kezdi.jl commands are macros that begin with @.
@tabulate Gear
Comma is used for options
Due to this non-standard syntax, Kezdi.jl uses the comma to separate options.
@regress log(MPG) log(Horsepower), robust
Here log(MPG) and log(Horsepower) are the dependent and independent variables, respectively, and robust is an option. Options may also have arguments, like
@regress log(MPG) log(Horsepower), cluster(Cylinders)
Automatic variable name substitution
Column names of the data frame can be used directly in the commands without the need to prefix them with the data frame name or using a Symbol.
@generate logHP = log(Horsepower)
Other data manipulation packages in Julia require column names to be passed as symbols or strings. Kezdi.jl does not require this, and it will not work if you try to use symbols or strings.
Julia reserved words, like begin, export, function and standard types like String, Int, Float64, etc., cannot be used as variable names in Kezdi.jl. If you have a column with a reserved word, rename it before passing it to Kezdi.jl.
If you want to avoid variable name substitution, you currently have two workarounds. One is to refer to the fully qualified name of the variable, including the module. The other is to define a constant function.
df = DataFrame(x = 1:2, y = 3:4)
x = 5
y() = 6
@with df begin
    @generate x1 = x
    @generate x2 = Main.x
    @generate y1 = y
    @generate y2 = y()
end
results in
2×6 DataFrame
 Row │ x      y      x1     x2     y1     y2
     │ Int64  Int64  Int64  Int64  Int64  Int64
─────┼──────────────────────────────────────────
   1 │     1      3      1      5      3      6
   2 │     2      4      2      5      4      6
Automatic vectorization
All functions are automatically vectorized, so there is no need to use the . operator to broadcast functions over elements of a column.
@generate logHP = log(Horsepower)
If you want to turn off automatic vectorization, use the ~ symbol:
@generate logHP = ~log(Horsepower)
The exception is when the function operates on Vectors, in which case Kezdi.jl understands you want to apply the function to the entire column.
@collapse mean_HP = mean(Horsepower), by(Cylinders)
If you need to apply such a function to individual elements of a column, vectorize it by adding . after the function name:
@generate words = split(Model, " ")
@generate n_words = length.(words)
Here, words becomes a vector of vectors, where each element is a vector of words in the corresponding Model string. The broadcast call length. operates on each cell in words, counting the number of words in each Model string. By contrast, length(words) would return the number of elements in the words vector, which is the number of rows in the DataFrame.
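The difference between length. and length can be checked outside Kezdi.jl. The sketch below uses only base Julia broadcasting on a small made-up vector of model names, so the explicit dots on split and length are needed here, unlike inside a Kezdi.jl command.

```julia
# Illustrative stand-in for the Model column.
models = ["Mazda RX4", "Mazda RX4 Wag", "Datsun 710"]

# Broadcast split over the column: each element becomes a vector of words.
words = split.(models, " ")

# Broadcast length over the cells: number of words per model name.
n_words = length.(words)

# Plain length acts on the outer vector: number of rows.
n_rows = length(words)
```

With this input, n_words is [2, 3, 2] and n_rows is 3.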
The @if condition
Almost every command can be followed by an @if condition that filters the data frame. The command will only be executed on the subset of rows for which the condition evaluates to true. The condition can use any combination of column names and functions.
@summarize MPG @if Horsepower > median(Horsepower)
Autovectorization rules also apply to @if conditions. If you use a vector function, it will be evaluated on the entire column, before subsetting the data frame. By contrast, vector functions in @generate or @collapse commands are evaluated on the subset of rows that satisfy the condition.
@generate HP_p75 = median(Horsepower) @if Horsepower > median(Horsepower)
This code computes the median of horsepower values above the median, that is, the 75th percentile of the horsepower distribution. Of course, you can more easily do this calculation with @summarize:
s = @summarize Horsepower
s.p75
Handling missing values
Kezdi.jl ignores missing values when aggregating over entire columns.
@with DataFrame(A = [1, 2, missing, 4]) begin
    @collapse mean_A = mean(A)
end
returns mean_A = 2.33.
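For comparison, the same aggregation in plain Julia requires skipping missing values explicitly. This sketch uses only the Statistics standard library and shows the two results side by side; how Kezdi.jl achieves the skipping internally is not shown here.

```julia
using Statistics

A = [1, 2, missing, 4]

# Without special handling, a single missing value makes the mean missing.
plain_mean = mean(A)

# Skipping missing values averages the remaining three numbers.
skipped_mean = mean(skipmissing(A))
```

The second result, 7/3, is the 2.33 reported above.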
Other functions typically return missing if any of the values are missing. If a function does not accept missing values, Kezdi.jl will pass it through passmissing to handle missing values.
You can also manually check for missing values with the ismissing function.
@with DataFrame(x = [1, 2, missing, 4]) begin
    @generate y = log(x)
end
returns
4×2 DataFrame
 Row │ x        y
     │ Int64?   Float64?
─────┼─────────────────────────
   1 │       1        0.0
   2 │       2        0.693147
   3 │ missing  missing
   4 │       4        1.38629
The same will hold for Dates.year, even though this function does not accept missing values.
julia> @with DataFrame(x = [1, 2, missing, 4]) begin
           @generate y = Dates.year(x)
       end
4×2 DataFrame
 Row │ x        y
     │ Int64?   Int64?
─────┼──────────────────
   1 │       1        1
   2 │       2        1
   3 │ missing  missing
   4 │       4        1
In @if conditions, missing is treated as false. This is the behavior users expect: when they test for a condition, they expect it to be true, not missing.
@with DataFrame(x = [1, 2, missing, 4]) begin
    @keep @if x <= 2
end
returns [1, 2].
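In plain Julia, the comparison itself yields missing for the third row, so the rule "missing is treated as false" amounts to replacing missing with false before filtering. The following sketch illustrates that idea with base functions only; it is an analogue, not Kezdi.jl's implementation.

```julia
x = [1, 2, missing, 4]

# Element-wise comparison propagates missing.
raw = x .<= 2

# Treat missing as false, giving a plain Boolean mask.
mask = coalesce.(raw, false)

# Keep only the rows where the condition is definitely true.
kept = x[mask]
```

Here raw is [true, true, missing, false] and kept contains 1 and 2.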
Use cond instead of ternary operators
Ternary operators like x ? y : z are not vectorized in Julia. Instead, use the cond function, which provides the exact same functionality.
@with DataFrame(x = [1, 2, 3, 4]) begin
    @generate y = cond(x <= 2, 1, 0)
end
Note that you can achieve the same result with the more readable code
@with DataFrame(x = [1, 2, 3, 4]) begin
    @generate y = 1 @if x <= 2
    @replace y = 0 @if x > 2
end
Because cond is vectorized and vectorized functions ignore missing values, this may lead to unexpected behavior. Use @replace @if instead.
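Outside Kezdi.jl, the closest base Julia analogue of an element-wise conditional is ifelse broadcast with a dot. The sketch below compares it with a ternary applied inside a comprehension, where each test is a single Bool. Whether cond is built on ifelse is an assumption; only its behavior on vectors is being illustrated.

```julia
x = [1, 2, 3, 4]

# Broadcast ifelse: pick 1 or 0 element by element.
y = ifelse.(x .<= 2, 1, 0)

# The ternary operator needs a single Bool, so it has to be applied per element.
y_loop = [xi <= 2 ? 1 : 0 for xi in x]
```

Both give [1, 1, 0, 0].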
Row-count variables
The variable _n refers to the row number in the data frame, _N denotes the total number of rows. These can be used in @if conditions, as well.
@with DataFrame(A = [1, 2, 3, 4]) begin
    @keep @if _n < 3
end
Differences to Stata syntax
All commands begin with @
To allow for Stata-like syntax, all commands begin with @. These are macros that rewrite your Kezdi.jl code to DataFrames.jl commands.
@tabulate Gear
@keep @if Gear == 4
@keep Model MPG Horsepower Weight Displacement Cylinders
@if condition also begins with @
The @if condition is non-standard behavior in Julia, so it is also implemented as a macro.
@collapse has same syntax as @egen
Unlike Stata, where egen and collapse have different syntax, Kezdi.jl uses the same syntax for both commands.
@egen mean_HP = mean(Horsepower), by(Cylinders)
@collapse mean_HP = mean(Horsepower), by(Cylinders)
Different function names
To maintain compatibility with Julia, we had to rename some functions. For example, count is called rowcount, missing is called ismissing, max is maximum, and min is minimum in Kezdi.jl.
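The renaming follows from how the base Julia functions behave: max and min compare their arguments, while maximum and minimum reduce a collection, and count counts elements satisfying a predicate. A short sketch in plain Julia, with made-up values:

```julia
mpg = [21.0, 22.8, 18.7]

# Reductions over a whole column.
largest  = maximum(mpg)
smallest = minimum(mpg)

# max compares its arguments rather than reducing a vector.
pairwise = max(21.0, 22.8)

# Base count counts elements for which a predicate holds.
n_above_20 = count(>(20), mpg)
```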
Missing values
In Julia, the result of any operation involving a missing value is missing. The only exception is the ismissing function, which returns true if the value is missing and false otherwise. You cannot check for missing values with == missing.
For convenience, Kezdi.jl has special rules about Handling missing values. We also extended the ismissing function to work with multiple arguments.
@with DataFrame(x = [1, 2, missing, 4], y = [1, missing, 3, 4]) begin
    @generate z = ismissing(x, y)
end
4×3 DataFrame
 Row │ x        y        z
     │ Int64?   Int64?   Bool
─────┼─────────────────────────
   1 │       1        1  false
   2 │       2  missing   true
   3 │ missing        3   true
   4 │       4        4  false
Missing is not greater than anything, so comparison with missing values will always return missing.
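These propagation rules can be verified in plain Julia. The sketch below uses only base functions; the last line is a hand-written analogue of the multi-argument ismissing described above, not the Kezdi.jl method itself.

```julia
a = 1 + missing            # arithmetic propagates missing
b = missing > 1            # so does comparison
c = missing == missing     # even equality with missing is missing
d = ismissing(missing)     # the reliable test

# Analogue of ismissing(x, y): true if any argument is missing.
x, y = 2, missing
any_missing = any(ismissing, (x, y))
```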
In @if conditions, missing is treated as false. This is the behavior users expect: when they test for a condition, they expect it to be true, not missing.
@with DataFrame(x = [1, 2, missing, 4]) begin
    @keep @if x <= 2
end
returns [1, 2].
Convenience functions
Kezdi.distinct — Function
distinct(x::AbstractVector) = unique(x)
Convenience function to get the distinct values of a vector.
Kezdi.rowcount — Function
rowcount(x::AbstractVector) = length(keep_only_values(x))
Count the number of valid values in a vector.
Kezdi.keep_only_values — Function
keep_only_values(x::AbstractVector) -> AbstractVector
Return a vector with only the values of x, excluding any missing values, nothings, Infs and NaNs.
Base.ismissing — Function
ismissing(args...) -> Bool
Return true if any of the arguments is missing.
Kezdi.cond — Function
cond(x, y, z)
Return y if x is true, otherwise return z. If x is a vector, the operation is vectorized. This function mimics x ? y : z, which cannot be vectorized.
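As an illustration of the rowcount and keep_only_values entries above, the following plain Julia sketch filters out the same kinds of values. The names is_valid and only_valid are hypothetical and the code is an approximation written from the description, not the package source.

```julia
# A value is valid if it is not missing, not nothing, and not NaN or infinite.
is_valid(v) = !(ismissing(v) || isnothing(v) ||
                (v isa Number && (isnan(v) || isinf(v))))

# Keep only the valid entries of a vector.
only_valid(x::AbstractVector) = filter(is_valid, x)

data = [1.0, missing, NaN, Inf, 2.5]
valid = only_valid(data)

# Counting valid values mirrors the definition of rowcount given above.
n_valid = length(valid)
```

For this input, valid contains 1.0 and 2.5, so n_valid is 2.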
Acknowledgements
Inspiration for the package came from Tidier.jl, a similar package launched by Karandeep Singh that provides a dplyr-like interface for Julia. Johannes Boehm has also developed a similar package, Douglass.jl.
The package is built on top of DataFrames.jl, FreqTables.jl and FixedEffectModels.jl. The @with macro relies on Chain.jl by Julius Krumbiegel.
The package is named after Gabor Kezdi, a Hungarian economist who has made significant contributions to teaching data analysis.
[stata] Stata is a registered trademark of StataCorp LLC. Kezdi.jl is not affiliated with StataCorp LLC.