Kezdi.jl Documentation

Kezdi.jl is a Julia package that provides a Stata-like interface for data manipulation and analysis. It is designed to be easy to use for Stata users who are transitioning to Julia.^[stata]

It imports and reexports CSV, DataFrames, FixedEffectModels, FreqTables, ReadStatTables, Statistics, and StatsBase. These packages are not covered in this documentation, but you can find more information by following the links.

Getting started

Kezdi.jl is in beta

Kezdi.jl is currently in beta. The package has more than 400 unit tests and high code coverage, but it is not guaranteed to be bug-free. If you encounter a problem, please report it as a GitHub issue.

If you would like to receive updates on the package, please star the repository on GitHub and sign up for email notifications here.

Installation

To install the package, run the following command in Julia's REPL:

using Pkg; Pkg.add("Kezdi")

Every Kezdi.jl command is a macro that begins with @. These commands operate on a global DataFrame that is set using the setdf function. Alternatively, commands can be executed within a @with block that sets the DataFrame for the duration of the block.

Example

julia> using Kezdi
julia> using RDatasets
julia> df = dataset("datasets", "mtcars")
32×12 DataFrame
 Row │ Model              MPG      Cyl    Disp     HP     DRat     WT       QS ⋯
     │ String31           Float64  Int64  Float64  Int64  Float64  Float64  Fl ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Mazda RX4             21.0      6    160.0    110     3.9     2.62      ⋯
   2 │ Mazda RX4 Wag         21.0      6    160.0    110     3.9     2.875
   3 │ Datsun 710            22.8      4    108.0     93     3.85    2.32
   4 │ Hornet 4 Drive        21.4      6    258.0    110     3.08    3.215
   5 │ Hornet Sportabout     18.7      8    360.0    175     3.15    3.44      ⋯
   6 │ Valiant               18.1      6    225.0    105     2.76    3.46
   7 │ Duster 360            14.3      8    360.0    245     3.21    3.57
   8 │ Merc 240D             24.4      4    146.7     62     3.69    3.19
  ⋮  │         ⋮             ⋮       ⋮       ⋮       ⋮       ⋮        ⋮        ⋱
  26 │ Fiat X1-9             27.3      4     79.0     66     4.08    1.935     ⋯
  27 │ Porsche 914-2         26.0      4    120.3     91     4.43    2.14
  28 │ Lotus Europa          30.4      4     95.1    113     3.77    1.513
  29 │ Ford Pantera L        15.8      8    351.0    264     4.22    3.17
  30 │ Ferrari Dino          19.7      6    145.0    175     3.62    2.77      ⋯
  31 │ Maserati Bora         15.0      8    301.0    335     3.54    3.57
  32 │ Volvo 142E            21.4      4    121.0    109     4.11    2.78
                                                   5 columns and 17 rows omitted
julia> setdf(df)
32×12 DataFrame
 Row │ Model              MPG      Cyl    Disp     HP     DRat     WT       QS ⋯
     │ String31           Float64  Int64  Float64  Int64  Float64  Float64  Fl ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Mazda RX4             21.0      6    160.0    110     3.9     2.62      ⋯
   2 │ Mazda RX4 Wag         21.0      6    160.0    110     3.9     2.875
   3 │ Datsun 710            22.8      4    108.0     93     3.85    2.32
   4 │ Hornet 4 Drive        21.4      6    258.0    110     3.08    3.215
   5 │ Hornet Sportabout     18.7      8    360.0    175     3.15    3.44      ⋯
   6 │ Valiant               18.1      6    225.0    105     2.76    3.46
   7 │ Duster 360            14.3      8    360.0    245     3.21    3.57
   8 │ Merc 240D             24.4      4    146.7     62     3.69    3.19
  ⋮  │         ⋮             ⋮       ⋮       ⋮       ⋮       ⋮        ⋮        ⋱
  26 │ Fiat X1-9             27.3      4     79.0     66     4.08    1.935     ⋯
  27 │ Porsche 914-2         26.0      4    120.3     91     4.43    2.14
  28 │ Lotus Europa          30.4      4     95.1    113     3.77    1.513
  29 │ Ford Pantera L        15.8      8    351.0    264     4.22    3.17
  30 │ Ferrari Dino          19.7      6    145.0    175     3.62    2.77      ⋯
  31 │ Maserati Bora         15.0      8    301.0    335     3.54    3.57
  32 │ Volvo 142E            21.4      4    121.0    109     4.11    2.78
                                                   5 columns and 17 rows omitted
julia> @rename HP Horsepower
Kezdi.jl> @rename HP Horsepower

32×12 DataFrame
 Row │ Model              MPG      Cyl    Disp     Horsepower  DRat     WT     ⋯
     │ String31           Float64  Int64  Float64  Int64       Float64  Float6 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Mazda RX4             21.0      6    160.0         110     3.9     2.62 ⋯
   2 │ Mazda RX4 Wag         21.0      6    160.0         110     3.9     2.87
   3 │ Datsun 710            22.8      4    108.0          93     3.85    2.32
   4 │ Hornet 4 Drive        21.4      6    258.0         110     3.08    3.21
   5 │ Hornet Sportabout     18.7      8    360.0         175     3.15    3.44 ⋯
   6 │ Valiant               18.1      6    225.0         105     2.76    3.46
   7 │ Duster 360            14.3      8    360.0         245     3.21    3.57
   8 │ Merc 240D             24.4      4    146.7          62     3.69    3.19
  ⋮  │         ⋮             ⋮       ⋮       ⋮         ⋮          ⋮        ⋮   ⋱
  26 │ Fiat X1-9             27.3      4     79.0          66     4.08    1.93 ⋯
  27 │ Porsche 914-2         26.0      4    120.3          91     4.43    2.14
  28 │ Lotus Europa          30.4      4     95.1         113     3.77    1.51
  29 │ Ford Pantera L        15.8      8    351.0         264     4.22    3.17
  30 │ Ferrari Dino          19.7      6    145.0         175     3.62    2.77 ⋯
  31 │ Maserati Bora         15.0      8    301.0         335     3.54    3.57
  32 │ Volvo 142E            21.4      4    121.0         109     4.11    2.78
                                                   6 columns and 17 rows omitted
julia> @rename Disp Displacement
Kezdi.jl> @rename Disp Displacement

32×12 DataFrame
 Row │ Model              MPG      Cyl    Displacement  Horsepower  DRat     W ⋯
     │ String31           Float64  Int64  Float64       Int64       Float64  F ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Mazda RX4             21.0      6         160.0         110     3.9     ⋯
   2 │ Mazda RX4 Wag         21.0      6         160.0         110     3.9
   3 │ Datsun 710            22.8      4         108.0          93     3.85
   4 │ Hornet 4 Drive        21.4      6         258.0         110     3.08
   5 │ Hornet Sportabout     18.7      8         360.0         175     3.15    ⋯
   6 │ Valiant               18.1      6         225.0         105     2.76
   7 │ Duster 360            14.3      8         360.0         245     3.21
   8 │ Merc 240D             24.4      4         146.7          62     3.69
  ⋮  │         ⋮             ⋮       ⋮         ⋮            ⋮          ⋮       ⋱
  26 │ Fiat X1-9             27.3      4          79.0          66     4.08    ⋯
  27 │ Porsche 914-2         26.0      4         120.3          91     4.43
  28 │ Lotus Europa          30.4      4          95.1         113     3.77
  29 │ Ford Pantera L        15.8      8         351.0         264     4.22
  30 │ Ferrari Dino          19.7      6         145.0         175     3.62    ⋯
  31 │ Maserati Bora         15.0      8         301.0         335     3.54
  32 │ Volvo 142E            21.4      4         121.0         109     4.11
                                                   6 columns and 17 rows omitted
julia> @rename WT Weight
Kezdi.jl> @rename WT Weight

32×12 DataFrame
 Row │ Model              MPG      Cyl    Displacement  Horsepower  DRat     W ⋯
     │ String31           Float64  Int64  Float64       Int64       Float64  F ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Mazda RX4             21.0      6         160.0         110     3.9     ⋯
   2 │ Mazda RX4 Wag         21.0      6         160.0         110     3.9
   3 │ Datsun 710            22.8      4         108.0          93     3.85
   4 │ Hornet 4 Drive        21.4      6         258.0         110     3.08
   5 │ Hornet Sportabout     18.7      8         360.0         175     3.15    ⋯
   6 │ Valiant               18.1      6         225.0         105     2.76
   7 │ Duster 360            14.3      8         360.0         245     3.21
   8 │ Merc 240D             24.4      4         146.7          62     3.69
  ⋮  │         ⋮             ⋮       ⋮         ⋮            ⋮          ⋮       ⋱
  26 │ Fiat X1-9             27.3      4          79.0          66     4.08    ⋯
  27 │ Porsche 914-2         26.0      4         120.3          91     4.43
  28 │ Lotus Europa          30.4      4          95.1         113     3.77
  29 │ Ford Pantera L        15.8      8         351.0         264     4.22
  30 │ Ferrari Dino          19.7      6         145.0         175     3.62    ⋯
  31 │ Maserati Bora         15.0      8         301.0         335     3.54
  32 │ Volvo 142E            21.4      4         121.0         109     4.11
                                                   6 columns and 17 rows omitted
julia> @rename Cyl Cylinders
Kezdi.jl> @rename Cyl Cylinders

32×12 DataFrame
 Row │ Model              MPG      Cylinders  Displacement  Horsepower  DRat   ⋯
     │ String31           Float64  Int64      Float64       Int64       Float6 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Mazda RX4             21.0          6         160.0         110     3.9 ⋯
   2 │ Mazda RX4 Wag         21.0          6         160.0         110     3.9
   3 │ Datsun 710            22.8          4         108.0          93     3.8
   4 │ Hornet 4 Drive        21.4          6         258.0         110     3.0
   5 │ Hornet Sportabout     18.7          8         360.0         175     3.1 ⋯
   6 │ Valiant               18.1          6         225.0         105     2.7
   7 │ Duster 360            14.3          8         360.0         245     3.2
   8 │ Merc 240D             24.4          4         146.7          62     3.6
  ⋮  │         ⋮             ⋮         ⋮           ⋮            ⋮          ⋮   ⋱
  26 │ Fiat X1-9             27.3          4          79.0          66     4.0 ⋯
  27 │ Porsche 914-2         26.0          4         120.3          91     4.4
  28 │ Lotus Europa          30.4          4          95.1         113     3.7
  29 │ Ford Pantera L        15.8          8         351.0         264     4.2
  30 │ Ferrari Dino          19.7          6         145.0         175     3.6 ⋯
  31 │ Maserati Bora         15.0          8         301.0         335     3.5
  32 │ Volvo 142E            21.4          4         121.0         109     4.1
                                                   7 columns and 17 rows omitted
julia> @tabulate Gear
Kezdi.jl> @tabulate Gear

3-element Named Vector{Int64}
Gear  │
──────┼───
3     │ 15
4     │ 12
5     │  5
julia> @keep @if Gear == 4
Kezdi.jl> @keep  @if Gear == 4

12×12 DataFrame
 Row │ Model           MPG      Cylinders  Displacement  Horsepower  DRat      ⋯
     │ String31        Float64  Int64      Float64       Int64       Float64   ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Mazda RX4          21.0          6         160.0         110     3.9    ⋯
   2 │ Mazda RX4 Wag      21.0          6         160.0         110     3.9
   3 │ Datsun 710         22.8          4         108.0          93     3.85
   4 │ Merc 240D          24.4          4         146.7          62     3.69
   5 │ Merc 230           22.8          4         140.8          95     3.92   ⋯
   6 │ Merc 280           19.2          6         167.6         123     3.92
   7 │ Merc 280C          17.8          6         167.6         123     3.92
   8 │ Fiat 128           32.4          4          78.7          66     4.08
   9 │ Honda Civic        30.4          4          75.7          52     4.93   ⋯
  10 │ Toyota Corolla     33.9          4          71.1          65     4.22
  11 │ Fiat X1-9          27.3          4          79.0          66     4.08
  12 │ Volvo 142E         21.4          4         121.0         109     4.11
                                                               6 columns omitted
julia> @keep MPG Horsepower Weight Displacement Cylinders
Kezdi.jl> @keep MPG Horsepower Weight Displacement Cylinders

12×5 DataFrame
 Row │ MPG      Horsepower  Weight   Displacement  Cylinders 
     │ Float64  Int64       Float64  Float64       Int64     
─────┼───────────────────────────────────────────────────────
   1 │    21.0         110    2.62          160.0          6
   2 │    21.0         110    2.875         160.0          6
   3 │    22.8          93    2.32          108.0          4
   4 │    24.4          62    3.19          146.7          4
   5 │    22.8          95    3.15          140.8          4
   6 │    19.2         123    3.44          167.6          6
   7 │    17.8         123    3.44          167.6          6
   8 │    32.4          66    2.2            78.7          4
   9 │    30.4          52    1.615          75.7          4
  10 │    33.9          65    1.835          71.1          4
  11 │    27.3          66    1.935          79.0          4
  12 │    21.4         109    2.78          121.0          4
julia> @summarize MPG
Kezdi.jl> @summarize MPG

Summarize MPG:
  N = 12
  sum_w = 12.0
  mean = 24.53333333333333
  Var = 27.844242424242417
  sd = 5.276764389684498
  skewness = 0.6109081273366428
  kurtosis = 2.054454265238661
  sum = 294.4
  min = 17.8
  max = 33.9
  p1 = 17.8
  p5 = 17.94
  p10 = 18.78
  p25 = 21.0
  p50 = 22.8
  p75 = 28.85
  p90 = 32.849999999999994
  p95 = 33.75
  p99 = 33.9
julia> @regress log(MPG) log(Horsepower) log(Weight) log(Displacement) fe(Cylinders), robust
Kezdi.jl> @regress log(MPG) log(Horsepower) log(Weight) log(Displacement) fe(Cylinders), robust

                                  FixedEffectModel
====================================================================================
Number of obs:                         12  Converged:                           true
dof (model):                            3  dof (residuals):                        7
R²:                                 0.919  R² adjusted:                        0.872
F-statistic:                      16.5436  P-value:                            0.001
R² within:                          0.837  Iterations:                             1
====================================================================================
                    Estimate  Std. Error     t-stat  Pr(>|t|)  Lower 95%   Upper 95%
────────────────────────────────────────────────────────────────────────────────────
log(Horsepower)    -0.336986   0.0557811  -6.04122     0.0005  -0.468887  -0.205084
log(Weight)         0.17324    0.261239    0.663148    0.5285  -0.444491   0.790971
log(Displacement)  -0.491497   0.239809   -2.04954     0.0796  -1.05855    0.0755604
====================================================================================

Alternatively, you can use the @with block to avoid writing to a global DataFrame:

julia> renamed_df = @with df begin
           @rename HP Horsepower
           @rename Disp Displacement
           @rename WT Weight
           @rename Cyl Cylinders
       end
Kezdi.jl> @rename HP Horsepower

Kezdi.jl> @rename Disp Displacement

Kezdi.jl> @rename WT Weight

Kezdi.jl> @rename Cyl Cylinders

32×12 DataFrame
 Row │ Model              MPG      Cylinders  Displacement  Horsepower  DRat   ⋯
     │ String31           Float64  Int64      Float64       Int64       Float6 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Mazda RX4             21.0          6         160.0         110     3.9 ⋯
   2 │ Mazda RX4 Wag         21.0          6         160.0         110     3.9
   3 │ Datsun 710            22.8          4         108.0          93     3.8
   4 │ Hornet 4 Drive        21.4          6         258.0         110     3.0
   5 │ Hornet Sportabout     18.7          8         360.0         175     3.1 ⋯
   6 │ Valiant               18.1          6         225.0         105     2.7
   7 │ Duster 360            14.3          8         360.0         245     3.2
   8 │ Merc 240D             24.4          4         146.7          62     3.6
  ⋮  │         ⋮             ⋮         ⋮           ⋮            ⋮          ⋮   ⋱
  26 │ Fiat X1-9             27.3          4          79.0          66     4.0 ⋯
  27 │ Porsche 914-2         26.0          4         120.3          91     4.4
  28 │ Lotus Europa          30.4          4          95.1         113     3.7
  29 │ Ford Pantera L        15.8          8         351.0         264     4.2
  30 │ Ferrari Dino          19.7          6         145.0         175     3.6 ⋯
  31 │ Maserati Bora         15.0          8         301.0         335     3.5
  32 │ Volvo 142E            21.4          4         121.0         109     4.1
                                                   7 columns and 17 rows omitted
julia> @with renamed_df begin
           @tabulate Gear
           @keep @if Gear == 4
           @keep MPG Horsepower Weight Displacement Cylinders
           @summarize MPG
           @regress log(MPG) log(Horsepower) log(Weight) log(Displacement) fe(Cylinders), robust
       end
Kezdi.jl> @tabulate Gear

Kezdi.jl> @keep  @if Gear == 4

Kezdi.jl> @keep MPG Horsepower Weight Displacement Cylinders

Kezdi.jl> @summarize MPG

Kezdi.jl> @regress log(MPG) log(Horsepower) log(Weight) log(Displacement) fe(Cylinders), robust

                                  FixedEffectModel
====================================================================================
Number of obs:                         12  Converged:                           true
dof (model):                            3  dof (residuals):                        7
R²:                                 0.919  R² adjusted:                        0.872
F-statistic:                      16.5436  P-value:                            0.001
R² within:                          0.837  Iterations:                             1
====================================================================================
                    Estimate  Std. Error     t-stat  Pr(>|t|)  Lower 95%   Upper 95%
────────────────────────────────────────────────────────────────────────────────────
log(Horsepower)    -0.336986   0.0557811  -6.04122     0.0005  -0.468887  -0.205084
log(Weight)         0.17324    0.261239    0.663148    0.5285  -0.444491   0.790971
log(Displacement)  -0.491497   0.239809   -2.04954     0.0796  -1.05855    0.0755604
====================================================================================

Benefits of using Kezdi.jl

Free and open-source

Speed

Command	Stata	Julia 2nd run	Speedup
`@generate`	230ms	46ms	5x
`@replace`	232ms	32ms	7x
`@egen`	5.00s	0.37s	13x
`@collapse`	0.94s	0.28s	3x
`@tabulate`	2.19s	0.09s	24x
`@summarize`	10.56s	0.35s	30x
`@regress`	0.85s	0.14s	6x

See the benchmarking code for Stata and Kezdi.jl.
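
The "Julia 2nd run" column presumably reflects the fact that Julia compiles a method the first time it is called with a given set of argument types, so a first call includes compilation time. The following is a minimal sketch of that effect using only Base Julia. The function `work` and its input are hypothetical and are not taken from the benchmarking code; actual timings depend on the machine.

```julia
# Hypothetical workload, not the documentation's benchmark.
work(x) = sum(log.(x))

x = collect(1.0:1_000_000.0)

# The first call pays for compiling `work` for a Vector{Float64}.
first_run = @elapsed work(x)

# The second call reuses the compiled code.
second_run = @elapsed work(x)

println("first run:  ", first_run, " s")
println("second run: ", second_run, " s")
```

When comparing against another tool, timing the second call separates the steady-state cost of a command from the one-time compilation cost.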

Use any Julia function

@generate logHP = log(Horsepower)
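
For intuition, and assuming that `@generate` applies the function to the value in each row, the plain-Julia counterpart on an ordinary vector is a broadcast call. The sketch below uses only Base Julia and the first three `Horsepower` values from the example table; the vector name `horsepower` is illustrative.

```julia
# First three Horsepower values from the mtcars example above.
horsepower = [110, 110, 93]

# Broadcasting with the dot syntax applies log to every element.
logHP = log.(horsepower)

println(logHP)
```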

Easily extendable with user-defined functions

The function can operate on individual elements,

get_make(text) = split(text, " ")[1]
@generate Make = get_make(Model)

or on the entire column:

function geometric_mean(x::Vector)
    n = length(x)
    return exp(sum(log.(x)) / n)
end
@collapse geom_MPG = geometric_mean(MPG), by(Cylinders)
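
Both helper functions are ordinary Julia functions, so they can be checked on their own before being used in a Kezdi.jl command. The sketch below runs them on small hand-written inputs using only Base Julia; the sample vectors are illustrative and are not part of the package.

```julia
# Element-wise helper: take the first word of a model name.
get_make(text) = split(text, " ")[1]

# Column-wise helper: geometric mean of a whole vector.
function geometric_mean(x::Vector)
    n = length(x)
    return exp(sum(log.(x)) / n)
end

# Apply the element-wise helper to each entry with broadcasting.
models = ["Mazda RX4", "Datsun 710", "Merc 240D"]
makes = get_make.(models)

# Apply the column-wise helper to the vector as a whole.
# The geometric mean of 1, 4 and 16 is the cube root of 64.
gm = geometric_mean([1.0, 4.0, 16.0])

println(makes)
println(gm)
```

Testing a helper in isolation like this makes it easier to tell whether an unexpected result comes from the function itself or from how it is used inside a command.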

Commands

Setting and inspecting the global DataFrame

Kezdi.setdf — Function

setdf(df::Union{AbstractDataFrame, Nothing})

Set the global data frame.