Kezdi.jl Documentation
Kezdi.jl is a Julia package that provides a Stata-like interface for data manipulation and analysis. It is designed to be easy to use for Stata users who are transitioning to Julia.[stata]
It imports and reexports CSV, DataFrames, FixedEffectModels, FreqTables, ReadStatTables, Statistics, and StatsBase. These packages are not covered in this documentation, but you can find more information by following the links.
Getting started
Kezdi.jl is currently in beta. We have close to 300 unit tests and high code coverage. The package, however, is not guaranteed to be bug-free. If you encounter any issues, please report them as a GitHub issue.
If you would like to receive updates on the package, please star the repository on GitHub and sign up for email notifications here.
Installation
To install the package, run the following command in Julia's REPL:
using Pkg; Pkg.add(url="https://github.com/codedthinking/Kezdi.jl")
Every Kezdi.jl command is a macro that begins with @. These commands operate on a global DataFrame that is set using the setdf function. Alternatively, commands can be executed within a @with block that sets the DataFrame for the duration of the block.
Example
using Kezdi
using RDatasets
setdf(dataset("datasets", "mtcars"))
@rename HP Horsepower
@rename Disp Displacement
@rename WT Weight
@rename Cyl Cylinders
@tabulate Gear
@keep @if Gear == 4
@keep MPG Horsepower Weight Displacement Cylinders
@summarize MPG
@regress log(MPG) log(Horsepower) log(Weight) log(Displacement) fe(Cylinders), robust
Alternatively, you can use the @with block to avoid writing to a global DataFrame:
using Kezdi
using RDatasets
df = dataset("datasets", "mtcars")
renamed_df = @with df begin
    @rename HP Horsepower
    @rename Disp Displacement
    @rename WT Weight
    @rename Cyl Cylinders
end
@with renamed_df begin
    @tabulate Gear
    @keep @if Gear == 4
    @keep MPG Horsepower Weight Displacement Cylinders
    @summarize MPG
    @regress log(MPG) log(Horsepower) log(Weight) log(Displacement) fe(Cylinders), robust
end
Benefits of using Kezdi.jl
Free and open-source
Speed
Command | Stata | Julia 1st run | Julia 2nd run | Speedup |
---|---|---|---|---|
@egen | 4.90s | 1.60s | 0.41s | 10x |
@collapse | 0.92s | 0.18s | 0.13s | 8x |
@tabulate | 2.14s | 0.46s | 0.10s | 20x |
@summarize | 10.40s | 0.58s | 0.37s | 28x |
@regress | 0.89s | 1.93s | 0.16s | 6x |
Use any Julia function
@generate logHP = log(Horsepower)
Easily extendable with user-defined functions
The function can operate on individual elements,
get_make(text) = split(text, " ")[1]
@generate Make = Main.get_make(Model)
or on the entire column:
function geometric_mean(x::AbstractVector)
    n = length(x)
    return exp(sum(log.(x)) / n)
end
@collapse geom_MPG = Main.geometric_mean(MPG), by(Cylinders)
If you define a function in your own code, you need to prefix the function name with Main. to use it in other commands. For the rules described under Automatic vectorization to treat a function as operating on an entire column, declare its argument with a vector type, as in geometric_mean(x::AbstractVector) above.
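Outside of Kezdi.jl, the difference between the two kinds of function can be seen in plain Julia. The sketch below reuses get_make and geometric_mean from above; the sample strings and numbers are made up for illustration, and the explicit broadcast stands in for what automatic vectorization is described as doing.

```julia
# Element-wise helper: takes one string and returns its first word.
get_make(text) = split(text, " ")[1]

# Column-wise helper: the AbstractVector annotation marks it as taking a whole column.
function geometric_mean(x::AbstractVector)
    n = length(x)
    return exp(sum(log.(x)) / n)
end

# Hypothetical sample data.
models = ["Mazda RX4", "Datsun 710", "Hornet 4 Drive"]

makes = get_make.(models)              # one call per element (explicit broadcast)
gm = geometric_mean([1.0, 4.0, 16.0])  # one call on the entire vector
```

Here makes holds one value per row, while gm is a single number for the whole column.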
Commands
Setting and inspecting the global DataFrame
Kezdi.setdf — Function
setdf(df::Union{AbstractDataFrame, Nothing})
Set the global data frame.
Kezdi.getdf — Function
getdf() -> AbstractDataFrame
Return the global data frame.
Kezdi.@names — Macro
@names
Display the names of the variables in the data frame.
Kezdi.@list — Macro
@list
Display the entire data frame.
Kezdi.@head — Macro
@head [n]
Display the first n rows of the data frame. By default, n is 5.
Kezdi.@tail — Macro
@tail [n]
Display the last n rows of the data frame. By default, n is 5.
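A minimal sketch of how these inspection helpers might be combined. It assumes Kezdi is installed and that DataFrame and nrow are available through its DataFrames re-export; the column name A and its values are hypothetical.

```julia
using Kezdi  # assumed to re-export DataFrames (DataFrame, nrow, names)

# Hypothetical toy data; any DataFrame works here.
setdf(DataFrame(A = [1, 2, 3, 4, 5, 6]))

@names     # display the column names
@head 2    # display the first two rows
@tail 2    # display the last two rows

df = getdf()  # retrieve the global data frame for use in ordinary Julia code
```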
Filtering columns and rows
Kezdi.@keep — Macro
@keep y1 y2 ... [@if condition]
Keep only the variables y1, y2, etc. in df. If condition is provided, only the rows for which the condition is true are kept.
Kezdi.@drop — Macro
@drop y1 y2 ...
or
@drop [@if condition]
Drop the variables y1, y2, etc. from df. If condition is provided, the rows for which the condition is true are dropped.
Modifying the data
Kezdi.@rename — Macro
@rename oldname newname
Rename the variable oldname to newname in the data frame.
Kezdi.@generate — Macro
@generate y = expr [@if condition]
Create a new variable y in df by evaluating expr. If condition is provided, the operation is executed only on rows for which the condition is true. When the condition is false, the variable will be missing.
Kezdi.@replace — Macro
@replace y = expr [@if condition]
Replace the values of y in df with the result of evaluating expr. If condition is provided, the operation is executed only on rows for which the condition is true. When the condition is false, the variable will be left unchanged.
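The different treatment of rows that fail the condition can be sketched in plain Julia. This illustrates the documented semantics only, not how Kezdi.jl implements them; the vector x and the expressions are made up.

```julia
x = [1, 2, 3, 4]
cond = x .> 2   # the @if condition, evaluated row by row

# Like `@generate y = 2 * x @if x > 2`:
# rows where the condition is false become missing.
y = ifelse.(cond, 2 .* x, missing)

# Like `@replace x = 0 @if x > 2`:
# rows where the condition is false keep their old value.
x_new = ifelse.(cond, 0, x)
```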
Kezdi.@egen — Macro
@egen y1 = expr1 y2 = expr2 ... [@if condition], [by(group1, group2, ...)]
Generate new variables in df by evaluating expressions expr1, expr2, etc. If condition is provided, the operation is executed only on rows for which the condition is true. When the condition is false, the variables will be missing. If by is provided, the operation is executed by group.
Kezdi.@collapse — Macro
@collapse y1 = expr1 y2 = expr2 ... [@if condition], [by(group1, group2, ...)]
Collapse df by evaluating expressions expr1, expr2, etc. If condition is provided, the operation is executed only on rows for which the condition is true. If by is provided, the operation is executed by group.
Kezdi.@sort — Macro
@sort y1 y2 ...
Sort the data frame by the variables y1, y2, etc. in ascending order.
Summarizing and analyzing data
Kezdi.@count — Macro
@count [@if condition]
Count the number of rows for which the condition is true. If condition is not provided, the total number of rows is counted.
Kezdi.@tabulate — Macro
@tabulate y1 y2 ... [@if condition]
Create a frequency table for the variables y1, y2, etc. in df. If condition is provided, the operation is executed only on rows for which the condition is true.
Kezdi.@summarize — Macro
@summarize y [@if condition]
Summarize the variable y in df. If condition is provided, the operation is executed only on rows for which the condition is true.
Kezdi.@regress — Macro
@regress y x1 x2 ... [@if condition], [robust] [cluster(var1, var2, ...)]
Estimate a regression model in df with dependent variable y and independent variables x1, x2, etc. If condition is provided, the operation is executed only on rows for which the condition is true. If robust is provided, robust standard errors are calculated. If cluster is provided, clustered standard errors are calculated.
Use on another DataFrame
Kezdi.With.@with — Macro
@with df begin
    # do something with df
end
The @with macro is a convenience macro that allows you to set the current data frame and perform operations on it in a single block. The first argument is the data frame to set as the current data frame, and the second argument is a block of code to execute. The data frame is set as the current data frame for the duration of the block, and then restored to its previous value after the block is executed.
The macro returns the value of the last expression in the block.
Kezdi.With.@with! — Macro
@with! df begin
    # do something with df
end
The @with! macro is a convenience macro that allows you to set the current data frame and perform operations on it in a single block. The first argument is the data frame to set as the current data frame, and the second argument is a block of code to execute. The data frame is set as the current data frame for the duration of the block, and then restored to its previous value after the block is executed.
The macro does not have a return value; it overwrites the data frame directly.
Differences to standard Julia and DataFrames syntax
To maximize convenience for Stata users, Kezdi.jl has a number of differences to standard Julia and DataFrames syntax.
Everything is a macro
While there are a few convenience functions, most Kezdi.jl commands are macros that begin with @.
@tabulate Gear
Comma is used for options
Due to this non-standard syntax, Kezdi.jl uses the comma to separate options.
@regress log(MPG) log(Horsepower), robust
Here log(MPG) and log(Horsepower) are the dependent and independent variables, respectively, and robust is an option. Options may also take arguments:
@regress log(MPG) log(Horsepower), cluster(Cylinders)
Automatic variable name substitution
Column names of the data frame can be used directly in the commands without the need to prefix them with the data frame name or using a Symbol.
@generate logHP = log(Horsepower)
Other data manipulation packages in Julia require column names to be passed as symbols or strings. Kezdi.jl does not require this, and it will not work if you try to use symbols or strings.
Julia reserved words, like begin, export, function, and standard types like String, Int, Float64, etc., cannot be used as variable names in Kezdi.jl. If you have a column whose name is a reserved word, rename it before passing the data frame to Kezdi.jl.
Automatic vectorization
All functions are automatically vectorized, so there is no need to use the . operator to broadcast functions over elements of a column.
@generate logHP = log(Horsepower)
If you want to turn off automatic vectorization, use the convenience function DNV ("do not vectorize").
@generate logHP = DNV(log(Horsepower))
The exception is when the function operates on Vectors, in which case Kezdi.jl understands you want to apply the function to the entire column.
@collapse mean_HP = mean(Horsepower), by(Cylinders)
If you need to apply such a vector function to individual elements of a column instead, vectorize it explicitly by adding . after the function name:
@generate words = split(Model, " ")
@generate n_words = length.(words)
Here, words becomes a vector of vectors, where each element is a vector of words in the corresponding Model string. The function length. will operate on each cell in words, counting the number of words in each Model string. By contrast, length(words) would return the number of elements in the words vector, which is the number of rows in the DataFrame.
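The same distinction holds in plain Julia, where both broadcasts have to be written out by hand. A small sketch with made-up model names:

```julia
# Hypothetical sample of the Model column.
models = ["Mazda RX4", "Hornet 4 Drive", "Valiant"]

words = split.(models, " ")  # vector of vectors: one word list per row
n_words = length.(words)     # element-wise: number of words in each row
n_rows = length(words)       # whole vector: number of rows
```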
The @if condition
Almost every command can be followed by an @if condition that filters the data frame. The command will only be executed on the subset of rows for which the condition evaluates to true. The condition can use any combination of column names and functions.
@summarize MPG @if Horsepower > median(Horsepower)
Autovectorization rules also apply to @if conditions. If you use a vector function, it will be evaluated on the entire column, before subsetting the data frame. By contrast, vector functions in @generate or @collapse commands are evaluated on the subset of rows that satisfy the condition.
@generate HP_p75 = median(Horsepower) @if Horsepower > median(Horsepower)
This code computes the median of horsepower values above the median, that is, approximately the 75th percentile of the horsepower distribution. Of course, you can more easily do this calculation with @summarize:
s = @summarize Horsepower
s.p75
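The order of evaluation described in this section can be traced step by step in plain Julia. This is a sketch of the semantics with made-up horsepower values, not Kezdi.jl's implementation.

```julia
using Statistics

# Hypothetical horsepower column.
hp = [50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 110.0, 120.0]

# Step 1: the condition sees the entire column, so median(hp) is the overall median.
keep = hp .> median(hp)

# Step 2: the generated expression sees only the rows that passed the condition.
hp_upper_median = median(hp[keep])
```

With these values the overall median is 85, four rows pass the condition, and the median of those four rows is 105.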
Handling missing values
Kezdi.jl ignores missing values when aggregating over entire columns.
@with DataFrame(A = [1, 2, missing, 4]) begin
@collapse mean_A = mean(A)
end
returns mean_A = 2.33.
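For comparison, plain Julia propagates missing values unless they are skipped explicitly. The sketch below shows the standard-library behavior that the result above corresponds to; how Kezdi.jl arranges this internally is not covered here.

```julia
using Statistics

A = [1, 2, missing, 4]

plain = mean(A)                  # missing propagates through the aggregation
skipped = mean(skipmissing(A))   # averages only 1, 2 and 4
```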
Row-count variables
The variable _n refers to the row number in the data frame, and _N denotes the total number of rows. These can be used in @if conditions as well.
@with DataFrame(A = [1, 2, 3, 4]) begin
@keep @if _n < 3
end
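In plain Julia terms, the two variables behave like a vector of row indices and a row count. The names n and N below are stand-ins chosen for this sketch, not Kezdi.jl identifiers.

```julia
A = [1, 2, 3, 4]

N = length(A)      # plays the role of _N: total number of rows
n = collect(1:N)   # plays the role of _n: the row number of each row

kept = A[n .< 3]       # rows 1 and 2, as in `@keep @if _n < 3`
last_row = A[n .== N]  # the final row, selected by comparing _n with _N
```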
Differences to Stata syntax
All commands begin with @
To allow for Stata-like syntax, all commands begin with @. These are macros that rewrite your Kezdi.jl code to DataFrames.jl commands.
@tabulate Gear
@keep @if Gear == 4
@keep Model MPG Horsepower Weight Displacement Cylinders
@if condition also begins with @
The @if condition is non-standard behavior in Julia, so it is also implemented as a macro.
@collapse has the same syntax as @egen
Unlike Stata, where egen and collapse have different syntax, Kezdi.jl uses the same syntax for both commands.
@egen mean_HP = mean(Horsepower), by(Cylinders)
@collapse mean_HP = mean(Horsepower), by(Cylinders)
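Although the syntax is shared, the two commands are assumed here to differ in output shape in the same way as their Stata namesakes: @egen keeps every row and attaches the group statistic to it, while @collapse reduces the data to one row per group. A plain-Julia sketch of that difference, with made-up data:

```julia
using Statistics

# Hypothetical columns.
cyl = [4, 4, 6, 6, 6]
hp = [90.0, 110.0, 100.0, 120.0, 140.0]

groups = sort(unique(cyl))
group_mean = Dict(g => mean(hp[cyl .== g]) for g in groups)

# egen-like result: same length as the data, group mean repeated on each row.
egen_like = [group_mean[c] for c in cyl]

# collapse-like result: one value per group.
collapse_like = [group_mean[g] for g in groups]
```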
Different function names
To maintain compatibility with Julia, we had to rename some functions. For example, count is called rowcount, and missing is called ismissing in Kezdi.jl.
Convenience functions
Kezdi.distinct — Function
distinct(x::AbstractVector) = unique(x)
Convenience function to get the distinct values of a vector.
Kezdi.rowcount — Function
rowcount(x::AbstractVector) = length(collect(skipmissing(x)))
Count the number of non-missing values in a vector.
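The two definitions above are short enough to try in isolation. The sketch below copies them as local functions so it runs without Kezdi.jl; the input vector is made up.

```julia
# Local copies of the documented one-line definitions.
distinct(x::AbstractVector) = unique(x)
rowcount(x::AbstractVector) = length(collect(skipmissing(x)))

x = [1, 1, missing, 3]

d = distinct(x)   # duplicates removed; missing is kept as a distinct value
c = rowcount(x)   # only the non-missing entries are counted
```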
Kezdi.DNV — Function
DNV(f(x))
Indicate that the function f should not be vectorized. The name DNV is only used for parsing; do not call it directly.
Acknowledgements
Inspiration for the package came from Tidier.jl, a similar package launched by Karandeep Singh that provides a dplyr-like interface for Julia. Johannes Boehm has also developed a similar package, Douglass.jl.
The package is built on top of DataFrames.jl, FreqTables.jl and FixedEffectModels.jl. The @with macro relies on Chain.jl by Julius Krumbiegel.
The package is named after Gabor Kezdi, a Hungarian economist who has made significant contributions to teaching data analysis.
[stata] Stata is a registered trademark of StataCorp LLC. Kezdi.jl is not affiliated with StataCorp LLC.