The Problem
The Raw Data
Section titled “The Raw Data”We have Hungarian company registry data from Opten, loaded via bead:
bead input load motherlode-opten_20250104This gives us 5 CSV files:
flowchart TD
subgraph Input["input/motherlode-opten_20250104/"]
branch["branch.csv<br/>Branch offices"]
hq["hq.csv<br/>Headquarters"]
site["site.csv<br/>Other sites"]
manage["manage.csv<br/>Management relations"]
own["own.csv<br/>Ownership relations"]
end
What’s In Each File?
Section titled “What’s In Each File?”Location files (branch, hq, site):
frame_id | address_id | valid_from | valid_till | settlement | WGS84_lon | WGS84_lat | EOV_X | EOV_YRelation files (manage, own):
frame_id | manager_id/owner_id | valid_from | valid_till | sex | birth_year | address_id | ...Notice the problem? The same address_id column means different things:
- In location files → organization’s address
- In relation files → person’s home address
We didn’t realize this until tests failed!
The Scale
Section titled “The Scale”| Source | Approximate Rows |
|---|---|
| branch.csv | ~500K |
| hq.csv | ~1.3M |
| site.csv | ~200K |
| manage.csv | ~3M |
| own.csv | ~4M |
Total: millions of rows, but manageable on a laptop with DuckDB.
What We Need
Section titled “What We Need”Transform this mess into a clean relational model:
flowchart LR
subgraph Before["Raw CSVs (Messy)"]
direction TB
m1["Entities mixed with relations"]
m2["Duplicates everywhere"]
m3["Ambiguous columns"]
end
subgraph After["Parquet Tables (Clean)"]
direction TB
e["Entity Tables<br/>• organizations<br/>• people<br/>• addresses"]
r["Relation Tables (SCD2)<br/>• location<br/>• manage<br/>• own<br/>• person_address"]
end
Before --> After
The Tools
Section titled “The Tools”DuckDB: Analytical SQL database, perfect for this scale
duckdb -c "SELECT COUNT(*) FROM read_csv_auto('input/.../manage.csv')"Parquet: Columnar storage format, self-describing schema
# No schema definition needed - DuckDB infers itCOPY table TO 'output.parquet' (FORMAT PARQUET);Makefile: Orchestrates the pipeline
temp/entities/people.parquet: code/create/people.sql $(PEOPLE_CSV) mkdir -p $(dir $@) $(DUCKDB) < $<Our Actual Makefile
Section titled “Our Actual Makefile”Here’s the real Makefile from the hackathon:
DUCKDB := duckdb
INPUT := input/motherlode-opten_20250104TEMP_ENTITIES := temp/entitiesTEMP_SCD := temp/scd
ORG_CSV := $(INPUT)/branch.csv $(INPUT)/hq.csv $(INPUT)/site.csv \ $(INPUT)/manage.csv $(INPUT)/own.csvPEOPLE_CSV := $(INPUT)/manage.csv $(INPUT)/own.csvADDR_CSV := $(INPUT)/branch.csv $(INPUT)/hq.csv $(INPUT)/site.csv \ $(INPUT)/manage.csv $(INPUT)/own.csv
ENTITIES := $(TEMP_ENTITIES)/organizations.parquet \ $(TEMP_ENTITIES)/people.parquet \ $(TEMP_ENTITIES)/addresses.parquetSCD := $(TEMP_SCD)/location.parquet $(TEMP_SCD)/manage.parquet \ $(TEMP_SCD)/own.parquet $(TEMP_SCD)/person_address.parquet
all: $(ENTITIES) $(SCD)
test: all @echo "=== PK Uniqueness ===" && $(DUCKDB) < code/test/pk_uniqueness.sql @echo "=== FK Integrity ===" && $(DUCKDB) < code/test/fk_integrity.sqlNotice:
- Variables at the top for easy configuration
- Dependencies are explicit (which CSVs each table needs)
make testruns aftermake all
The Approach
Section titled “The Approach”flowchart TD
explore["1. Explore raw data<br/>DESCRIBE, sample queries"]
design["2. Design entity tables<br/>What are the PKs?"]
relations["3. Design relation tables<br/>What connects what?"]
write["4. Write SQL scripts<br/>One per table"]
test["5. Test immediately<br/>PK unique? FK valid?"]
fix["6. Fix bugs<br/>Iterate"]
explore --> design --> relations --> write --> test
test -->|FAIL| fix --> write
test -->|PASS| done["Done!"]