Entity Table Design
The Goal
Section titled “The Goal”Extract unique entities from messy source files:
flowchart LR
subgraph Sources["Raw CSVs"]
s1["branch.csv<br/>~500K rows"]
s2["hq.csv<br/>~1.3M rows"]
s3["site.csv<br/>~200K rows"]
s4["manage.csv<br/>~3M rows"]
s5["own.csv<br/>~4M rows"]
end
subgraph Entities["Entity Tables"]
org["organizations.parquet<br/>1,284,668 unique"]
ppl["people.parquet<br/>3,013,739 unique"]
addr["addresses.parquet<br/>940,485 unique"]
end
s1 & s2 & s3 & s4 & s5 --> org
s4 & s5 --> ppl
s1 & s2 & s3 & s4 & s5 --> addr
Organizations
Section titled “Organizations”The simplest entity - just extract unique frame_id values:
-- code/create/organizations.sqlCREATE TABLE organizations ASSELECT DISTINCT frame_idFROM ( SELECT frame_id FROM read_csv_auto('input/motherlode-opten_20250104/branch.csv') UNION ALL SELECT frame_id FROM read_csv_auto('input/motherlode-opten_20250104/hq.csv') UNION ALL SELECT frame_id FROM read_csv_auto('input/motherlode-opten_20250104/site.csv') UNION ALL SELECT frame_id FROM read_csv_auto('input/motherlode-opten_20250104/manage.csv') UNION ALL SELECT frame_id FROM read_csv_auto('input/motherlode-opten_20250104/own.csv'));
COPY organizations TO 'temp/entities/organizations.parquet' (FORMAT PARQUET);Result: 1,284,668 unique organizations.
People
Section titled “People”People are more complex - they appear in manage.csv and own.csv with attributes:
flowchart TD
subgraph Sources["Source Files"]
m["manage.csv<br/>manager_id, sex, birth_year"]
o["own.csv<br/>owner_id, sex, birth_year"]
end
subgraph Problem["The Duplication Problem"]
r1["PP148348 | male | 1986"]
r2["PP148348 | male | NULL"]
r3["PP148348 | NULL | 1986"]
end
m --> r1
o --> r2
m --> r3
-- First attempt (WRONG)CREATE TABLE people ASSELECT DISTINCT person_id, sex, birth_year FROM ( SELECT manager_id AS person_id, sex, birth_year FROM read_csv_auto('input/motherlode-opten_20250104/manage.csv') UNION ALL SELECT owner_id AS person_id, sex, birth_year FROM read_csv_auto('input/motherlode-opten_20250104/own.csv'));Problem: 3,894,436 rows but only 3,013,739 unique person_ids!
The same person has different sex or birth_year values across records (mostly NULL vs non-NULL). SQL’s DISTINCT treats (male, 1986) and (male, NULL) as different tuples.
The Fix: MODE() Aggregation
Section titled “The Fix: MODE() Aggregation”flowchart LR
subgraph Input["Multiple Records"]
i1["male, 1986"]
i2["male, NULL"]
i3["NULL, 1986"]
end
mode["MODE()<br/>Pick most frequent"]
subgraph Output["Single Record"]
o1["male, 1986"]
end
i1 & i2 & i3 --> mode --> o1
-- code/create/people.sql (CORRECT)CREATE TABLE people ASSELECT person_id, MODE(sex) AS sex, MODE(birth_year) AS birth_yearFROM ( SELECT manager_id AS person_id, sex, birth_year FROM read_csv_auto('input/motherlode-opten_20250104/manage.csv') WHERE manager_id IS NOT NULL UNION ALL SELECT owner_id AS person_id, sex, birth_year FROM read_csv_auto('input/motherlode-opten_20250104/own.csv') WHERE owner_id IS NOT NULL)GROUP BY person_id;
COPY people TO 'temp/entities/people.parquet' (FORMAT PARQUET);MODE() picks the most frequent value, preferring non-NULL over NULL.
Result: 3,013,739 unique people. All PKs unique.
Addresses
Section titled “Addresses”Initial design only included organization addresses:
-- First attempt (INCOMPLETE)CREATE TABLE addresses ASSELECT DISTINCT address_id, settlement, WGS84_lon, WGS84_lat, EOV_X, EOV_YFROM ( SELECT ... FROM read_csv_auto('input/motherlode-opten_20250104/branch.csv') UNION ALL SELECT ... FROM read_csv_auto('input/motherlode-opten_20250104/hq.csv') UNION ALL SELECT ... FROM read_csv_auto('input/motherlode-opten_20250104/site.csv'));Problem discovered later: manage.csv and own.csv contain person home addresses. FK integrity test failed!
The Fix: Include All Address Sources
Section titled “The Fix: Include All Address Sources”-- code/create/addresses.sql (CORRECT)CREATE TABLE addresses ASSELECT DISTINCT address_id, settlement, WGS84_lon, WGS84_lat, EOV_X, EOV_YFROM ( -- Organization addresses (with geo data) SELECT address_id, settlement, WGS84_lon, WGS84_lat, EOV_X, EOV_Y FROM read_csv_auto('input/motherlode-opten_20250104/branch.csv') WHERE address_id IS NOT NULL UNION ALL SELECT address_id, settlement, WGS84_lon, WGS84_lat, EOV_X, EOV_Y FROM read_csv_auto('input/motherlode-opten_20250104/hq.csv') WHERE address_id IS NOT NULL UNION ALL SELECT address_id, settlement, WGS84_lon, WGS84_lat, EOV_X, EOV_Y FROM read_csv_auto('input/motherlode-opten_20250104/site.csv') WHERE address_id IS NOT NULL UNION ALL -- Person addresses (no geo data in source) SELECT address_id, NULL, NULL, NULL, NULL, NULL FROM read_csv_auto('input/motherlode-opten_20250104/manage.csv') WHERE address_id IS NOT NULL UNION ALL SELECT address_id, NULL, NULL, NULL, NULL, NULL FROM read_csv_auto('input/motherlode-opten_20250104/own.csv') WHERE address_id IS NOT NULL);
COPY addresses TO 'temp/entities/addresses.parquet' (FORMAT PARQUET);Person addresses have NULL geo data because the source doesn’t provide it.
The Lesson
Section titled “The Lesson”Entity extraction seems simple but hides complexity:
- Duplicates from multiple sources
- Conflicting attribute values
- Missing entities that only appear in relations
Always test PK uniqueness immediately after creating an entity table.