Relation Table Design
The Four Relations
Section titled “The Four Relations”Our data model has four SCD2 relation tables:
flowchart TD
subgraph Entities["Entity Tables"]
org["organizations<br/>1,284,668"]
ppl["people<br/>3,013,739"]
addr["addresses<br/>940,485"]
end
subgraph Relations["Relation Tables (SCD2)"]
loc["location<br/>org → address"]
manage["manage<br/>org → person"]
own["own<br/>org → person"]
pa["person_address<br/>person → address"]
end
org --> loc
addr --> loc
org --> manage
ppl --> manage
org --> own
ppl --> own
ppl --> pa
addr --> pa
Location: Organization → Address
Section titled “Location: Organization → Address”Organizations have locations (HQ, branches, sites) that change over time:
-- code/create/location.sqlCREATE TABLE location ASSELECT frame_id, address_id, valid_from, valid_till, loc_typeFROM ( SELECT frame_id, address_id, valid_from, valid_till, 'hq' AS loc_type FROM read_csv_auto('input/motherlode-opten_20250104/hq.csv') WHERE frame_id IS NOT NULL AND address_id IS NOT NULL UNION ALL SELECT frame_id, address_id, valid_from, valid_till, 'branch' AS loc_type FROM read_csv_auto('input/motherlode-opten_20250104/branch.csv') WHERE frame_id IS NOT NULL AND address_id IS NOT NULL UNION ALL SELECT frame_id, address_id, valid_from, valid_till, 'site' AS loc_type FROM read_csv_auto('input/motherlode-opten_20250104/site.csv') WHERE frame_id IS NOT NULL AND address_id IS NOT NULL);
COPY location TO 'temp/scd/location.parquet' (FORMAT PARQUET);Key point: Only organization addresses go here (branch, hq, site).
Manage: Organization → Person
Section titled “Manage: Organization → Person”Management relationships with full attributes:
-- code/create/manage.sqlCREATE TABLE manage ASSELECT frame_id, manager_id AS person_id, valid_from, valid_till, manager_type, sex, birth_year, consistent, country_code, board, position, self_liquidator, liquidatorFROM read_csv_auto('input/motherlode-opten_20250104/manage.csv')WHERE frame_id IS NOT NULL AND manager_id IS NOT NULL;
COPY manage TO 'temp/scd/manage.parquet' (FORMAT PARQUET);Note: We keep sex and birth_year here even though they’re also in people. This is intentional - they record the value at the time of the relationship, which might differ from the current entity value.
Own: Organization → Person
Section titled “Own: Organization → Person”Ownership with share information:
-- code/create/own.sqlCREATE TABLE own ASSELECT frame_id, owner_id AS person_id, valid_from, valid_till, owner_type, sex, birth_year, consistent, country, share, share_flag, share_sourceFROM read_csv_auto('input/motherlode-opten_20250104/own.csv')WHERE frame_id IS NOT NULL AND owner_id IS NOT NULL;
COPY own TO 'temp/scd/own.parquet' (FORMAT PARQUET);Person Address: The Bug Fix
Section titled “Person Address: The Bug Fix”We initially forgot this relation entirely. The address_id in manage.csv and own.csv is the person’s home address, not the organization’s location.
flowchart TD
subgraph Wrong["Initial Wrong Design"]
csv1["manage.csv<br/>address_id column"]
loc1["location table<br/>org → address"]
csv1 -->|"Thought it was org address"| loc1
end
subgraph Test["FK Test Failed!"]
fail["person_address.address_id<br/>NOT IN addresses → FAIL"]
end
subgraph Right["Correct Design"]
csv2["manage.csv<br/>address_id column"]
pa["person_address table<br/>person → address"]
csv2 -->|"Actually person home address"| pa
end
Wrong --> Test --> Right
The bug: FK integrity test failed:
person_address.address_id -> addresses: FAILThe fix: Create a separate relation table:
-- code/create/person_address.sqlCREATE TABLE person_address ASSELECT manager_id AS person_id, address_id, valid_from, valid_till, 'manager' AS relation_typeFROM read_csv_auto('input/motherlode-opten_20250104/manage.csv')WHERE manager_id IS NOT NULL AND address_id IS NOT NULLUNION ALLSELECT owner_id AS person_id, address_id, valid_from, valid_till, 'owner' AS relation_typeFROM read_csv_auto('input/motherlode-opten_20250104/own.csv')WHERE owner_id IS NOT NULL AND address_id IS NOT NULL;
COPY person_address TO 'temp/scd/person_address.parquet' (FORMAT PARQUET);Also had to update addresses.sql to include these addresses (see previous section).
The Final ERD
Section titled “The Final ERD”erDiagram
ORGANIZATIONS ||--o{ LOCATION : "has location"
ADDRESSES ||--o{ LOCATION : "is located at"
ORGANIZATIONS ||--o{ MANAGE : "is managed by"
PEOPLE ||--o{ MANAGE : "manages"
ORGANIZATIONS ||--o{ OWN : "is owned by"
PEOPLE ||--o{ OWN : "owns"
PEOPLE ||--o{ PERSON_ADDRESS : "lives at"
ADDRESSES ||--o{ PERSON_ADDRESS : "is home of"
Four relation tables connecting three entity tables. All with SCD2 (valid_from, valid_till).