Testing the Schema
Why Test?
Section titled “Why Test?”“This is good practice to save tests and run them from time to time.”
Data pipelines break silently. Without tests, you won’t know until your analysis produces nonsense.
flowchart TD
write["Write SQL script"]
run["Run: make table.parquet"]
test_pk["Test PK uniqueness"]
test_fk["Test FK integrity"]
commit["Commit if all pass"]
write --> run --> test_pk
test_pk -->|PASS| test_fk
test_pk -->|FAIL| write
test_fk -->|PASS| commit
test_fk -->|FAIL| write
Test 1: Primary Key Uniqueness
Section titled “Test 1: Primary Key Uniqueness”Every entity table must have unique primary keys:
-- code/test/pk_uniqueness.sqlSELECT '=== PK Uniqueness Tests ===' AS test;
-- Organizations: frame_id should be uniqueSELECT 'organizations: frame_id unique' AS test, CASE WHEN (SELECT COUNT(*) FROM 'temp/entities/organizations.parquet') = (SELECT COUNT(DISTINCT frame_id) FROM 'temp/entities/organizations.parquet') THEN 'PASS' ELSE 'FAIL' END AS result;
-- People: person_id should be uniqueSELECT 'people: person_id unique' AS test, CASE WHEN (SELECT COUNT(*) FROM 'temp/entities/people.parquet') = (SELECT COUNT(DISTINCT person_id) FROM 'temp/entities/people.parquet') THEN 'PASS' ELSE 'FAIL' END AS result;
-- Addresses: address_id should be uniqueSELECT 'addresses: address_id unique' AS test, CASE WHEN (SELECT COUNT(*) FROM 'temp/entities/addresses.parquet') = (SELECT COUNT(DISTINCT address_id) FROM 'temp/entities/addresses.parquet') THEN 'PASS' ELSE 'FAIL' END AS result;Run with:
duckdb < code/test/pk_uniqueness.sqlOutput:
┌────────────────────────────────┬─────────┐│ test │ result │├────────────────────────────────┼─────────┤│ organizations: frame_id unique │ PASS ││ people: person_id unique │ PASS ││ addresses: address_id unique │ PASS │└────────────────────────────────┴─────────┘Test 2: Foreign Key Integrity
Section titled “Test 2: Foreign Key Integrity”Every foreign key in a relation table must reference an existing primary key:
-- code/test/fk_integrity.sqlSELECT '=== FK Integrity Tests ===' AS test;
-- location: frame_id must exist in organizationsSELECT 'location.frame_id -> organizations' AS test, CASE WHEN (SELECT COUNT(DISTINCT frame_id) FROM 'temp/scd/location.parquet' WHERE frame_id NOT IN (SELECT frame_id FROM 'temp/entities/organizations.parquet')) = 0 THEN 'PASS' ELSE 'FAIL' END AS result;
-- location: address_id must exist in addressesSELECT 'location.address_id -> addresses' AS test, CASE WHEN (SELECT COUNT(DISTINCT address_id) FROM 'temp/scd/location.parquet' WHERE address_id NOT IN (SELECT address_id FROM 'temp/entities/addresses.parquet')) = 0 THEN 'PASS' ELSE 'FAIL' END AS result;
-- ... similar tests for manage, own, person_addressMakefile Integration
Section titled “Makefile Integration”Add a test target:
.PHONY: all test clean
all: $(ALL)
test: $(ALL) $(DUCKDB) < code/test/pk_uniqueness.sql $(DUCKDB) < code/test/fk_integrity.sql
clean: rm -rf temp/entities temp/scdNow make test runs all tests after building.
When Tests Fail
Section titled “When Tests Fail”We had a failure:
person_address.address_id -> addresses: FAILThis meant addresses in manage.csv and own.csv weren’t in our addresses entity table.
Debug process:
- Identify which addresses are missing
- Trace back to source files
- Realize these are person home addresses, not org addresses
- Update
addresses.sqlto include them - Re-run tests: PASS
Test Patterns
Section titled “Test Patterns”| Test | What It Catches |
|---|---|
| PK unique | Duplicate entities from bad deduplication |
| FK integrity | Missing entities, wrong join columns |
| Row counts | Silent filtering, missing sources |
| NULL checks | Unexpected missing data |
| Date validity | valid_from > valid_till errors |
Sample Data Verification
Section titled “Sample Data Verification”Beyond automated tests, always eyeball samples:
-- Quick sanity checkSELECT * FROM 'temp/entities/people.parquet' LIMIT 25;Look for:
- Reasonable values
- Expected data types
- ID formats that make sense
- NULLs where expected (and not where unexpected)