# Common Mistakes
## Mistake 1: The Single Table Too Early

```mermaid
flowchart LR
    subgraph Wrong["The Temptation"]
        single["One Big Table<br/>firm + owner + manager + address"]
    end
    subgraph Right["The Right Way"]
        ent["Entity Tables"]
        rel["Relation Tables"]
        analysis["Analysis Table<br/>(created last)"]
        ent --> rel --> analysis
    end
```
> “If you don’t do this, you create a single table too early, then you will keep chasing yourself around in recovery.”
In our hackathon data:
- If we merged firms + owners immediately, we’d have 1.3M × multiple owners = explosion
- Then adding managers = more explosion
- Then adding historical data = complete mess
**Rule:** Keep entity and relation tables separate. Merge only for a specific analysis.
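To make the rule concrete, here is a minimal DuckDB-style sketch. The table and column names (`firms`, `people`, `owns`) are illustrative, not our actual schema:

```sql
-- Entity tables: one row per real-world thing
CREATE TABLE firms  (firm_id BIGINT PRIMARY KEY, name TEXT);
CREATE TABLE people (person_id BIGINT PRIMARY KEY, birth_year INTEGER);

-- Relation table: one row per (firm, owner) link, carrying relation attributes
CREATE TABLE owns (
    firm_id    BIGINT,  -- FK to firms
    person_id  BIGINT,  -- FK to people
    valid_from DATE,    -- relation attribute, not an entity attribute
    valid_till DATE
);

-- Analysis table: merged LAST, for one specific question
CREATE TABLE firm_owner_analysis AS
SELECT f.firm_id, f.name, p.person_id, p.birth_year, o.valid_from
FROM firms f
JOIN owns o   ON o.firm_id   = f.firm_id
JOIN people p ON p.person_id = o.person_id;
```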
## Mistake 2: Confusing Entity vs Relation Attributes

We made this mistake! Look at `manage.csv`:
| Column | Is it… |
|---|---|
| `frame_id` | FK to organization |
| `manager_id` | FK to person |
| `valid_from` | Relation attribute |
| `sex` | Person attribute? Or relation attribute? |
| `birth_year` | Person attribute? Or relation attribute? |
| `address_id` | ??? |
We initially thought `address_id` was the organization’s address. Wrong! It’s the manager’s home address.
```mermaid
flowchart TD
    subgraph Wrong["Wrong interpretation"]
        manage1["manage.csv address_id"]
        loc1["organization location"]
        manage1 --> loc1
    end
    subgraph Right["Correct interpretation"]
        manage2["manage.csv address_id"]
        home["manager's home address"]
        manage2 --> home
    end
```
The FK test caught this:
```
person_address.address_id -> addresses: FAIL
```
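The check behind that line can be a simple anti-join. A sketch, assuming DuckDB; `manage`, `address_id`, and `addresses` follow the names above, while the query shape is our assumption:

```sql
-- FK integrity sketch: count address_ids in manage that do not resolve
SELECT COUNT(*) AS orphan_rows
FROM manage m
LEFT JOIN addresses a ON m.address_id = a.address_id
WHERE m.address_id IS NOT NULL
  AND a.address_id IS NULL;
-- 0 orphan_rows = PASS; anything else = FAIL
```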
## Mistake 3: Not Testing PK Uniqueness Immediately

Our first `people.sql`:

```sql
-- WRONG: Produces duplicates!
CREATE TABLE people AS
SELECT DISTINCT person_id, sex, birth_year
FROM (...);
```

We didn’t test until later. When we did:
```
people: 3,894,436 total, 3,013,739 unique → 880,697 DUPLICATES!
```

**Rule:** Test PK uniqueness immediately after creating any entity table.
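A sketch of the kind of diagnostic that produces those counts (the query is ours, not the hackathon script):

```sql
-- Duplicate diagnostic: total rows vs. distinct person_ids
SELECT
    COUNT(*)                             AS total_rows,
    COUNT(DISTINCT person_id)            AS unique_ids,
    COUNT(*) - COUNT(DISTINCT person_id) AS duplicates
FROM people;
```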
```sql
-- Add this to your workflow:
SELECT COUNT(*) = COUNT(DISTINCT person_id) AS pk_unique
FROM people;
```
## Mistake 4: Trusting NULL Handling

SQL’s NULL handling is tricky:

```sql
-- These are considered DIFFERENT by DISTINCT:
('male', 1986)
('male', NULL)
(NULL, 1986)
```

Our fix used `MODE()`:
```sql
SELECT
    person_id,
    MODE(sex)        AS sex,        -- picks most frequent, prefers non-NULL
    MODE(birth_year) AS birth_year
FROM (...)
GROUP BY person_id;
```
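To see why this collapses the duplicates, here is a self-contained DuckDB example mirroring the three rows above:

```sql
-- The three conflicting rows for one person collapse to a single clean row,
-- because MODE() skips NULLs when picking the most frequent value
WITH raw(person_id, sex, birth_year) AS (
    VALUES (1, 'male', 1986),
           (1, 'male', NULL),
           (1, NULL,   1986)
)
SELECT person_id, MODE(sex) AS sex, MODE(birth_year) AS birth_year
FROM raw
GROUP BY person_id;
-- → one row: (1, 'male', 1986)
```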
## Mistake 5: Schemas in Documentation

We wrote column types in the README:

| Column | Type |
|---|---|
| EOV_X | double |
| EOV_Y | double |

Then changed the code to use integers (EOV is in meters, no decimals needed).
> “The schema is gonna be in the header of the parquet anyway… if they become out of sync, then your agent’s gonna get confused.”
**Rule:** Don’t duplicate schema in documentation. Let the data be self-documenting.
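In DuckDB, for example, the authoritative schema is one query away. A sketch, using the `temp/entities/people.parquet` path from the commands at the end of this section:

```sql
-- Read the schema straight from the Parquet header instead of the README
SELECT name, type
FROM parquet_schema('temp/entities/people.parquet');

-- Or equivalently:
DESCRIBE SELECT * FROM 'temp/entities/people.parquet';
```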
## Mistake 6: Not Using Standard Names
Section titled “Mistake 6: Not Using Standard Names”Good names that AI understands:
| Convention | Meaning |
|---|---|
| `temp/scd/` | Slowly Changing Dimensions |
| `valid_from`, `valid_till` | SCD2 time bounds |
| `*_id` suffix | Primary/foreign key |
| `code/create/` | Scripts that create tables |
| `code/test/` | Test scripts |
During the hackathon:

> “As soon as Haiku saw that the folder name was scd, it immediately realized that it’s slowly changing dimension.”
Use standard conventions. The AI knows them.
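Put together, the layout looks something like this (illustrative; only the folder and file names already mentioned in this section are from our repo):

```
code/
  create/        # scripts that create tables
  test/          # pk_uniqueness.sql, fk_integrity.sql
temp/
  entities/      # people.parquet and other entity tables
  scd/           # slowly changing dimensions (valid_from, valid_till)
```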
## The Test-Driven Data Pipeline

Our final workflow:
```mermaid
flowchart TD
    write["Write SQL script"]
    run["Run: make table.parquet"]
    test_pk["Test PK uniqueness"]
    test_fk["Test FK integrity"]
    commit["Commit if all pass"]
    write --> run --> test_pk
    test_pk -->|PASS| test_fk
    test_pk -->|FAIL| write
    test_fk -->|PASS| commit
    test_fk -->|FAIL| write
```
```sh
# Our actual commands:
make temp/entities/people.parquet
make test   # runs pk_uniqueness.sql and fk_integrity.sql
git add -A && git commit -m "Add people entity table"
```
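For reference, a hedged sketch of what `pk_uniqueness.sql` and `fk_integrity.sql` might contain; only the file names and the checks themselves come from the text above, the query shapes are ours:

```sql
-- pk_uniqueness.sql (sketch): one boolean per entity table
SELECT 'people' AS tbl,
       COUNT(*) = COUNT(DISTINCT person_id) AS pk_unique
FROM 'temp/entities/people.parquet';

-- fk_integrity.sql (sketch): orphan count per relation, 0 means PASS
-- (assumes manage and addresses are loaded as tables or views)
SELECT 'manage.address_id -> addresses' AS fk,
       COUNT(*) AS orphans
FROM manage m
LEFT JOIN addresses a ON m.address_id = a.address_id
WHERE m.address_id IS NOT NULL AND a.address_id IS NULL;
```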