Skip to content

What is Entity Relational Modeling?

import { Code } from ‘@astrojs/starlight/components’;

Entity Relational Modeling (ERD) is a way of organizing data into:

  1. Entities - Things that exist (people, firms, addresses)
  2. Relations - Connections between things (manages, owns, lives_at)

“It’s a 50-year-old concept. It’s very well researched. I think everybody should have understood this already.”

Our Hackathon Example: Hungarian Company Registry

Section titled “Our Hackathon Example: Hungarian Company Registry”

We have data about Hungarian companies with 1.3 million firms and 3 million people. The raw data comes in 5 CSV files:

flowchart LR
    subgraph Input["Raw CSV Files"]
        branch["branch.csv"]
        hq["hq.csv"]
        site["site.csv"]
        manage["manage.csv"]
        own["own.csv"]
    end

    subgraph Entities["Entity Tables"]
        orgs["organizations.parquet<br/>1,284,668 firms"]
        people["people.parquet<br/>3,013,739 people"]
        addr["addresses.parquet<br/>940,485 addresses"]
    end

    subgraph Relations["Relation Tables (SCD2)"]
        loc["location.parquet"]
        mgmt["manage.parquet"]
        ownership["own.parquet"]
        home["person_address.parquet"]
    end

    branch --> orgs
    hq --> orgs
    site --> orgs
    manage --> orgs
    own --> orgs

    manage --> people
    own --> people

    branch --> addr
    hq --> addr
    site --> addr
    manage --> addr
    own --> addr

    orgs --> loc
    addr --> loc
    orgs --> mgmt
    people --> mgmt
    orgs --> ownership
    people --> ownership
    people --> home
    addr --> home

Here’s the entity-relationship diagram we built during the hackathon:

erDiagram
    ORGANIZATIONS {
        string frame_id PK
    }

    PEOPLE {
        string person_id PK
        string sex
        int birth_year
    }

    ADDRESSES {
        string address_id PK
        string settlement
        int EOV_X
        int EOV_Y
    }

    ORGANIZATIONS ||--o{ LOCATION : "has location"
    ADDRESSES ||--o{ LOCATION : "is located at"
    ORGANIZATIONS ||--o{ MANAGE : "is managed by"
    PEOPLE ||--o{ MANAGE : "manages"
    ORGANIZATIONS ||--o{ OWN : "is owned by"
    PEOPLE ||--o{ OWN : "owns"
    PEOPLE ||--o{ PERSON_ADDRESS : "lives at"
    ADDRESSES ||--o{ PERSON_ADDRESS : "is home of"

Here’s what actual people data looks like after extraction:

┌──────────────────┬─────────┬────────────┐
│ person_id │ sex │ birth_year │
├──────────────────┼─────────┼────────────┤
│ ft11220745 │ NULL │ NULL │
│ PP148348_1270409 │ male │ 1986 │
│ PP146438_1269421 │ male │ 1974 │
│ PP139250_139251 │ male │ 1974 │
│ PP136480_1265814 │ female │ 1975 │
│ PP14780 │ male │ 1973 │
│ PP84682_2200 │ female │ NULL │
│ PHM529_3044 │ female │ NULL │
└──────────────────┴─────────┴────────────┘

Notice the ID prefixes:

  • ft - firms (yes, firms can be managers/owners!)
  • PP - natural persons from company registry
  • PHM - natural persons from other sources
  • FP, PR - other person types

Without ERD, you end up with the “single table trap.” During the hackathon, we discussed a painful example:

“Every single week, half of the meeting was about… this is supposed to be bid level, not tender level. It’s tender part level. Is this bidder level? No, it’s bid level.”

With proper ERD:

  • Entity tables have one row per entity (one person = one row)
  • Relation tables connect entities with timestamps
  • No confusion about unit of observation
Entity TableRowsPrimary Key
organizations1,284,668frame_id
people3,013,739person_id
addresses940,485address_id
Relation TableConnects
locationorg → address (HQ, branch, site)
manageorg → person (managers)
ownorg → person (owners with shares)
person_addressperson → address (home addresses)