Does replacing string column with repeating values at some point with an int FK have any performance benefits in RDBMS? [duplicate]

Question

To store the country information for a person, I created:

CREATE TABLE test
  (
     id      INT IDENTITY(1, 1),
     name    VARCHAR(100) NOT NULL,
     country VARCHAR(100) NOT NULL,
     PRIMARY KEY(id)
  );

INSERT INTO test (name, country)
VALUES      ('Amy', 'Mexico'),
            ('Tom', 'US'),
            ('Mark', 'Morocco'),
            ('Izzy', 'Mexico');
-- millions of other rows

A lot of the countries will repeat themselves in the country column.

Another option would be to move the country into its own table and reference the country_id as a FK in the test table:

CREATE TABLE countries
  (
     id   INT IDENTITY(1, 1),
     name VARCHAR(100) NOT NULL,
     PRIMARY KEY(id)
  );

CREATE TABLE test
  (
     id         INT IDENTITY(1, 1),
     name       VARCHAR(100) NOT NULL,
     country_id INT NOT NULL,
     PRIMARY KEY(id),
     FOREIGN KEY(country_id) REFERENCES countries(id)
  );

My question is: does the second scenario offer any benefit from a performance or indexing point of view, or is it just cumbersome? (I know I am not breaking any normal form with the first scenario.)

Tim Biegeleisen · Accepted Answer · 2024-03-03 11:22:08Z

0

The second version has an obvious performance benefit: each person row stores only an integer country ID rather than a repeated country-name string. This, in turn, reduces the storage requirements for the table and its indexes.

Because an index on the second version would be keyed on the integer country ID rather than the country name, I would expect index performance to improve. Your database doesn't "know" that there is only a fixed number of countries, so the index for the first version would be a B-tree built over text keys rather than integers. Text keys are generally wider than integer keys, so fewer of them fit on each index page.
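
A rough way to check the storage claim is to build both versions and compare their size on disk. The sketch below is a stand-in, not a benchmark: it assumes Python's built-in sqlite3 module instead of the SQL Server dialect used in the question, and the row count, the country list, the index name ix_test_country and the helpers build_db and compare_sizes are all made up for illustration. Absolute numbers will differ on other engines; only the direction of the difference is of interest.

```python
import os
import sqlite3
import tempfile

# Hypothetical sample data; not taken from the question.
COUNTRIES = ["Mexico", "US", "Morocco", "United Kingdom", "Germany"]
ROWS = 50_000


def build_db(path, normalized):
    """Create one schema variant, fill it, index the country column,
    and return the resulting database file size in bytes."""
    con = sqlite3.connect(path)
    if normalized:
        # Second version: integer FK into a separate countries table.
        con.execute(
            "CREATE TABLE countries (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
        )
        con.executemany(
            "INSERT INTO countries (id, name) VALUES (?, ?)",
            list(enumerate(COUNTRIES, start=1)),
        )
        con.execute(
            "CREATE TABLE test ("
            " id INTEGER PRIMARY KEY,"
            " name TEXT NOT NULL,"
            " country_id INTEGER NOT NULL REFERENCES countries(id))"
        )
        rows = ((f"person{i}", i % len(COUNTRIES) + 1) for i in range(ROWS))
        con.executemany("INSERT INTO test (name, country_id) VALUES (?, ?)", rows)
        con.execute("CREATE INDEX ix_test_country ON test (country_id)")
    else:
        # First version: country name repeated in every row.
        con.execute(
            "CREATE TABLE test ("
            " id INTEGER PRIMARY KEY,"
            " name TEXT NOT NULL,"
            " country TEXT NOT NULL)"
        )
        rows = ((f"person{i}", COUNTRIES[i % len(COUNTRIES)]) for i in range(ROWS))
        con.executemany("INSERT INTO test (name, country) VALUES (?, ?)", rows)
        con.execute("CREATE INDEX ix_test_country ON test (country)")
    con.commit()
    con.close()
    return os.path.getsize(path)


def compare_sizes():
    """Return the file size in bytes of each schema variant."""
    with tempfile.TemporaryDirectory() as tmp:
        return {
            "denormalized": build_db(os.path.join(tmp, "text.db"), False),
            "normalized": build_db(os.path.join(tmp, "fk.db"), True),
        }


if __name__ == "__main__":
    for variant, size in compare_sizes().items():
        print(f"{variant}: {size} bytes")
```

Note that this only measures storage. Whether the smaller table and index translate into faster queries depends on the workload and on whether the country name has to be joined back in.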

1 Comment

Charlieface Over a year ago

Depends how wide it is. Countries have a standard two-letter code, so the key could be 2 bytes rather than 4. You also save one join.

NickW · Accepted Answer · 2024-03-03 12:12:44Z

0

The second one will, in general, have poorer performance, as joins are always costly: any query on test that includes the country name would need to join to the countries table. The storage difference is not going to affect query performance.
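
To make the extra join concrete, here is a small sketch of the same lookup against both designs. It assumes Python's built-in sqlite3 module as a stand-in for the asker's database, and the table names test_flat and test_fk are hypothetical, renamed only so both designs fit in one database. It shows the shape of the queries, not their cost.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    -- First design: country name stored inline (hypothetical name test_flat).
    CREATE TABLE test_flat (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        country TEXT NOT NULL
    );
    INSERT INTO test_flat (name, country) VALUES
        ('Amy', 'Mexico'), ('Tom', 'US'), ('Mark', 'Morocco'), ('Izzy', 'Mexico');

    -- Second design: integer FK into countries (hypothetical name test_fk).
    CREATE TABLE countries (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    INSERT INTO countries (id, name) VALUES
        (1, 'Mexico'), (2, 'US'), (3, 'Morocco');

    CREATE TABLE test_fk (
        id         INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        country_id INTEGER NOT NULL REFERENCES countries(id)
    );
    INSERT INTO test_fk (name, country_id) VALUES
        ('Amy', 1), ('Tom', 2), ('Mark', 3), ('Izzy', 1);
    """
)

# First design: the country name comes straight from the row.
flat_rows = con.execute(
    "SELECT name, country FROM test_flat ORDER BY id"
).fetchall()

# Second design: showing the country name requires a join to countries.
joined_rows = con.execute(
    """
    SELECT t.name, c.name
    FROM test_fk AS t
    JOIN countries AS c ON c.id = t.country_id
    ORDER BY t.id
    """
).fetchall()

print(flat_rows)
print(joined_rows)
```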

In the real world, a country entity would likely have multiple attributes (ISO codes, regions, populations, etc.) and would therefore need to be normalised into its own entity. You’d then need to join to it and take the performance hit of joining, but that is, in general, outweighed by the benefits of normalisation, which is why we use normalisation rather than “one big table”.

1 Comment

Charlieface Over a year ago

That doesn't mean you need to use surrogate keys: you could use the 2-letter country code as the PK/FK, saving a join in many queries.
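
A sketch of that natural-key variant, again assuming Python's built-in sqlite3 module as a stand-in engine. The column names code and country_code are assumptions for illustration, and the two-letter values are ISO 3166-1 alpha-2 codes.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
con.execute("PRAGMA foreign_keys = ON")
con.executescript(
    """
    CREATE TABLE countries (
        code TEXT PRIMARY KEY CHECK (length(code) = 2),
        name TEXT NOT NULL
    );
    INSERT INTO countries (code, name) VALUES
        ('MX', 'Mexico'), ('US', 'United States'), ('MA', 'Morocco');

    CREATE TABLE test (
        id           INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        country_code TEXT NOT NULL REFERENCES countries(code)
    );
    INSERT INTO test (name, country_code) VALUES
        ('Amy', 'MX'), ('Tom', 'US'), ('Mark', 'MA'), ('Izzy', 'MX');
    """
)

# Filtering or displaying by country code needs no join.
people_in_mexico = [
    row[0]
    for row in con.execute(
        "SELECT name FROM test WHERE country_code = 'MX' ORDER BY id"
    )
]

# The foreign key still rejects codes that are missing from countries.
try:
    con.execute("INSERT INTO test (name, country_code) VALUES ('Zed', 'ZZ')")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True

print(people_in_mexico, fk_enforced)
```

A join is still needed whenever the full country name or another country attribute has to be shown; the saving applies to queries that can work with the code alone.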
