
I have two tables:

Table 1: Date (String: 'YYYYMMDD'), 
         Path (String: '/user/tom/', a directory path), 
         Size (Bigint: 1293232943, size of directory)

Table 2: Date (String: 'YYYYMMDD'), 
         Path (String: '/user/tom/logs/file.txt', a file path), 
         Count (Bigint: 282, number of times file has been opened)

I need to run several different queries that collect the total access count under a directory, which I currently find by querying table 2 for all paths matching concat(t1.Path, '%'). Is there a better way to structure these tables so that queries like this are efficient and, most importantly, the database does not take too much space?
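Roughly, the query I keep writing looks like this (column names as in the tables above; table names are made up):

```sql
-- Total opens for every file under one directory,
-- matched by string prefix on the path.
SELECT sum(t2.Count) AS total_access_count
FROM   table2 t2
JOIN   table1 t1 ON t2.Path LIKE concat(t1.Path, '%')
WHERE  t1.Path = '/user/tom/';
```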

So I tried creating a third table that stores an id for each path. The query becomes more complex and it stores less data, but it's still somewhat meaningless since there is no structure to the id: it just assigns the next number to each path that isn't already in the table.

I'm trying to find the most efficient, space-saving way to store this data. Any help would be appreciated.

  • Do not store date values as strings; store them as date. Commented Sep 17, 2019 at 21:34
  • How many rows do you expect to be stored in table 2? Commented Sep 17, 2019 at 21:35

2 Answers


Do you have two tables, or do you hope to have two tables once you get them designed? I'll assume you are designing the tables.

Looking at https://www.postgresql.org/docs/9.5/datatype.html...

Your date can be stored as text (9 bytes for 'YYYYMMDD'), a date (4 bytes), or an int (4 bytes; a yyyymmdd value like 20190917 does not fit in a smallint).

Your path should probably be stored as text (content + 1 byte). If you can be more specific about your maximum size requirements, a varchar(n) would improve the "self-documentation" of the database design (and take the same amount of space as text).

Your size and count could be stored as smallint (2 bytes), int (4 bytes), or bigint (8 bytes) depending on the maximum values that may be in that column. Based on the values you provided, size would be an int and count would be a smallint.

In my experience, databases are really fast with integers, although indexing will affect that as well. If yyyymmdd is the required format for the storage of the date values, I would store it as an int, not a character type (though a proper date is the same 4 bytes and self-documenting).
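If you do store it as an int, converting back and forth for display is cheap; a sketch with illustrative values:

```sql
-- int yyyymmdd -> date, and back.
SELECT to_date(20190917::text, 'YYYYMMDD');          -- 2019-09-17
SELECT to_char(date '2019-09-17', 'YYYYMMDD')::int;  -- 20190917
```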

So if you go with...

date     int
path     text
sub-path text
size     int
count    smallint

...and given the values you provided, each row in table1 will carry roughly 17 bytes of column data and each row in table2 roughly 29 bytes (text lengths vary, and PostgreSQL adds a per-row header of about 23 bytes plus alignment padding on top of that).

As far as making the queries fast, that's a matter of how the queries are written as well as other factors, like available server resources and indexing.

You can add surrogate keys in the tables and set primary keys, as appropriate. I would use int for the surrogate keys, but this depends on the maximum values (number of rows) in the tables. Joining on indexed fields (like primary keys) is fast. Remember, another column means more storage requirement. But storage is cheap. I'd go with this option unless you have extraordinary space limitations.
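A sketch of that layout with surrogate keys (all names are illustrative):

```sql
CREATE TABLE table1 (
  table1_id serial PRIMARY KEY  -- surrogate key
, date      int  NOT NULL       -- yyyymmdd as int
, path      text NOT NULL
, size      int
);

CREATE TABLE table2 (
  table2_id int  serial PRIMARY KEY
, table1_id int  NOT NULL REFERENCES table1  -- join on this, not on path
, date      int  NOT NULL
, sub_path  text NOT NULL
, count     smallint
);
```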

If you don't use surrogate keys, experiment with query performance using joins other than table2.sub_path like concat(table1.path, '%'). like is slow. You might try something like table1.path = substring(table2.sub_path from 1 for char_length(table1.path)), although throwing multiple computations at the join expression may make it worse.
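For example, the substring variant as a full query (a sketch; measure it against like on your own data):

```sql
SELECT sum(t2.count) AS total_access_count
FROM   table1 t1
JOIN   table2 t2
  ON   t1.path = substring(t2.sub_path from 1 for char_length(t1.path))
WHERE  t1.path = '/user/tom/';
```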



I'm trying to find the most efficient way and that saves space to store this data.

CREATE TABLE table1 (
  table1_id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
, some_date date NOT NULL         -- 'YYYYMMDD' 
, path      text UNIQUE NOT NULL  -- 'user/tom'
, size      bigint                -- 1293232943
);

CREATE TABLE table2 (
  table1_id    int  NOT NULL REFERENCES table1  -- 12345
, some_date    date NOT NULL                    -- 'YYYYMMDD' 
, sub_path     text NOT NULL                    -- 'logs/file.txt'
, count_opened int                              -- 282
, PRIMARY KEY (table1_id, sub_path)
);

Add a surrogate PK to table1 and reference that in table2 to minimize storage. (You typically need a PK for table1 anyway.) Paths tend to get long, so this should pay off.

Don't store noise like leading and trailing '/' if it can be assumed to be there always; you can add it back cheaply for display.

Use proper date type for dates (4 bytes).

This assumes count_opened does not go beyond int range (?); else stick with bigint.

What can or needs to be NOT NULL depends on unknown details of the use case.

For convenient viewing you might add a view:

CREATE VIEW table2_full AS
SELECT table1_id
     , '/' || t1.path || '/' || t2.sub_path AS total_path
     , t2.some_date, t2.sub_path, t2.count_opened
FROM   table1 t1
JOIN   table2 t2 USING (table1_id);
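Querying the view then returns reassembled full paths, for example:

```sql
SELECT total_path, count_opened
FROM   table2_full
WHERE  table1_id = 12345;
```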

... collect the total access count under a directory ...
Is there a better way to structure this table so that queries like this are efficient and most importantly, the database does not take too much space.

SELECT sum(count_opened) AS total_access_count
FROM   table2
WHERE  table1_id = 12345;

This is very fast, supported by the index on the PK, and considerably faster than matching leading strings for the purpose (even with an optimal index supporting them).

To get best read performance from an index-only scan you might add another multicolumn index on table2 (table1_id, count_opened).
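That index could look like this (index name is made up):

```sql
-- Covers the aggregate query entirely, enabling an index-only scan
-- (as long as the visibility map is reasonably current).
CREATE INDEX table2_t1id_count_idx ON table2 (table1_id, count_opened);
```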

If you only have the path for the query, look up the ID with a subquery:

SELECT sum(count_opened) AS total_access_count
FROM   table2
WHERE  table1_id = (SELECT table1_id FROM table1 WHERE path = 'user/tom');

Again, for optimal read performance, add a multicolumn index on table1 (path, table1_id).
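For example (index name is made up; this widens the lookup on path by the id so it can be served index-only):

```sql
CREATE INDEX table1_path_id_idx ON table1 (path, table1_id);
```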

