Consider the following data model in a PostgreSQL v13 system:

[Figure: parent-child data model]

Here, the parent table dim contains a small set of reference data, and the child table fact contains a much higher volume of records. A typical use case for these data sets is to query all fact::value data belonging to a given dim::name. Note that dim::name has a UNIQUE constraint.

While I think this is a very common scenario, I was somewhat taken aback that the style of query I've been using for years on other RDBMSs (Oracle, MSSQL) didn't perform on PostgreSQL the way I expected. That is, when querying a dataset (fact) using a highly selective, but implicit, predicate (fact::dim_id eq X) through a join condition, I expect the index on fact::dim_id to be used (in a nested loop). Instead, a hash join is used, requiring a full table scan of fact.

Question: is there some way I can nudge the query planner into recognising that a predicate on a joined relation makes a full table scan unnecessary (without impacting other DB loads)?

To illustrate the problem with an example, these tables are populated with some random data:

CREATE TABLE dim(
  id       SERIAL NOT NULL
, name     TEXT   NOT NULL
, CONSTRAINT pk_dim PRIMARY KEY (id)
, CONSTRAINT uq_dim UNIQUE (name)
);

CREATE TABLE fact(
  id        SERIAL  NOT NULL
, dim_id    INTEGER NOT NULL
, value     TEXT
, CONSTRAINT pk_fact PRIMARY KEY (id)
, CONSTRAINT fk_facts_dim FOREIGN KEY (dim_id) REFERENCES dim (id)
);

CREATE INDEX idx_fact_dim ON fact(dim_id);

INSERT INTO dim(name)
SELECT SUBSTRING(md5(random()::TEXT) FOR 5)
FROM   generate_series(1,50)
UNION
SELECT 'key';

INSERT INTO fact(dim_id, value)
SELECT (SELECT id FROM dim ORDER BY random() LIMIT 1)
,      md5(random()::TEXT)
FROM   generate_series(1,1000000);

ANALYZE dim;
ANALYZE fact;

EXPLAIN ANALYZE
SELECT f.*
FROM   fact AS f
JOIN   dim  AS d
       ON (d.id = f.dim_id)
WHERE  d.name = 'key';       -- Note: UNIQUE

                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1001.65..18493.29 rows=20588 width=41) (actual time=319.331..322.582 rows=0 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Hash Join  (cost=1.65..15434.49 rows=8578 width=41) (actual time=306.193..306.195 rows=0 loops=3)
         Hash Cond: (f.dim_id = d.id)
         ->  Parallel Seq Scan on fact f  (cost=0.00..14188.98 rows=437498 width=41) (actual time=0.144..131.050 rows=350000 loops=3)
         ->  Hash  (cost=1.64..1.64 rows=1 width=4) (actual time=0.138..0.139 rows=1 loops=3)
               Buckets: 1024  Batches: 1  Memory Usage: 9kB
               ->  Seq Scan on dim d  (cost=0.00..1.64 rows=1 width=4) (actual time=0.099..0.109 rows=1 loops=3)
                     Filter: (name = 'key'::text)
                     Rows Removed by Filter: 50
 Planning Time: 1.059 ms
 Execution Time: 322.662 ms

Now we run the same query, but instead of filtering through an inner join, we filter using a scalar subquery:

EXPLAIN ANALYZE
SELECT *
FROM   fact
WHERE  dim_id = (SELECT id FROM dim WHERE name = 'key');

                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_fact_dim on fact  (cost=2.07..15759.53 rows=524998 width=41) (actual time=0.096..0.097 rows=0 loops=1)
   Index Cond: (dim_id = $0)
   InitPlan 1 (returns $0)
     ->  Seq Scan on dim  (cost=0.00..1.64 rows=1 width=4) (actual time=0.046..0.054 rows=1 loops=1)
           Filter: (name = 'key'::text)
           Rows Removed by Filter: 50
 Planning Time: 0.313 ms
 Execution Time: 0.156 ms

As shown, the performance difference is huge. Somehow, in the first query the planner did not treat the predicate on the unique dim::name attribute as equivalent to a predicate on fact::dim_id.

1 Answer

I don't know how to nudge the planner to do the right thing in your example, and I agree that the filter condition should be pushed into the join. However, there is a mistake in your database population code, and on realistic data the planner does a decent job and does push the condition through the join (Postgres 16.3). This is too long for a comment, hence posting as a partial answer.

The subquery used to populate the fact table is not correlated with the outer query, so it is executed only once and every inserted row gets the same dim_id:

INSERT INTO fact(dim_id, value)
SELECT (SELECT id FROM dim ORDER BY random() LIMIT 1)
,      md5(random()::TEXT)
FROM   generate_series(1,1000000);

This issue was definitely present in your analysis, because both of your query plans report rows=0 as the actual result: no fact row references the 'key' row in dim.
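
One way to confirm this, as a quick sketch against the tables from the question, is to count the rows per dim_id and to look at what the planner has recorded for that column. Each run of the original INSERT should contribute exactly one dim_id, so I would expect very few groups here. The second query assumes ANALYZE fact has already been run.

```sql
-- How many fact rows does each dim row own?
-- Each execution of the original INSERT picks a single dim id for all of its
-- rows, so the result should contain one group per execution.
SELECT f.dim_id, d.name, count(*) AS n
FROM   fact AS f
JOIN   dim  AS d ON d.id = f.dim_id
GROUP  BY f.dim_id, d.name
ORDER  BY n DESC;

-- What the planner believes about fact.dim_id (requires a prior ANALYZE fact).
SELECT n_distinct, most_common_vals, most_common_freqs
FROM   pg_stats
WHERE  tablename = 'fact'
AND    attname   = 'dim_id';
```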

To make the code do what was intended, I added an artificial dependency on the outer row, which forces the subquery to be re-evaluated for every row (if you know a better way, please share):

INSERT INTO fact(dim_id, value)
SELECT (SELECT id FROM dim ORDER BY random() + 0*i LIMIT 1)
,      md5(random()::TEXT)
FROM generate_series(1,1000000) i;
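
A possible alternative, offered as a sketch rather than a known-better way: collect the dim ids into an array once and index into it with a per-row random(). The CTE name ids and the alias arr are made up for illustration, and the TRUNCATE assumes you want to discard the skewed rows from the earlier insert first.

```sql
-- Assumption: start from an empty fact table so the skewed rows are gone.
TRUNCATE fact;

-- Gather all dim ids into one array (a single-row CTE), then pick a random
-- element for each generated row. random() is volatile, so it is evaluated
-- for every row of generate_series rather than once for the whole statement.
WITH ids AS (
    SELECT array_agg(id) AS arr
    FROM   dim
)
INSERT INTO fact(dim_id, value)
SELECT ids.arr[1 + floor(random() * cardinality(ids.arr))::int]
,      md5(random()::TEXT)
FROM   ids
CROSS  JOIN generate_series(1, 1000000);

-- Refresh statistics so the planner sees the new distribution.
ANALYZE fact;
```

This avoids sorting dim once per generated row, which the ORDER BY random() subquery has to do, and it does not depend on the ids being contiguous.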

Now the "slow" and the "fast" queries are roughly equivalent:

EXPLAIN ANALYZE
SELECT f.*
FROM   fact AS f
JOIN   dim  AS d
       ON (d.id = f.dim_id)
WHERE  d.name = 'key';       -- Note: UNIQUE

                                                             QUERY PLAN                                                              
-------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=174.49..9963.30 rows=19608 width=41) (actual time=2.498..21.194 rows=19661 loops=1)
   ->  Seq Scan on dim d  (cost=0.00..1.64 rows=1 width=4) (actual time=0.012..0.018 rows=1 loops=1)
         Filter: (name = 'key'::text)
         Rows Removed by Filter: 50
   ->  Bitmap Heap Scan on fact f  (cost=174.49..9765.59 rows=19608 width=41) (actual time=2.481..17.209 rows=19661 loops=1)
         Recheck Cond: (dim_id = d.id)
         Heap Blocks: exact=8252
         ->  Bitmap Index Scan on idx_fact_dim  (cost=0.00..169.59 rows=19608 width=0) (actual time=1.262..1.262 rows=19661 loops=1)
               Index Cond: (dim_id = d.id)
 Planning Time: 0.843 ms
 Execution Time: 21.998 ms
(11 rows)

Compare to:

EXPLAIN ANALYZE
SELECT *
FROM   fact
WHERE  dim_id = (SELECT id FROM dim WHERE name = 'key');

                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on fact  (cost=176.12..9767.22 rows=19608 width=41) (actual time=2.782..15.379 rows=19661 loops=1)
   Recheck Cond: (dim_id = $0)
   Heap Blocks: exact=8252
   InitPlan 1 (returns $0)
     ->  Seq Scan on dim  (cost=0.00..1.64 rows=1 width=4) (actual time=0.008..0.012 rows=1 loops=1)
           Filter: (name = 'key'::text)
           Rows Removed by Filter: 50
   ->  Bitmap Index Scan on idx_fact_dim  (cost=0.00..169.59 rows=19608 width=0) (actual time=1.370..1.371 rows=19661 loops=1)
         Index Cond: (dim_id = $0)
 Planning Time: 0.255 ms
 Execution Time: 16.174 ms
(11 rows)
