
I have a large table with 1 billion+ records. These are the important columns:

CREATE TABLE BigTable
(
   id BIGSERIAL PRIMARY KEY NOT NULL,
   hexid VARCHAR(255), -- usually 24 or 48 characters
   type VARCHAR(255),
   <and other columns>
);

I need to find some records matching on a substring of the hexid column as well as the type column. There is an index on the hexid column. I need to search for a list of strings matching the last 24 characters of the hexid (the records I am interested in will always have 48 character hexids). There are two queries that I could run and I'm wondering which will perform better:

Query 1 using WHERE IN:

SELECT * FROM BigTable
WHERE substring(hexid, 25) IN (
    'deadbeef4b00018c09bf5db1',
    '5f478c50deadbeef97039344',
    '5fc0b855a6a8b600deadbeef')
AND type IN ('OneType', 'SomeType', 'AnotherType')
ORDER BY id ASC;

with explain plan:

Sort  (cost=105476938.69..105477029.92 rows=36492 width=638)
  Sort Key: id
  ->  Seq Scan on BigTable  (cost=0.00..105468257.46 rows=36492 width=638)
        Filter: (((type)::text = ANY ('{OneType,SomeType,AnotherType}'::text[])) AND ("substring"((hexid)::text, 25) = ANY ('{deadbeef4b00018c09bf5db1,5f478c50deadbeef97039344,5fc0b855a6a8b600deadbeef}'::text[])))

Query 2 using regular expressions:

SELECT * FROM BigTable
WHERE hexid ~ '\w{24}(deadbeef4b00018c09bf5db1|5f478c50deadbeef97039344|5fc0b855a6a8b600deadbeef)'
AND type ~ '(One|Some|Another)Type'
ORDER BY id ASC;

with explain plan:

Sort  (cost=100022576.55..100022577.16 rows=243 width=638)
  Sort Key: id
  ->  Seq Scan on BigTable  (cost=0.00..100022566.92 rows=243 width=638)
        Filter: (((hexid)::text ~ '\w{24}(deadbeef4b00018c09bf5db1|5f478c50deadbeef97039344|5fc0b855a6a8b600deadbeef)'::text) AND ((type)::text ~ '(One|Some|Another)Type'::text))

Based on the explain plans the cost will be more or less the same.

  • Query 1 performs a substring, doesn't have pattern matching [fast], but does have WHERE IN clauses [can negatively affect performance].
  • Query 2 has pattern matching (that takes care of the substring) [slow], but doesn't have WHERE IN clauses [doesn't negatively affect performance].

Getting the data is a once-off process. I don't have the luxury of time to run these queries individually and see which one performs better; I need to make an informed decision on which one to go with.

  1. Which query will give the best performance?
  2. Why does the first explain plan show rows=36492 and the second rows=243?
  3. Is there perhaps another query that would perform much better? Keep in mind that, since it is a once-off query, adding additional indices would not be helpful.

(I'm interested in knowing the answers for PostgreSQL, but I guess it will be similar for other RDBMSs.)

3 Answers


The version with IN will be slightly faster, because = ANY is cheaper than ~.

The estimate suggests otherwise, but only because it estimates a different number of result rows.

You can make these queries faster with an index on the more selective condition.
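The answer doesn't say which condition is more selective on this particular table, so as an illustration only (index names are made up), assuming `type` filters out most rows, either a plain index or an expression index matching the substring predicate would work:

```sql
-- Hypothetical: if "type" is highly selective, a plain index on it lets the
-- planner fetch a small set of candidates before applying the substring test.
CREATE INDEX bigtable_type_idx ON BigTable (type);

-- Alternatively, an expression index that matches the query's predicate
-- exactly, so "substring(hexid, 25) IN (...)" can use it:
CREATE INDEX bigtable_hexid_suffix_idx ON BigTable (substring(hexid, 25));
```

Building either index requires a full pass over the billion-row table, which is why the asker ruled this out for a one-off query.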

  • Thanks @laurenz-albe. I suspected as much, due to the short-circuiting behaviour of the IN clause. Commented Feb 16, 2021 at 10:51

An ending substring cannot use index lookups because there is no starting key.

To optimize checking an ending substring, create an index over a computed column:

REVERSE(hexid)

Then you can filter:

REVERSE(hexid) LIKE reversedSearchString + '%'

This means a range check can be used against the index, which would not be possible otherwise.
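In PostgreSQL terms (where string concatenation is `||`, not `+`), the same idea is an expression index on `reverse(hexid)` plus a left-anchored `LIKE` pattern. A sketch, with illustrative index names; `text_pattern_ops` is needed so the index supports `LIKE` under a non-C collation:

```sql
-- Expression index on the reversed value
CREATE INDEX bigtable_hexid_rev_idx
    ON BigTable (reverse(hexid) text_pattern_ops);

-- The suffix search becomes a prefix search on the reversed column,
-- which the planner can turn into an index range scan:
SELECT * FROM BigTable
WHERE reverse(hexid) LIKE reverse('deadbeef4b00018c09bf5db1') || '%';
```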


You can get a pretty good estimate with:

CREATE TABLE foo AS SELECT * FROM bigtable LIMIT 1000000;

And run your query on this subset, which should be a lot faster than on the huge table. Since postgres regexps are slow, IN() should win by a large margin. I'll try with a bunch of generated data:

create unlogged table foo (s text not null);
insert into foo select generate_series(1,1000000);
explain analyze select * from foo where substring( s,4 ) in ('123','456','789','052','984','412');
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..16758.33 rows=30000 width=6) (actual time=10.366..69.614 rows=5400 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on foo  (cost=0.00..12758.33 rows=12500 width=6) (actual time=6.496..63.969 rows=1800 loops=3)
         Filter: ("substring"(s, 4) = ANY ('{123,456,789,052,984,412}'::text[]))
         Rows Removed by Filter: 331533
 Planning Time: 0.431 ms
 Execution Time: 69.758 ms
(8 rows)

Time: 70.641 ms
test=> explain analyze select * from foo where s ~ '\w{4}(123|456|789|052|984|412)';
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..10643.33 rows=100 width=6) (actual time=338.414..339.924 rows=0 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on foo  (cost=0.00..9633.33 rows=42 width=6) (actual time=335.171..335.171 rows=0 loops=3)
         Filter: (s ~ '\w{4}(123|456|789|052|984|412)'::text)
         Rows Removed by Filter: 333333
 Planning Time: 1.239 ms
 Execution Time: 339.950 ms
(8 rows)

Yikes, lol. Although, if you have many items in the IN(), forcing it to use a hash would probably be faster:

explain analyze select * from foo join (values ('123'),('456'),('789'),('052'),('984'),('412')) v on substring(s,4)=v.column1;
 Gather  (cost=1000.15..14800.15 rows=30000 width=38) (actual time=10.628..61.902 rows=5400 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Hash Join  (cost=0.15..10800.15 rows=12500 width=38) (actual time=6.636..56.072 rows=1800 loops=3)
         Hash Cond: ("substring"(foo.s, 4) = "*VALUES*".column1)
         ->  Parallel Seq Scan on foo  (cost=0.00..8591.67 rows=416667 width=6) (actual time=0.010..15.803 rows=333333 loops=3)
         ->  Hash  (cost=0.08..0.08 rows=6 width=32) (actual time=0.011..0.011 rows=6 loops=3)
               Buckets: 1024  Batches: 1  Memory Usage: 9kB
               ->  Values Scan on "*VALUES*"  (cost=0.00..0.08 rows=6 width=32) (actual time=0.003..0.006 rows=6 loops=3)
 Planning Time: 0.179 ms
 Execution Time: 62.063 ms

Filtering first on the length() of the text, to eliminate rows that can't possibly match, could speed it up too.
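A sketch of that idea applied to the original query, relying on the question's statement that the rows of interest always have 48-character hexids; `length()` is a cheap check that spares the `= ANY` comparison on rows that can't match:

```sql
SELECT * FROM BigTable
WHERE length(hexid) = 48               -- cheap pre-filter
  AND substring(hexid, 25) IN (
      'deadbeef4b00018c09bf5db1',
      '5f478c50deadbeef97039344',
      '5fc0b855a6a8b600deadbeef')
  AND type IN ('OneType', 'SomeType', 'AnotherType')
ORDER BY id ASC;
```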

Just for the heck of it, ClickHouse:

select * from foo where substr(s,10) in ('78593310911','21480020195','22970730260');
6 rows in set. Elapsed: 1.460 sec.
Processed 200.00 million rows, 5.68 GB (136.97 million rows/s., 3.89 GB/s.)
  • Thanks for the detailed explanation @bobflux Commented Feb 18, 2021 at 12:16
