Regex help in Snowflake

Question

I am trying to write a regex that can replace '[', ']' and '].' with '_' and in cases where ']' is the last character it should be replaced with '' but I am struggling to come up with a regex for it that works in all cases because Snowflake does not support lookahead & lookbehind.

My question is, has anyone tried/achieved to do this before? or is it impossible and I should just give up??

Eg:

look_behind[0] --> look_behind_0
look_behind[0]_positive[1] --> look_behind_0_positive_1

Hi, and welcome to dba.se! You have in your two record sample, the datum look_behind[0]_positive[1] which you want converted to look_behind_0_positive_1. You only have one underscore after the 0 - should that be two underscores or should it be only one? This changes the question greatly! — Vérace
– Vérace, Commented Apr 15, 2024 at 22:15

Vérace · Accepted Answer · 2024-04-15 23:40:14Z

As far as I can see, you have no need to use a regex at all!

Using a search engine, I found that the functions LEFT() (manual for all string functions), RIGHT() and TRANSLATE() have the same signatures (mostly) in Snowflake (1, 2 & 3) as in PostgreSQL. All of the code below is available on the fiddle here.

Populate with your sample data with extra records:

INSERT INTO tab VALUES 
('look_behind[0]'),
('look_behind[0]_positive[1]'),
('No_[brackets]_[at]_end'),      -- two added records for testing
('[]][][_sderfrx÷=/_[]ddf]');

Then, we do this:

--
-- Works where the string <underscore><left-bracket> (_[) OR the string 
-- <right-bracket><underscore> (]_) are to be replaced by two underscores.
--
-- If the last character of the string is <right-bracket> (]), it is replaced
-- by the empty string.
--
-- It is worth noting that the order of the replacements is important as
-- the operations are not commutative - i.e. x R y <> y R x (not necessarily!).
--
--

SELECT
  str,
  LENGTH(str) AS len,
  TRANSLATE
  (
    CASE
      WHEN (RIGHT(str, 1) = ']') THEN RIGHT(str, -1)
      ELSE str
    END,
    '[]',
    '__'          -- two underscores - one for each bracket (left & right)
  ) AS f_str,
  REGEXP_COUNT
  (
    str,
    '[\[|\]]',
    1,        -- default anyway - not required
    'i'       -- not relevant for non-alpha characters
  ) AS f_str_chars_cnt
FROM
  tab;

Result:

str                       len f_str                 f_str_chars_cnt
look_behind[0]             14 look_behind_0                     2
look_behind[0]_positive[1] 26 look_behind_0__positive_1         4
No_[brackets]_[at]_end     22 No__brackets___at__end            4
[]][][_sderfrx÷=/_[]ddf]   24 _______sderfrx÷=/___ddf           9

You could also do this:

--
-- Works where the string <underscore><left-bracket> (_[) OR the string 
-- <right-bracket><underscore> (]_) are to be replaced by one underscore. 
--
-- If the last character of the string is <right-bracket> (]), it is replaced
-- by the empty string.
--
-- It is worth noting that the order of the replacements is important as
-- the operations are not commutative - i.e. x R y <> y R x (not necessarily!).
--
--


SELECT
  str,
  TRANSLATE
  (
    CASE
      WHEN
        RIGHT  (REPLACE(REPLACE(str, '_[', '_'), ']_', '_'),  1) = ']' THEN 
          LEFT (REPLACE(REPLACE(str, '_[', '_'), ']_', '_'), -1)
        ELSE
          str
      END,
    '[]',
    '__'
  )
FROM 
  tab;

Result (the reader can count the underscores!):

str                         translate
look_behind[0]              look_behind_0
look_behind[0]_positive[1]  look_behind_0_positive_1
No_[brackets]_[at]_end      No__brackets___at__end
[]][][_sderfrx÷=/_[]ddf]    _______sderfrx÷=/__ddf

Regexes are expensive relative to simple string functions. If you really want a regular expression, you can do the following (REGEXP_REPLACE() also has a similar signature in PostgreSQL as in Snowflake):

-- If the last character of the string is <right-bracket> (]), it is replaced
-- by the empty string.
--
-- Works where the string <underscore><left-bracket> (_[) AND the string 
-- <right-bracket><underscore> (]_) are to be replaced by two underscores (__).
--
-- The order of the replacements for the desired outcome in terms of what character/
-- string is replaced first is left up to the reader.
-- 

SELECT
  REGEXP_REPLACE
  (
    REGEXP_REPLACE
    (
      str, 
      '(.*)]$',  -- replace with pattern 
      '\1'
    ),
    '[\[\]]',
    '_',      
    'g'       -- "global" flag to replace all occurrences
  ) 
FROM
  tab;

Result: same as first (TRANSLATE()) snippet above.

Performance analysis (see end of fiddle):

The entire results of the fiddle aren't included here, but just to see the times is interesting.

TRANSLATE()          - Execution Time: 0.037 ms
REGEXP_REPLACE()     - Execution Time: 0.100 ms
REPLACE(...REPLACE() - Execution Time: 0.027 ms

The time taken for the two string function solutions is roughly 33% of the time taken for the regular expression solution. This is why it's always worth checking to see if an ordinary string function will suffice in cases such as this.

Stack Exchange Network

Regex help in Snowflake

1 Answer 1

Hot Network Questions

Regex help in Snowflake

1 Answer 1

Related

Hot Network Questions