1

I have a array with JSON strings and each JSON has a different key. Also, the size of the array is not fixed and may increase or decrease in occurrences. An example of this is below :

"[{"key1": 10},{"key2":"val2"}]"

I have this as a column in a delta table which is the data which has been ingested from the vendor.

My question is how do we convert this to a format so I can easily access the keys and values for each occurrence using only Databricks SQL? If the json has a fixed key, it can easily be achieved by casting into a STRUCT. But since the keys are different, I cannot use that.

I tried using explode which gave me 2 rows, each with a json string. But again, How can I get the keys and values from each of those json strings using just Databricks SQL? Also, in this case, the Integer values are also converted to strings which means I lose the datatype. What I'm looking for is a VARIANT version of this.

[{"key_name":"key1","key_value":10},{"key_name":"key2","key_value":"val2"}]

What I have done so far :

%sql
CREATE OR REPLACE TEMPORARY VIEW tmp_array1 AS
SELECT '[{"key1":10},{"key2":"val2"}]' as properties;

CREATE OR REPLACE TEMPORARY VIEW tmp_v1 AS
SELECT from_json(properties, 'ARRAY<STRING>') as col FROM tmp_array1;
SELECT * FROM tmp_v1;

Gives me

enter image description here

And then

%sql
SELECT EXPLODE(col) FROM tmp_v1;

Gives me

enter image description here

So, I can access each occurences but how I access each value when I do not know what the key name may be?

2 Answers 2

1

I'm not exactly sure what you're looking for, but here is some examples of transforming the data in such a way that at the end I'm working with rows of key and value columns.

;WITH json_parsed_as_array_of_maps as 
(
  SELECT 
    parse_json('[{"key1":10},{"key2":"val2"}]')::ARRAY<MAP<STRING,STRING>> as arr
)--select * from json_parsed_as_array_of_maps
, unified_map AS
(
  SELECT 
    map_from_arrays
    (
      flatten(transform(arr, x -> MAP_KEYS(x))),
      flatten(transform(arr, x -> MAP_VALUES(x)))
    ) as one_map
  FROM json_parsed_as_array_of_maps
)--select * from unified_map
, exploded as
(
  SELECT
    explode(map_entries(one_map)) as pair 
  FROM unified_map
)--select * from exploded
SELECT
  pair.key, 
  pair.value 
FROM exploded

/*
key value
----------
key1    10
key2    val2
*/

and another way, this time presenting the combination of all the key/value pairs in a JSON array:

;WITH j as 
(
  SELECT 
    parse_json('[{"key1":10},{"key2":"val2"},{"key3":null}]')::ARRAY<STRING> as arr
)--select * from j
, unified_array(array_string) as
(
  SELECT 
    '['||
      ARRAY_JOIN
      (
        COLLECT_SET
        (
          --somewhat dirty string manipulation here (flawed if key or value string contains a colon)
          '{"key":'||REPLACE(SUBSTRING(item, 2, LEN(item)-2), ':', ',"value":')||'}'
        ), 
        ','
      )
    ||']' 
    FROM (SELECT EXPLODE(arr) as item FROM j) 
)--SELECT * FROM unified_array
SELECT 
  *
  --array_string:[2].key, array_string:[1].value
FROM unified_array

--[{"key":"key2","value":"val2"},{"key":"key3","value":null},{"key":"key1","value":10}]
Sign up to request clarification or add additional context in comments.

4 Comments

Thanks, but the problem with this you have converted the integer of 10 to a string. We have therefore lost the data type. So we cannot put the values in a single column. I need the output in a variant format which maintains the datatype of the values.
It would be good to see what you envision the final output looking like. Perhaps add a query of the solution-CTE you want and show what the outputs would be.
I added another query option to the end of my answer. This preserves the variant format, though it does do a little dirty string manipulation.
Thanks for this but this produces a string. However, if I just run a parse_json() on your string, I'm able to get the VARIANT version through which I can access any keyname and value. I've added another optional answer which requires Unity catalog. But I will mark your answer as correct since there is no dependency on Unity Catalog. Thanks.
0

I was able to achieve this using Databricks functions using Python. To use this, Unity catalog is mandatory.

First I create my dummy test data

%sql
CREATE OR REPLACE TEMPORARY VIEW tmp_array1 AS
    SELECT '[{"key1":10},{"key2":"val2"}]' as properties;

SELECT * FROM tmp_array1;

enter image description here

Create a function

%sql
CREATE OR REPLACE TEMPORARY FUNCTION test_func(s STRING)
  RETURNS STRING
  LANGUAGE PYTHON
  AS $$
    import json
    def flattenjson1(inputjson):
      dict = json.loads(inputjson)
      out= []
      for i in dict:
          for key, value in i.items():  
              out.append({"key_name":key, "key_value":value})
      #print(out)
      return json.dumps(out)
    return flattenjson1(s) if s else None
  $$

Now I can parse the flattened string.

%sql
CREATE OR REPLACE TEMPORARY VIEW view3 AS 
SELECT *, test_func(properties) as flatten from tmp_array1;

SELECT *,parse_json(flatten) as variant FROM view3;

enter image description here

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.