I'm struggling to explode a JSON array column with Spark.

I have a dataframe that looks like:

+------+------------------------------------------------------------------------+
|id    |struct                                                                  |
+------+------------------------------------------------------------------------+
|  1   |  [{_name: BankAccount, _value: 123456}, {_name: Balance, _value: 500$}]|
|  2   |  [{_name: BankAccount, _value: 098765}, {_name: Balance, _value: 100$}]|
|  3   |  [{_name: BankAccount, _value: 135790}, {_name: Balance, _value: 200$}]|
+------+------------------------------------------------------------------------+

and I want it to be as

+------+------------+--------+
|id    | BankAccount| Balance|
+------+------------+--------+
|  1   |   123456   | 500$   |
|  2   |   098765   | 100$   |
|  3   |   135790   | 200$   |
+------+------------+--------+

Strictly speaking this is more than an explode, but I haven't managed to get anywhere close to the result I need.

Thanks for your help!

  • Can you share the printSchema output of the source dataframe? Commented Jul 6, 2020 at 10:41
  • What have you tried so far to achieve your goal? Commented Jul 6, 2020 at 11:35

1 Answer

See the code below.

Your sample does not show the column's actual type, so for simplicity I used a JSON string column named data instead of struct.

 val df = Seq(
   (1,"""[{"_name":"BankAccount","_value":"123456"},{"_name":"Balance","_value": "500$"}]"""),
   (2,"""[{"_name":"BankAccount","_value":"098765"},{"_name":"Balance","_value": "100$"}]"""),
   (3,"""[{"_name":"BankAccount","_value":"135790"},{"_name":"Balance","_value": "200$"}]""")
 ).toDF("id","data")

Print Schema of data

scala> df.printSchema
root
 |-- id: integer (nullable = false)
 |-- data: string (nullable = true)

Showing Sample Data

scala> df.show(false)
+---+--------------------------------------------------------------------------------+
|id |data                                                                            |
+---+--------------------------------------------------------------------------------+
|1  |[{"_name":"BankAccount","_value":"123456"},{"_name":"Balance","_value": "500$"}]|
|2  |[{"_name":"BankAccount","_value":"098765"},{"_name":"Balance","_value": "100$"}]|
|3  |[{"_name":"BankAccount","_value":"135790"},{"_name":"Balance","_value": "200$"}]|
+---+--------------------------------------------------------------------------------+

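Creating the schema for the JSON data (each array element is parsed as a map from string to string)

scala> import org.apache.spark.sql.types._
scala> val schema = ArrayType(MapType(StringType,StringType))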

Using explode, groupBy and pivot to get the expected result.

Note: you may need to adjust the code below to fit your requirements.

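// In spark-shell, enter this with :paste so the leading-dot chain is read as one expression.
df
// Parse the JSON string and emit one row per array element (each element is a map).
.withColumn("data",explode(from_json($"data",schema)))
// Pull the two map entries out into named fields, then flatten them into columns.
.select($"id",struct($"data"("_name").as("key"),$"data"("_value").as("value")).as("data"))
.select($"id",$"data.*")
// One output row per id, one output column per distinct key.
.groupBy($"id")
.pivot($"key") // pivot on a Column needs Spark 2.4+; use pivot("key") on older versions
.agg(first($"value"))
.select("id","BankAccount","Balance")
.orderBy($"id".asc)
.show(false)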

Final Result

+---+-----------+-------+
|id |BankAccount|Balance|
+---+-----------+-------+
|1  |123456     |500$   |
|2  |098765     |100$   |
|3  |135790     |200$   |
+---+-----------+-------+
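
If the set of keys is known up front, the explode/groupBy/pivot round trip can probably be avoided. The following is only a sketch under a few assumptions: Spark 2.4 or later (for map_from_entries), every row having unique, non-null _name values, and the object name PivotWithoutExplode and the helper attributesToColumns being made up for illustration. It parses the JSON as an array of structs, turns that array into a map, and looks each key up directly. If the source column is already an array of structs rather than a JSON string (the question never shows its schema), the from_json call could presumably be dropped.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json, map_from_entries}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

object PivotWithoutExplode {
  // Schema of one array element. Field order matters: map_from_entries
  // uses the first struct field as the key and the second as the value.
  val entrySchema: StructType = StructType(Seq(
    StructField("_name", StringType),
    StructField("_value", StringType)
  ))

  // Hypothetical helper (not part of Spark): builds one column per requested key
  // from the JSON string column "data", keeping the "id" column.
  def attributesToColumns(df: DataFrame, keys: Seq[String]): DataFrame = {
    // JSON string -> array<struct<_name,_value>> -> map<_name,_value>
    val withMap = df.withColumn(
      "attrs",
      map_from_entries(from_json(col("data"), ArrayType(entrySchema)))
    )
    // A key that is missing from a row simply yields null in that column.
    val keyCols = keys.map(k => col("attrs").getItem(k).as(k))
    withMap.select(col("id") +: keyCols: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pivot-without-explode")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the answer.
    val df = Seq(
      (1, """[{"_name":"BankAccount","_value":"123456"},{"_name":"Balance","_value":"500$"}]"""),
      (2, """[{"_name":"BankAccount","_value":"098765"},{"_name":"Balance","_value":"100$"}]"""),
      (3, """[{"_name":"BankAccount","_value":"135790"},{"_name":"Balance","_value":"200$"}]""")
    ).toDF("id", "data")

    attributesToColumns(df, Seq("BankAccount", "Balance"))
      .orderBy("id")
      .show(false)

    spark.stop()
  }
}
```

On the sample data this should give the same three rows as the final result above. The trade-off is that the keys have to be listed explicitly, whereas pivot discovers them from the data at the cost of a shuffle.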