
I need to process a JSON file with the following schema:

root
 |-- Header: struct (nullable = true)
 |    |-- Format: string (nullable = true)
 |    |-- Version: struct (nullable = true)
 |    |    |-- vfield: string (nullable = true)
 |-- Payload: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Data: array (nullable = true)
 |    |    |    |-- element: array (containsNull = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |    |-- Event: struct (nullable = true)
 |    |    |    |-- eventCount: long (nullable = true)
 |    |    |    |-- eventName: string (nullable = true)

When I load it into a DataFrame there is only one row, but that row contains many Data and Event elements in the Payload array. (Each element has either Data or Event, never both.)

I would like to get all the events so I can perform further operations on them, or perhaps load them into a DB table later. To do that I need all the elements of Payload that have an Event, and I don't need the ones that only have Data. Ideally I would end up with a DataFrame whose rows contain only the members of Event.

Unfortunately when I tried something like this:

df.select("Payload.Event") or df.select(`Payload`).filter(...)

it was still filtering at the root level, and since there is only one row in the DataFrame that was not very helpful. How can I filter the internal array and get its elements as a separate DataFrame?

Sample json:

{
    "Header": {
        "Version": {
            "vfield": "0.6"
        },
        "Format": "DEFAULT"
    },
    "Payload": [
        {"Data": [
            [0, 1, 2],
            [5, 6]
        ]},

        {"Event": {
            "eventName" : "event1",
            "eventCount": 123
        }},
        {"Event": {
            "eventName" : "event2",
            "eventCount": 124
        }},
        { "Data": [
            [5,8],
            [1,2,6]
        ] }
    ]        
}    
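
For reference, a minimal sketch of how such a file might be loaded into a DataFrame (the path sample.json is a placeholder; the multiLine option is needed because the record spans several lines):

// spark is the SparkSession provided by spark-shell
val df = spark.read
  .option("multiLine", true)
  .json("sample.json")

df.printSchema   // prints the schema shown above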
  • can you update your sample json data? Commented Jul 18, 2020 at 13:54
  • Have you tried searching SO for "json explode"? For example dataframe Spark scala explode json array Commented Jul 18, 2020 at 14:22

1 Answer


Because Payload is of type array, accessing anything inside it without explode will give you a result of type array.

Change df.select("Payload.Event") to df.withColumn("Payload", explode($"Payload")).select("Payload.Event")

Check below code.

scala> df.printSchema
root
 |-- Header: struct (nullable = true)
 |    |-- Format: string (nullable = true)
 |    |-- Version: struct (nullable = true)
 |    |    |-- vfield: string (nullable = true)
 |-- Payload: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Data: array (nullable = true)
 |    |    |    |-- element: array (containsNull = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |    |-- Event: struct (nullable = true)
 |    |    |    |-- eventCount: long (nullable = true)
 |    |    |    |-- eventName: string (nullable = true)


scala> df.withColumn("Payload",explode($"Payload")).select("Payload.Event").printSchema
root
 |-- Event: struct (nullable = true)
 |    |-- eventCount: long (nullable = true)
 |    |-- eventName: string (nullable = true)


scala> df.withColumn("Payload",explode($"Payload")).select("Payload.Event.*").printSchema
root
 |-- eventCount: long (nullable = true)
 |-- eventName: string (nullable = true)
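
Combining the explode with a null filter (as mentioned in the comments below) gives a sketch of the complete transformation; the isNotNull condition drops the Payload elements that only carry Data:

import org.apache.spark.sql.functions.{col, explode}

val events = df
  .withColumn("Payload", explode(col("Payload")))  // one row per Payload element
  .where(col("Payload.Event").isNotNull)           // keep only elements that have an Event
  .select("Payload.Event.*")                       // eventCount and eventName as top-level columns

events can then be written out or processed like any other DataFrame.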

4 Comments

Thank you, that did the trick. Additionally I added a where clause so the nulls disappeared as well: df.withColumn("Payload",explode(df.col("Payload"))).where(!isnull($"Payload.Event")).select("Payload.Event.*")
Please accept or upvote if this solution helps you. :)
Hmm, I already accepted it; not sure why the tick disappeared. I'll try again :) Unfortunately I'm a newbie, so I can't upvote :(
let me upvote on OP's behalf. :) I indeed like this solution @Srinivas
