How to split a json string column in pandas/spark dataframe?

Question

My original dataframe has the following columns -

I want to split the json_result column into separate columns like this:

I tried using json_normalize, but couldn't apply it to the entire dataframe. Can someone share the code to transform the entire dataframe?

Adding the snippet that I have tried -

import pandas as pd

raw_data = [{'id': 1, 'name': 'NATALIE', 'json_result': '{"0": {"_source": {"person_id": 101, "firstname": "NATALIE", "lastname": "OSHO", "city_name": "WESTON"}}}'}, \
        {'id': 2, 'name': 'MARK', 'json_result': '{"0": {"_source": {"person_id": 102, "firstname": "MARK", "lastname": "BROWN", "city_name": "NEW YORK"}}}'}, \
        {'id': 3, 'name': 'NANCY', 'json_result': '{"0": {"_source": {"person_id": 103, "firstname": "NANCY", "lastname": "GATES", "city_name": "LA"}}}'}]

df = pd.DataFrame.from_dict(raw_data)

splitted_df = pd.json_normalize(df['json_result'][0])

Error Message:

AttributeError: 'str' object has no attribute 'values'

Share a few lines of your data. The solution is to use json_normalize, but it is hard to show you without data.
– Serge de Gosson de Varennes, Commented Nov 19, 2020 at 17:25
@SergedeGossondeVarennes I've added the data and the snippet I used to split the data. Could you please take a look?
– Teresa, Commented Nov 19, 2020 at 17:45

Srinivas · Accepted Answer · 2020-11-19 18:39:25Z

Here is a Spark version that converts the JSON into columns.

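from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Reuse the active session if there is one (e.g. in a notebook), else create it.
spark = SparkSession.builder.getOrCreate()

raw_data = \
    [{'id': 1, 'name': 'NATALIE', 'json_result': '{"0": {"_source": {"person_id": 101, "firstname": "NATALIE", "lastname": "OSHO", "city_name": "WESTON"}}}'}, \
     {'id': 2, 'name': 'MARK', 'json_result': '{"0": {"_source": {"person_id": 102, "firstname": "MARK", "lastname": "BROWN", "city_name": "NEW YORK"}}}'}, \
     {'id': 3, 'name': 'NANCY', 'json_result': '{"0": {"_source": {"person_id": 103, "firstname": "NANCY", "lastname": "GATES", "city_name": "LA"}}}'}]
df = spark.createDataFrame(raw_data)
# Infer the JSON schema by reading the json_result strings as a JSON dataset.
json_schema = spark.read.json(df.rdd.map(lambda rec: rec.json_result)).schema
# Parse the string column with that schema, then expand the nested _source struct.
df = df.withColumn('json', F.from_json(F.col('json_result'), json_schema)) \
    .select("id", "name", "json.0._source.*")
df.show()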

user5386938 · Accepted Answer · 2020-11-20 16:43:42Z

I'd like for somebody experienced in pandas to show me a better way but this is what I came up with. (I'm still learning pandas.)

import pandas as pd
import json

raw_data = [{'id': 1, 'name': 'NATALIE', 'json_result': '{"0": {"_source": {"person_id": 101, "firstname": "NATALIE", "lastname": "OSHO", "city_name": "WESTON"}}}'}, \
        {'id': 2, 'name': 'MARK', 'json_result': '{"0": {"_source": {"person_id": 102, "firstname": "MARK", "lastname": "BROWN", "city_name": "NEW YORK"}}}'}, \
        {'id': 3, 'name': 'NANCY', 'json_result': '{"0": {"_source": {"person_id": 103, "firstname": "NANCY", "lastname": "GATES", "city_name": "LA"}}}'}]

df = pd.DataFrame.from_dict(raw_data)
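# Parse each JSON string and flatten it into a one-row DataFrame.
ser = df['json_result'].apply(lambda s: pd.json_normalize(json.loads(s)))

# Drop the raw JSON column, stack the one-row frames, and join them back by position.
a = df.drop(columns=['json_result'])
b = pd.concat(list(ser), ignore_index=True)
c = a.join(b)

# Write the result as CSV to standard output.
print(c.to_csv(index=False), end='')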
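
Because json_normalize flattens nested keys with a "." separator, the code above produces column names such as 0._source.person_id. If plain names like person_id are wanted, one possible variation is to pull out the inner "_source" dict before building the new columns. This is only a sketch: it assumes every row has exactly the "0" -> "_source" nesting shown in the question, and the names sources, expanded and result are made up for illustration.

```python
import json

import pandas as pd

raw_data = [
    {'id': 1, 'name': 'NATALIE', 'json_result': '{"0": {"_source": {"person_id": 101, "firstname": "NATALIE", "lastname": "OSHO", "city_name": "WESTON"}}}'},
    {'id': 2, 'name': 'MARK', 'json_result': '{"0": {"_source": {"person_id": 102, "firstname": "MARK", "lastname": "BROWN", "city_name": "NEW YORK"}}}'},
    {'id': 3, 'name': 'NANCY', 'json_result': '{"0": {"_source": {"person_id": 103, "firstname": "NANCY", "lastname": "GATES", "city_name": "LA"}}}'},
]
df = pd.DataFrame(raw_data)

# Parse each JSON string once and keep only the inner "_source" dict.
# Assumption: every row contains the keys "0" -> "_source".
sources = [json.loads(s)["0"]["_source"] for s in df["json_result"]]

# Build the new columns on the same index as df so the join lines up row by row.
expanded = pd.DataFrame(sources, index=df.index)

# Replace the raw JSON column with the expanded columns.
result = df.drop(columns=["json_result"]).join(expanded)
print(result.to_csv(index=False), end="")
```

This parses each string once and builds a single DataFrame, instead of creating one small DataFrame per row and concatenating them. A row with missing keys would raise a KeyError here, so real data may need a guard around the lookup.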
