My original dataframe has the columns id, name, and json_result, where json_result holds a JSON string.

I want to split the json_result column into separate columns, one for each field inside the JSON (person_id, firstname, lastname, city_name).

I tried using json_normalize, but couldn't apply it to the entire dataframe. Can someone share code that transforms the entire dataframe?

Here is the snippet I tried:

import pandas as pd

raw_data = [
    {'id': 1, 'name': 'NATALIE', 'json_result': '{"0": {"_source": {"person_id": 101, "firstname": "NATALIE", "lastname": "OSHO", "city_name": "WESTON"}}}'},
    {'id': 2, 'name': 'MARK', 'json_result': '{"0": {"_source": {"person_id": 102, "firstname": "MARK", "lastname": "BROWN", "city_name": "NEW YORK"}}}'},
    {'id': 3, 'name': 'NANCY', 'json_result': '{"0": {"_source": {"person_id": 103, "firstname": "NANCY", "lastname": "GATES", "city_name": "LA"}}}'},
]

df = pd.DataFrame.from_dict(raw_data)

# This call fails: df['json_result'][0] is still a JSON string, not a parsed dict.
splitted_df = pd.json_normalize(df['json_result'][0])

Error Message:

AttributeError: 'str' object has no attribute 'values'

  • Share a few lines of your data. The solution is to use json_normalize, but it is hard to show you without data. Commented Nov 19, 2020 at 17:25
  • @SergedeGossondeVarennes I've added the data and the snippet I used to split the data. Could you please take a look? Commented Nov 19, 2020 at 17:45

2 Answers

Here is a Spark (PySpark) version that converts the JSON into columns.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

raw_data = [
    {'id': 1, 'name': 'NATALIE', 'json_result': '{"0": {"_source": {"person_id": 101, "firstname": "NATALIE", "lastname": "OSHO", "city_name": "WESTON"}}}'},
    {'id': 2, 'name': 'MARK', 'json_result': '{"0": {"_source": {"person_id": 102, "firstname": "MARK", "lastname": "BROWN", "city_name": "NEW YORK"}}}'},
    {'id': 3, 'name': 'NANCY', 'json_result': '{"0": {"_source": {"person_id": 103, "firstname": "NANCY", "lastname": "GATES", "city_name": "LA"}}}'},
]
df = spark.createDataFrame(raw_data)
# Infer the JSON schema by reading the json_result strings as a JSON dataset.
json_schema = spark.read.json(df.rdd.map(lambda rec: rec.json_result)).schema
# Parse the string column with that schema, then expand the nested _source struct.
df = df.withColumn('json', F.from_json(F.col('json_result'), json_schema)) \
    .select("id", "name", "json.0._source.*")
df.show()

I'd like for somebody experienced in pandas to show me a better way but this is what I came up with. (I'm still learning pandas.)

import json
import sys

import pandas as pd

raw_data = [
    {'id': 1, 'name': 'NATALIE', 'json_result': '{"0": {"_source": {"person_id": 101, "firstname": "NATALIE", "lastname": "OSHO", "city_name": "WESTON"}}}'},
    {'id': 2, 'name': 'MARK', 'json_result': '{"0": {"_source": {"person_id": 102, "firstname": "MARK", "lastname": "BROWN", "city_name": "NEW YORK"}}}'},
    {'id': 3, 'name': 'NANCY', 'json_result': '{"0": {"_source": {"person_id": 103, "firstname": "NANCY", "lastname": "GATES", "city_name": "LA"}}}'},
]

df = pd.DataFrame.from_dict(raw_data)
# Parse each JSON string and flatten it into a one-row DataFrame.
ser = df['json_result'].apply(lambda s: pd.json_normalize(json.loads(s)))

# Original columns without the raw JSON string.
a = df.drop(columns=['json_result'])
# Stack the one-row frames into a single frame with a fresh 0..n-1 index.
b = pd.concat(list(ser), ignore_index=True)
# Join on the index, which matches because df uses the default integer index.
c = a.join(b)

c.to_csv(sys.stdout, index=False)
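
A possible variation on the approach above, offered as a sketch rather than a definitive solution: normalizing the whole parsed document yields column names prefixed with the nesting path (for example 0._source.person_id). If every json_result string has the single top-level key "0" with a "_source" object, as in the sample data (that is an assumption), you can pull out the inner dict first and call pd.json_normalize once on the whole list. The helper name extract_source is made up for illustration.

```python
import json

import pandas as pd

raw_data = [
    {'id': 1, 'name': 'NATALIE', 'json_result': '{"0": {"_source": {"person_id": 101, "firstname": "NATALIE", "lastname": "OSHO", "city_name": "WESTON"}}}'},
    {'id': 2, 'name': 'MARK', 'json_result': '{"0": {"_source": {"person_id": 102, "firstname": "MARK", "lastname": "BROWN", "city_name": "NEW YORK"}}}'},
    {'id': 3, 'name': 'NANCY', 'json_result': '{"0": {"_source": {"person_id": 103, "firstname": "NANCY", "lastname": "GATES", "city_name": "LA"}}}'},
]
df = pd.DataFrame(raw_data)


def extract_source(s):
    """Hypothetical helper: parse one JSON string and return its inner "_source" dict.

    Assumes the layout {"0": {"_source": {...}}} seen in the sample data.
    """
    return json.loads(s)["0"]["_source"]


# json_normalize accepts a list of dicts and returns one row per dict.
parsed = pd.json_normalize(df["json_result"].map(extract_source).tolist())

# Reuse df's index so the join lines rows up even if df is not 0..n-1.
parsed.index = df.index

result = df.drop(columns=["json_result"]).join(parsed)
print(result)
```

With the sample data this should give the columns id, name, person_id, firstname, lastname, and city_name. Rows whose JSON has a different shape would raise a KeyError in extract_source, so real data may need extra handling.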
