
I have a PySpark dataframe column that contains multiple addresses. The format is as below:

id       addresses
1       [{"city":null,"state":null,"street":"123, ABC St, ABC  Square","postalCode":"11111","country":"USA"},{"city":"Dallas","state":"TX","street":"456, DEF Plaza, Test St","postalCode":"99999","country":"USA"}]

I want to transform it as below:

id city state street postalCode country
1 null null 123, ABC St, ABC Square 11111 USA
1 Dallas TX 456, DEF Plaza, Test St 99999 USA

Any inputs on how to achieve this using PySpark? The dataset is huge (several TBs), so I want to do this in an efficient way.

I tried splitting the address string on commas; however, since there are commas within the addresses as well, the output is not as expected. I guess I need a regular expression pattern that uses the braces, but I'm not sure how to write it. Moreover, how do I go about denormalizing the data?
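
For reference, a minimal sketch of one way to get from the input above to the desired output, assuming the addresses value is a JSON array string exactly as shown. The schema is written out by hand here (which also avoids a schema-inference pass over a multi-TB dataset), and the column names are simply taken from the sample:

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, explode
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Schema of one address object inside the array
address_schema = StructType([
    StructField("city", StringType()),
    StructField("state", StringType()),
    StructField("street", StringType()),
    StructField("postalCode", StringType()),
    StructField("country", StringType()),
])

df = spark.createDataFrame(
    [(1, '[{"city":null,"state":null,"street":"123, ABC St, ABC  Square","postalCode":"11111","country":"USA"},'
         '{"city":"Dallas","state":"TX","street":"456, DEF Plaza, Test St","postalCode":"99999","country":"USA"}]')],
    ('id', 'addresses'))

# Parse the string into an array of structs, explode to one row per address,
# then flatten the struct fields into top-level columns
parsed = df.withColumn("address", explode(from_json("addresses", ArrayType(address_schema))))
parsed.select("id", "address.*").show(truncate=False)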

1 Answer


#Data

from pyspark.sql.functions import *
df = spark.createDataFrame([(1, '{"city":"New York","state":"NY","street":"123, ABC St, ABC  Square","postalCode":"11111","country":"USA"},{"city":"Dallas","state":"TX","street":"456, DEF Plaza, Test St","postalCode":"99999","country":"USA"}')],
                           ('id', 'addresses'))
df.show(truncate=False)

#Pass the string column to an RDD to extract the schema
rdd = df.select(col("addresses").alias("jsoncol")).rdd.map(lambda x: x.jsoncol)
newschema = spark.read.json(rdd).schema

#Apply the inferred schema to the string column using from_json
df3 = df.select("*", from_json("addresses", newschema).alias("test_col"))

df3.select('id', 'test_col.*').show(truncate=False)

+---+--------+-------+----------+-----+------------------------+
|id |city    |country|postalCode|state|street                  |
+---+--------+-------+----------+-----+------------------------+
|1  |New York|USA    |11111     |NY   |123, ABC St, ABC  Square|
+---+--------+-------+----------+-----+------------------------+
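
If the RDD round trip used above for schema inference is a concern on a multi-TB dataset, the schema can also be derived from a single sample value with schema_of_json (available in recent Spark versions). A sketch reusing the df built above; the sample string is just copied from its first row, and this is an alternative rather than part of the original answer:

from pyspark.sql.functions import schema_of_json, from_json, lit

# One representative address value; schema_of_json infers a DDL schema from it
sample = '{"city":"New York","state":"NY","street":"123, ABC St, ABC  Square","postalCode":"11111","country":"USA"}'
inferred = schema_of_json(lit(sample))

df.select("id", from_json("addresses", inferred).alias("a")).select("id", "a.*").show(truncate=False)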
  • clever approach to use json to parse the string. ++ for that. Saved as bookmark for future use.
    – Azhar Khan
    Commented Nov 20, 2022 at 13:03
  • @wwnde: Thanks for your solution! I tried this but in my final output every address part (street/postalCode etc.) is appearing as null. I tried to debug it and printed the RDD record which looks fine. So something is going wrong when converting the JSON column it seems. If it helps, here's what the RDD output looks like: first : [ { "city": null, "state": null, "street": "123, ABC St, ABC Square", "postalCode": "11111", "country": "USA" } ] City and state are null in the source itself but other fields are populated as can be seen in RDD.
    – Jatin
    Commented Nov 20, 2022 at 22:51
  • I removed the square brackets [] and also quoted "NY" and "TX" to make it a true JSON string when read. Does that help?
    – wwnde
    Commented Nov 20, 2022 at 23:08
  • Sorry, not sure I understand. Also, the addresses column in the source is already a true JSON string. My bad, I missed the quotes around NY and TX in my question. The data looks good up to the RDD, but all columns become null after that. Source data: id addresses 1 [{"city":null,"state":null,"street":"123, ABC St, ABC Square","postalCode":"11111","country":"USA"}] RDD output: [ { "city": null, "state": null, "street": "123, ABC St, ABC Square", "postalCode": "11111", "country": "USA" } ] Final output: id city country postalCode state street 1 null null null null null
    – Jatin
    Commented Nov 21, 2022 at 0:55
  • Are you able to post a dataframe constructor that would result in a df from which you want the fields extracted?
    – wwnde
    Commented Nov 21, 2022 at 2:46
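
A note on the all-null output discussed in this thread: the real addresses value starts with square brackets, so it is a JSON array, while spark.read.json flattens a top-level array into rows and therefore infers the schema of a single address. Applying that bare struct schema with from_json to an array string comes back null, which matches what is described above. A minimal sketch of one way to handle the array form, reusing the answer's schema-inference step and then exploding (the fix itself is an assumption, not part of the original thread):

from pyspark.sql.functions import col, from_json, explode
from pyspark.sql.types import ArrayType

# Source string kept exactly as in the question, square brackets included
df_arr = spark.createDataFrame(
    [(1, '[{"city":null,"state":null,"street":"123, ABC St, ABC  Square","postalCode":"11111","country":"USA"},'
         '{"city":"Dallas","state":"TX","street":"456, DEF Plaza, Test St","postalCode":"99999","country":"USA"}]')],
    ('id', 'addresses'))

# spark.read.json yields the schema of one address element
rdd = df_arr.select(col("addresses").alias("jsoncol")).rdd.map(lambda x: x.jsoncol)
element_schema = spark.read.json(rdd).schema

# Wrap the element schema in ArrayType, parse, explode to one row per address,
# and flatten the struct fields into columns
parsed = df_arr.withColumn("address", explode(from_json("addresses", ArrayType(element_schema))))
parsed.select("id", "address.*").show(truncate=False)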
