
I have a PySpark dataframe column that contains multiple addresses. The format is as below:

id       addresses
1       [{"city":null,"state":null,"street":"123, ABC St, ABC  Square","postalCode":"11111","country":"USA"},{"city":"Dallas","state":"TX","street":"456, DEF Plaza, Test St","postalCode":"99999","country":"USA"}]

I want to transform it as below:

id city state street postalCode country
1 null null 123, ABC St, ABC Square 11111 USA
1 Dallas TX 456, DEF Plaza, Test St 99999 USA

Any inputs on how to achieve this using PySpark? The dataset is huge (several TBs), so I want to do this in an efficient way.

I tried splitting the address string on commas; however, since there are commas within the addresses as well, the output is not as expected. I guess I need a regular expression pattern that uses the braces, but I'm not sure how to write it. Moreover, how do I go about denormalizing the data?
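
For reference, a minimal sketch of one way to get from the input above to the desired output, assuming the addresses value is a JSON array string exactly as shown. The schema is written out by hand here (which also avoids a schema-inference pass over a multi-TB dataset), and the column names are simply taken from the sample:

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, explode
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Schema of one address object inside the array
address_schema = StructType([
    StructField("city", StringType()),
    StructField("state", StringType()),
    StructField("street", StringType()),
    StructField("postalCode", StringType()),
    StructField("country", StringType()),
])

df = spark.createDataFrame(
    [(1, '[{"city":null,"state":null,"street":"123, ABC St, ABC  Square","postalCode":"11111","country":"USA"},'
         '{"city":"Dallas","state":"TX","street":"456, DEF Plaza, Test St","postalCode":"99999","country":"USA"}]')],
    ('id', 'addresses'))

# Parse the string into an array of structs, explode to one row per address,
# then flatten the struct fields into top-level columns
parsed = df.withColumn("address", explode(from_json("addresses", ArrayType(address_schema))))
parsed.select("id", "address.*").show(truncate=False)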

1 Answer


#Data

from pyspark.sql.functions import *
df = spark.createDataFrame([(1, '{"city":"New York","state":"NY","street":"123, ABC St, ABC  Square","postalCode":"11111","country":"USA"},{"city":"Dallas","state":"TX","street":"456, DEF Plaza, Test St","postalCode":"99999","country":"USA"}')],
                           ('id', 'addresses'))
df.show(truncate=False)

#Pass the string column to an RDD to extract the schema
rdd = df.select(col("addresses").alias("jsoncol")).rdd.map(lambda x: x.jsoncol)
newschema = spark.read.json(rdd).schema

#Apply the inferred schema to the string column using from_json
df3 = df.select("*", from_json("addresses", newschema).alias("test_col"))

df3.select('id', 'test_col.*').show(truncate=False)

+---+--------+-------+----------+-----+------------------------+
|id |city    |country|postalCode|state|street                  |
+---+--------+-------+----------+-----+------------------------+
|1  |New York|USA    |11111     |NY   |123, ABC St, ABC  Square|
+---+--------+-------+----------+-----+------------------------+
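
If the RDD round trip used above for schema inference is a concern on a multi-TB dataset, the schema can also be derived from a single sample value with schema_of_json (available in recent Spark versions). A sketch reusing the df built above; the sample string is just copied from its first row, and this is an alternative rather than part of the original answer:

from pyspark.sql.functions import schema_of_json, from_json, lit

# One representative address value; schema_of_json infers a DDL schema from it
sample = '{"city":"New York","state":"NY","street":"123, ABC St, ABC  Square","postalCode":"11111","country":"USA"}'
inferred = schema_of_json(lit(sample))

df.select("id", from_json("addresses", inferred).alias("a")).select("id", "a.*").show(truncate=False)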
  • clever approach to use json to parse the string. ++ for that. Saved as bookmark for future use.
    – Azhar Khan
    Commented Nov 20, 2022 at 13:03
  • @wwnde: Thanks for your solution! I tried this but in my final output every address part (street/postalCode etc.) is appearing as null. I tried to debug it and printed the RDD record which looks fine. So something is going wrong when converting the JSON column it seems. If it helps, here's what the RDD output looks like: first : [ { "city": null, "state": null, "street": "123, ABC St, ABC Square", "postalCode": "11111", "country": "USA" } ] City and state are null in the source itself but other fields are populated as can be seen in RDD.
    – Jatin
    Commented Nov 20, 2022 at 22:51
  • I removed the square brackets [] and also quoted "NY" and "TX" to make it a true JSON string when read. Does that help?
    – wwnde
    Commented Nov 20, 2022 at 23:08
  • Sorry, not sure I understand. Also, the addresses column in the source is already a true JSON string. My bad, I missed the quotes around NY and TX in my question. The data looks good up to the RDD, but all columns become null after that. Source data: id addresses 1 [{"city":null,"state":null,"street":"123, ABC St, ABC Square","postalCode":"11111","country":"USA"}] RDD output: [ { "city": null, "state": null, "street": "123, ABC St, ABC Square", "postalCode": "11111", "country": "USA" } ] Final output: id city country postalCode state street 1 null null null null null
    – Jatin
    Commented Nov 21, 2022 at 0:55
  • Are you able to post a dataframe constructor that would result in a df from which you want the fields extracted?
    – wwnde
    Commented Nov 21, 2022 at 2:46
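
A note on the all-null output discussed in this thread: the real addresses value starts with square brackets, so it is a JSON array, while spark.read.json flattens a top-level array into rows and therefore infers the schema of a single address. Applying that bare struct schema with from_json to an array string comes back null, which matches what is described above. A minimal sketch of one way to handle the array form, reusing the answer's schema-inference step and then exploding (the fix itself is an assumption, not part of the original thread):

from pyspark.sql.functions import col, from_json, explode
from pyspark.sql.types import ArrayType

# Source string kept exactly as in the question, square brackets included
df_arr = spark.createDataFrame(
    [(1, '[{"city":null,"state":null,"street":"123, ABC St, ABC  Square","postalCode":"11111","country":"USA"},'
         '{"city":"Dallas","state":"TX","street":"456, DEF Plaza, Test St","postalCode":"99999","country":"USA"}]')],
    ('id', 'addresses'))

# spark.read.json yields the schema of one address element
rdd = df_arr.select(col("addresses").alias("jsoncol")).rdd.map(lambda x: x.jsoncol)
element_schema = spark.read.json(rdd).schema

# Wrap the element schema in ArrayType, parse, explode to one row per address,
# and flatten the struct fields into columns
parsed = df_arr.withColumn("address", explode(from_json("addresses", ArrayType(element_schema))))
parsed.select("id", "address.*").show(truncate=False)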
