
There are many threads on this topic, but the examples in them are not working for me. I want to split a string column on the literal `.` character into separate columns.

Consider the following:

from pyspark.sql import functions as F

df = spark.createDataFrame(sc.parallelize([['1', 'SN4.F01C04-AM428.1_31']]), ["col1", "col2"])

+----+--------------------+
|col1|                col2|
+----+--------------------+
|   1|SN4.F01C04-AM428....|
+----+--------------------+

What I tried:

display(df.select(F.split(df.col2, '.', 1).alias('s')))
+--------------------+
|                   s|
+--------------------+
|[SN4.F01C04-AM428...|
+--------------------+

Expected:

expected = spark.createDataFrame(sc.parallelize([['1', 'SN4', 'F01C04-AM428', '1_31']]), ["col1", "col2", "col3", "col4"])

+----+----+------------+----+
|col1|col2|        col3|col4|
+----+----+------------+----+
|   1| SN4|F01C04-AM428|1_31|
+----+----+------------+----+

1 Answer

`split` treats its pattern argument as a Java regular expression, in which `.` is a special character that matches any single character, so it must be escaped (`\\.`) to match a literal dot. Your attempt also passed `limit=1`, which caps the result at one element; drop the limit (or pass `-1`) so the string is split on every dot. To then turn the resulting array into separate columns, pull each element out with `getItem`:

df = df.select('col1', *[F.split('col2', '\\.').getItem(i).alias(f'col{i + 2}') for i in range(3)])
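To see why the unescaped pattern fails, note that Spark's `split` follows the same regex semantics as Python's `re.split`. A minimal sketch of the two behaviors on the sample string (using the `re` module only to illustrate the regex rule, not Spark itself):

```python
import re

s = "SN4.F01C04-AM428.1_31"

# Unescaped '.' is a regex wildcard: it matches every character,
# so the string is split between every position and only empty
# fragments remain.
print(re.split('.', s))

# Escaping the dot splits on the literal '.' character,
# giving the three pieces the expected columns are built from.
print(re.split(r'\.', s))  # ['SN4', 'F01C04-AM428', '1_31']
```

The `getItem(i)` calls in the answer then map these three array elements to `col2`, `col3`, and `col4`.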
