
There are many threads on this topic, but the examples in them are not working for me. I want to split a string column on the literal `.` character into separate columns.

Consider the following:

from pyspark.sql import functions as F

df = spark.createDataFrame(sc.parallelize([['1', 'SN4.F01C04-AM428.1_31']]), ["col1", "col2"])

+----+--------------------+
|col1|                col2|
+----+--------------------+
|   1|SN4.F01C04-AM428....|
+----+--------------------+

What I tried:

display(df.select(F.split(df.col2, '.', 1).alias('s')))
+--------------------+
|                   s|
+--------------------+
|[SN4.F01C04-AM428...|
+--------------------+

Expected:

expected = spark.createDataFrame(sc.parallelize([['1', 'SN4', 'F01C04-AM428', '1_31']]), ["col1", "col2", "col3", "col4"])

+----+----+------------+----+
|col1|col2|        col3|col4|
+----+----+------------+----+
|   1| SN4|F01C04-AM428|1_31|
+----+----+------------+----+

1 Answer

`split` treats its pattern argument as a Java regular expression, in which `.` is a special character that matches any single character, so it must be escaped (`\\.`) to match a literal dot. Your attempt also passed `limit=1`, which caps the result at one element; drop the limit (or pass `-1`) so the string is split on every dot. To then turn the resulting array into separate columns, pull each element out with `getItem`:

df = df.select('col1', *[F.split('col2', '\\.').getItem(i).alias(f'col{i + 2}') for i in range(3)])
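To see why the unescaped pattern fails, note that Spark's `split` follows the same regex semantics as Python's `re.split`. A minimal sketch of the two behaviors on the sample string (using the `re` module only to illustrate the regex rule, not Spark itself):

```python
import re

s = "SN4.F01C04-AM428.1_31"

# Unescaped '.' is a regex wildcard: it matches every character,
# so the string is split between every position and only empty
# fragments remain.
print(re.split('.', s))

# Escaping the dot splits on the literal '.' character,
# giving the three pieces the expected columns are built from.
print(re.split(r'\.', s))  # ['SN4', 'F01C04-AM428', '1_31']
```

The `getItem(i)` calls in the answer then map these three array elements to `col2`, `col3`, and `col4`.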
