
I have a dataframe log_df containing raw log lines in a single value column.

I generate a new dataframe based on the following code:

from pyspark.sql.functions import split, regexp_extract 
split_log_df = log_df.select(regexp_extract('value', r'^([^\s]+\s)', 1).alias('host'),
                          regexp_extract('value', r'^.*\[(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})]', 1).alias('timestamp'),
                          regexp_extract('value', r'^.*"\w+\s+([^\s]+)\s+HTTP.*"', 1).alias('path'),
                          regexp_extract('value', r'^.*"\s+([^\s]+)', 1).cast('integer').alias('status'),
                          regexp_extract('value', r'^.*\s+(\d+)$', 1).cast('integer').alias('content_size'))
split_log_df.show(10, truncate=False)

The new dataframe has the columns host, timestamp, path, status, and content_size.

I need another column showing the day of week. What would be the most elegant way to create it? Ideally just adding a udf-like field in the select.

Updated: my question is different from the one linked in the comments. I need to make the calculation based on a string in log_df, not on an existing timestamp column, so this is not a duplicate question.

  • Write a UDF Python function that uses the Python datetime module and parses the timestamp column. Commented Aug 13, 2016 at 3:38
  • Possible duplicate of How to get day of week in SparkSQL? Commented Aug 13, 2016 at 3:39
  • @cricket_007 that's exactly what I am asking for help here, thanks. Commented Aug 13, 2016 at 10:04
  • You could reformat / cast the timestamp column into a date format that Spark accepts... then this question is practically a duplicate. And you don't need to regex-extract the date string; it has a standard format that datetime.strptime can parse. Commented Aug 13, 2016 at 14:10
  • @cricket_007 Thanks. Can you provide your full script here? I am really not satisfied with my own solution posted below here Commented Aug 13, 2016 at 18:45
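The strptime suggestion from the comments can be sketched in plain Python (the sample timestamp string `01/Jul/1995:00:00:01 -0400` is an assumption, matching the Apache log format used elsewhere on this page):

```python
from datetime import datetime

# The date portion of the log timestamp has a fixed format, so
# datetime.strptime can parse it directly -- no regex needed.
raw = '01/Jul/1995:00:00:01 -0400'
dt = datetime.strptime(raw.split(' ')[0], '%d/%b/%Y:%H:%M:%S')
print(dt.isoformat())  # 1995-07-01T00:00:01 -- a form Spark can cast to timestamp
```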

6 Answers


I suggest a slightly different method:

from pyspark.sql.functions import date_format
df3 = df.select(
    'capturetime',
    date_format('capturetime', 'u').alias('dow_number'),
    date_format('capturetime', 'E').alias('dow_string'))
df3.show()

It gives ...

+--------------------+----------+----------+
|         capturetime|dow_number|dow_string|
+--------------------+----------+----------+
|2017-06-05 10:05:...|         1|       Mon|
|2017-06-05 10:05:...|         1|       Mon|
|2017-06-05 10:05:...|         1|       Mon|
|2017-06-05 10:05:...|         1|       Mon|
|2017-06-05 10:05:...|         1|       Mon|
|2017-06-05 10:05:...|         1|       Mon|
|2017-06-05 10:05:...|         1|       Mon|
|2017-06-05 10:05:...|         1|       Mon|
+--------------------+----------+----------+

3 Comments

Is the 'u' option gone?
Looks like it; in Spark 3.0 'u' is no longer supported: spark.apache.org/docs/latest/sql-ref-datetime-pattern.html. Spark 3.0 suggests setting spark.sql.legacy.timeParserPolicy to LEGACY to get the old behavior.
You can use 'E' to get the string version of day-of-week: spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
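Summarizing the two workarounds from these comments as a sketch (this assumes an active `spark` session and the `capturetime` column from the answer above; treat it as configuration guidance rather than a complete program):

```python
# Option 1: restore the pre-3.0 parser/formatter behaviour, including 'u'.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

# Option 2: avoid 'u' entirely -- use dayofweek() for a number
# (Sunday = 1) and the 'E' pattern for the abbreviated name.
from pyspark.sql.functions import date_format, dayofweek
df.select(
    dayofweek('capturetime').alias('dow_number'),
    date_format('capturetime', 'E').alias('dow_string'))
```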

Since Spark 2.3 you can use the dayofweek function https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.dayofweek.html

from pyspark.sql.functions import dayofweek
df.withColumn('day_of_week', dayofweek('my_timestamp'))

However, this defines Sunday as the start of the week (Sunday = 1).

If you don't want that and instead require Monday = 1, you can apply an inelegant fudge: either subtract one day before calling dayofweek, or remap the result like this:

from pyspark.sql.functions import dayofweek
df.withColumn('day_of_week', ((dayofweek('my_timestamp')+5)%7)+1)
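The remap's arithmetic can be sanity-checked in plain Python: Spark's dayofweek numbers Sunday = 1 through Saturday = 7, and ((d + 5) % 7) + 1 shifts that to Monday = 1 through Sunday = 7:

```python
# Spark's dayofweek(): Sunday = 1, Monday = 2, ..., Saturday = 7.
spark_dow = {1: 'Sun', 2: 'Mon', 3: 'Tue', 4: 'Wed', 5: 'Thu', 6: 'Fri', 7: 'Sat'}

# Apply the same remap used in the withColumn expression above.
iso_dow = {name: ((d + 5) % 7) + 1 for d, name in spark_dow.items()}
print(iso_dow)  # {'Sun': 7, 'Mon': 1, 'Tue': 2, ..., 'Sat': 6}
```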

Comments


I did this to get the weekday from a date string:

import calendar
import datetime

def get_weekday(date):
    # date is a string like '7/1/1995' (month/day/year)
    month, day, year = (int(x) for x in date.split('/'))
    weekday = datetime.date(year, month, day)
    return calendar.day_name[weekday.weekday()]

spark.udf.register('get_weekday', get_weekday)

Example of usage:

df.createOrReplaceTempView("weekdays")
df = spark.sql("select DateTime, PlayersCount, get_weekday(Date) as Weekday from weekdays")


Comments

0

Here is a potential solution using a UDF.

Note that UDFs are a black box to PySpark: it can't apply any optimizations to them, so you lose the optimizations PySpark performs on DataFrames. You should prefer Spark SQL built-in functions, which do get optimized, and use a UDF only when no built-in function covers your case.


from dateutil.parser import parse
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def findWeekday(dt):
    # Parse the date string and return the full weekday name.
    return parse(dt).strftime('%A')

weekDayUDF = udf(findWeekday, StringType())

df.withColumn('weekday', weekDayUDF('ORDERDATE')).show()


+-------+---------------+--------+---------+
|  SALES|      ORDERDATE|MONTH_ID|  weekday|
+-------+---------------+--------+---------+
| 2871.0| 2/24/2003 0:00|       2|   Monday|
| 2765.9|  5/7/2003 0:00|       5|Wednesday|
|3884.34|  7/1/2003 0:00|       7|  Tuesday|
| 3746.7| 8/25/2003 0:00|       8|   Monday|
|5205.27|10/10/2003 0:00|      10|   Friday|
|3479.76|10/28/2003 0:00|      10|  Tuesday|
|2497.77|11/11/2003 0:00|      11|  Tuesday|
|5512.32|11/18/2003 0:00|      11|  Tuesday|
|2168.54| 12/1/2003 0:00|      12|   Monday|
|4708.44| 1/15/2004 0:00|       1| Thursday|
|3965.66| 2/20/2004 0:00|       2|   Friday|
+-------+---------------+--------+---------+

Comments


Different possibilities to extract weekday from timestamp:

Number

  • 0 = Monday, 1 = Tuesday, ..., 6 = Sunday:
    Spark 3.5+ F.weekday('col_name')
    Spark 2.4+ F.expr("weekday(col_name)")

  • 1 = Sunday, 2 = Monday, ..., 7 = Saturday:
    Spark 2.3+ F.dayofweek('col_name')

  • 1 = Monday, 2 = Tuesday, ..., 7 = Sunday:
    Spark 3.5+ F.extract(F.lit('dow_iso'), 'col_name')
    Spark 3.0+ F.expr("extract('dow_iso', col_name)")

  • Aligned day-of-week within a month:
    Spark 3.0+ F.date_format('col_name', 'F')

Abbreviation (Mon, Tue, ...)
Spark 4.0+ F.dayname('col_name')
Spark 1.5+ F.date_format('col_name', 'E')

Full name (Monday, Tuesday, ...)
Spark 1.5+ F.date_format('col_name', 'EEEE')

Other locales (e.g., zh = Chinese)
F.to_csv(F.struct(F.to_date('col_name')), {'dateFormat': 'E', 'locale': 'zh'})


General example

from pyspark.sql import functions as F
df = spark.range(1).selectExpr("timestamp'2018-12-31' col_name")

df = df.withColumns({
    'v1': F.weekday('col_name'),
    'v2': F.dayofweek('col_name'),
    'v3': F.extract(F.lit('dow_iso'), 'col_name'),
    'v4': F.date_format('col_name', 'F'),
    'v5': F.dayname('col_name'),
    'v6': F.date_format('col_name', 'E'),
    'v7': F.date_format('col_name', 'EEEE'),
    'v8': F.to_csv(F.struct(F.to_date('col_name')), {'dateFormat': 'E', 'locale': 'zh'}),
})

df.show()
# +-------------------+---+---+---+---+---+---+------+----+
# |           col_name| v1| v2| v3| v4| v5| v6|    v7|  v8|
# +-------------------+---+---+---+---+---+---+------+----+
# |2018-12-31 00:00:00|  0|  2|  1|  3|Mon|Mon|Monday|周一|
# +-------------------+---+---+---+---+---+---+------+----+

The answer

Since the OP has a string (not a timestamp), as an intermediary step, just for the OP, we add a couple of lines in order to return a timestamp:

ts_string = F.regexp_extract('value', r'^.*\[(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}) -\d{4}]', 1)
ts_parsed = F.to_timestamp(ts_string, 'dd/MMM/yyyy:HH:mm:ss')

Using the timestamp, we can choose the most suited function.

from pyspark.sql import functions as F
df = spark.createDataFrame([('[01/Jul/1995:00:00:01 -0400]',)], ['value'])

ts_string = F.regexp_extract('value', r'^.*\[(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}) -\d{4}]', 1)
ts_parsed = F.to_timestamp(ts_string, 'dd/MMM/yyyy:HH:mm:ss')

df = df.withColumns({
    'v1': F.weekday(ts_parsed),
    'v2': F.dayofweek(ts_parsed),
    'v3': F.extract(F.lit('dow_iso'), ts_parsed),
    'v4': F.date_format(ts_parsed, 'F'),
    'v5': F.dayname(ts_parsed),
    'v6': F.date_format(ts_parsed, 'E'),
    'v7': F.date_format(ts_parsed, 'EEEE'),
    'v8': F.to_csv(F.struct(F.to_date(ts_parsed)), {'dateFormat': 'E', 'locale': 'zh'}),
})

df.show()
# +--------------------+---+---+---+---+---+---+--------+----+
# |               value| v1| v2| v3| v4| v5| v6|      v7|  v8|
# +--------------------+---+---+---+---+---+---+--------+----+
# |[01/Jul/1995:00:0...|  5|  7|  6|  1|Sat|Sat|Saturday|周六|
# +--------------------+---+---+---+---+---+---+--------+----+

Comments


I finally resolved the question myself, here is the complete solution:

  1. import date_format, datetime, DataType
  2. first, modify the regexp to extract 01/Jul/1995
  3. convert 01/Jul/1995 to DateType using func
  4. create a udf dayOfWeek to get the week day in brief format (Mon, Tue,...)
  5. use the udf to convert the DateType 01/Jul/1995 to the weekday, which is Sat

I am not satisfied with my solution as it seems so zig-zag; I would appreciate it if anyone could come up with a more elegant solution. Thank you in advance.
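For what it's worth, the core of steps 2-5 can be sketched in plain Python (the regex and sample line below are assumptions based on the question's log format; in step 4 this function would be registered as a udf returning a string):

```python
import re
from datetime import datetime

def day_of_week(log_line):
    # Step 2: extract the date portion, e.g. '01/Jul/1995'.
    m = re.search(r'\[(\d{2}/\w{3}/\d{4}):', log_line)
    # Step 3: convert the string to a date.
    d = datetime.strptime(m.group(1), '%d/%b/%Y')
    # Steps 4-5: return the weekday in brief format ('Mon', 'Tue', ...).
    return d.strftime('%a')

print(day_of_week('[01/Jul/1995:00:00:01 -0400] "GET / HTTP/1.0" 200'))  # Sat
```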

Comments
