
I have a dataframe log_df containing raw log lines in a single value column.

I generate a new dataframe based on the following code:

from pyspark.sql.functions import split, regexp_extract 
split_log_df = log_df.select(regexp_extract('value', r'^([^\s]+\s)', 1).alias('host'),
                          regexp_extract('value', r'^.*\[(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})]', 1).alias('timestamp'),
                          regexp_extract('value', r'^.*"\w+\s+([^\s]+)\s+HTTP.*"', 1).alias('path'),
                          regexp_extract('value', r'^.*"\s+([^\s]+)', 1).cast('integer').alias('status'),
                          regexp_extract('value', r'^.*\s+(\d+)$', 1).cast('integer').alias('content_size'))
split_log_df.show(10, truncate=False)

The new dataframe has the columns host, timestamp, path, status, and content_size.

I need another column showing the day of week. What would be the most elegant way to create it? Ideally just adding a udf-like field in the select.

Updated: my question is different from the one linked in the comments. I need to make the calculation based on a string in log_df, not on an existing timestamp column, so this is not a duplicate question.

  • Write a UDF Python function that uses the Python datetime module and parses the timestamp column. Commented Aug 13, 2016 at 3:38
  • Possible duplicate of How to get day of week in SparkSQL? Commented Aug 13, 2016 at 3:39
  • @cricket_007 that's exactly what I am asking for help here, thanks. Commented Aug 13, 2016 at 10:04
  • You could reformat / cast the timestamp column into a date format that Spark accepts... then this question is practically a duplicate. And you don't need to regex-extract the date string; it has a standard format that datetime.strptime can parse. Commented Aug 13, 2016 at 14:10
  • @cricket_007 Thanks. Can you provide your full script here? I am really not satisfied with my own solution posted below here Commented Aug 13, 2016 at 18:45
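The strptime suggestion from the comments can be sketched in plain Python (the sample timestamp string `01/Jul/1995:00:00:01 -0400` is an assumption, matching the Apache log format used elsewhere on this page):

```python
from datetime import datetime

# The date portion of the log timestamp has a fixed format, so
# datetime.strptime can parse it directly -- no regex needed.
raw = '01/Jul/1995:00:00:01 -0400'
dt = datetime.strptime(raw.split(' ')[0], '%d/%b/%Y:%H:%M:%S')
print(dt.isoformat())  # 1995-07-01T00:00:01 -- a form Spark can cast to timestamp
```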

6 Answers


I suggest a slightly different method:

from pyspark.sql.functions import date_format
df3 = df.select(
    'capturetime',
    date_format('capturetime', 'u').alias('dow_number'),
    date_format('capturetime', 'E').alias('dow_string'))
df3.show()

It gives ...

+--------------------+----------+----------+
|         capturetime|dow_number|dow_string|
+--------------------+----------+----------+
|2017-06-05 10:05:...|         1|       Mon|
|2017-06-05 10:05:...|         1|       Mon|
|2017-06-05 10:05:...|         1|       Mon|
|2017-06-05 10:05:...|         1|       Mon|
|2017-06-05 10:05:...|         1|       Mon|
|2017-06-05 10:05:...|         1|       Mon|
|2017-06-05 10:05:...|         1|       Mon|
|2017-06-05 10:05:...|         1|       Mon|
+--------------------+----------+----------+

3 Comments

Is the 'u' option gone?
Looks like it; in Spark 3.0 'u' is no longer supported: spark.apache.org/docs/latest/sql-ref-datetime-pattern.html. Spark 3.0 suggests setting spark.sql.legacy.timeParserPolicy to LEGACY to get the old behavior.
You can use 'E' to get the string version of day-of-week: spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
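Summarizing the two workarounds from these comments as a sketch (this assumes an active `spark` session and the `capturetime` column from the answer above; treat it as configuration guidance rather than a complete program):

```python
# Option 1: restore the pre-3.0 parser/formatter behaviour, including 'u'.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

# Option 2: avoid 'u' entirely -- use dayofweek() for a number
# (Sunday = 1) and the 'E' pattern for the abbreviated name.
from pyspark.sql.functions import date_format, dayofweek
df.select(
    dayofweek('capturetime').alias('dow_number'),
    date_format('capturetime', 'E').alias('dow_string'))
```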

Since Spark 2.3 you can use the dayofweek function https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.dayofweek.html

from pyspark.sql.functions import dayofweek
df.withColumn('day_of_week', dayofweek('my_timestamp'))

However, this defines Sunday as the start of the week (Sunday = 1).

If you don't want that and instead require Monday = 1, you can apply an inelegant fudge: either subtract one day before calling dayofweek, or remap the result like this:

from pyspark.sql.functions import dayofweek
df.withColumn('day_of_week', ((dayofweek('my_timestamp')+5)%7)+1)
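The remap's arithmetic can be sanity-checked in plain Python: Spark's dayofweek numbers Sunday = 1 through Saturday = 7, and ((d + 5) % 7) + 1 shifts that to Monday = 1 through Sunday = 7:

```python
# Spark's dayofweek(): Sunday = 1, Monday = 2, ..., Saturday = 7.
spark_dow = {1: 'Sun', 2: 'Mon', 3: 'Tue', 4: 'Wed', 5: 'Thu', 6: 'Fri', 7: 'Sat'}

# Apply the same remap used in the withColumn expression above.
iso_dow = {name: ((d + 5) % 7) + 1 for d, name in spark_dow.items()}
print(iso_dow)  # {'Sun': 7, 'Mon': 1, 'Tue': 2, ..., 'Sat': 6}
```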

Comments


I did this to get the weekday from a date string:

import calendar
import datetime

def get_weekday(date):
    # date is a string like '7/1/1995' (month/day/year)
    month, day, year = (int(x) for x in date.split('/'))
    weekday = datetime.date(year, month, day)
    return calendar.day_name[weekday.weekday()]

spark.udf.register('get_weekday', get_weekday)

Example of usage:

df.createOrReplaceTempView("weekdays")
df = spark.sql("select DateTime, PlayersCount, get_weekday(Date) as Weekday from weekdays")


Comments

0

Here is a potential solution using a UDF.

Note that UDFs are a black box to PySpark: it can't apply any optimizations to them, so you lose the optimizations PySpark performs on DataFrames. You should prefer Spark SQL built-in functions, which do get optimized, and use a UDF only when no built-in function covers your case.


from dateutil.parser import parse
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def findWeekday(dt):
    # Parse the date string and return the full weekday name.
    return parse(dt).strftime('%A')

weekDayUDF = udf(findWeekday, StringType())

df.withColumn('weekday', weekDayUDF('ORDERDATE')).show()


+-------+---------------+--------+---------+
|  SALES|      ORDERDATE|MONTH_ID|  weekday|
+-------+---------------+--------+---------+
| 2871.0| 2/24/2003 0:00|       2|   Monday|
| 2765.9|  5/7/2003 0:00|       5|Wednesday|
|3884.34|  7/1/2003 0:00|       7|  Tuesday|
| 3746.7| 8/25/2003 0:00|       8|   Monday|
|5205.27|10/10/2003 0:00|      10|   Friday|
|3479.76|10/28/2003 0:00|      10|  Tuesday|
|2497.77|11/11/2003 0:00|      11|  Tuesday|
|5512.32|11/18/2003 0:00|      11|  Tuesday|
|2168.54| 12/1/2003 0:00|      12|   Monday|
|4708.44| 1/15/2004 0:00|       1| Thursday|
|3965.66| 2/20/2004 0:00|       2|   Friday|
+-------+---------------+--------+---------+

Comments


Different possibilities to extract weekday from timestamp:

Number

  • 0 = Monday, 1 = Tuesday, ..., 6 = Sunday:
    Spark 3.5+ F.weekday('col_name')
    Spark 2.4+ F.expr("weekday(col_name)")

  • 1 = Sunday, 2 = Monday, ..., 7 = Saturday:
    Spark 2.3+ F.dayofweek('col_name')

  • 1 = Monday, 2 = Tuesday, ..., 7 = Sunday:
    Spark 3.5+ F.extract(F.lit('dow_iso'), 'col_name')
    Spark 3.0+ F.expr("extract('dow_iso', col_name)")

  • Aligned day-of-week within a month:
    Spark 3.0+ F.date_format('col_name', 'F')

Abbreviation (Mon, Tue, ...)
Spark 4.0+ F.dayname('col_name')
Spark 1.5+ F.date_format('col_name', 'E')

Full name (Monday, Tuesday, ...)
Spark 1.5+ F.date_format('col_name', 'EEEE')

Other locales (e.g., zh = Chinese)
F.to_csv(F.struct(F.to_date('col_name')), {'dateFormat': 'E', 'locale': 'zh'})


General example

from pyspark.sql import functions as F
df = spark.range(1).selectExpr("timestamp'2018-12-31' col_name")

df = df.withColumns({
    'v1': F.weekday('col_name'),
    'v2': F.dayofweek('col_name'),
    'v3': F.extract(F.lit('dow_iso'), 'col_name'),
    'v4': F.date_format('col_name', 'F'),
    'v5': F.dayname('col_name'),
    'v6': F.date_format('col_name', 'E'),
    'v7': F.date_format('col_name', 'EEEE'),
    'v8': F.to_csv(F.struct(F.to_date('col_name')), {'dateFormat': 'E', 'locale': 'zh'}),
})

df.show()
# +-------------------+---+---+---+---+---+---+------+----+
# |           col_name| v1| v2| v3| v4| v5| v6|    v7|  v8|
# +-------------------+---+---+---+---+---+---+------+----+
# |2018-12-31 00:00:00|  0|  2|  1|  3|Mon|Mon|Monday|周一|
# +-------------------+---+---+---+---+---+---+------+----+

The answer

Since the OP has a string (not a timestamp), as an intermediary step, just for the OP, we add a couple of lines in order to return a timestamp:

ts_string = F.regexp_extract('value', r'^.*\[(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}) -\d{4}]', 1)
ts_parsed = F.to_timestamp(ts_string, 'dd/MMM/yyyy:HH:mm:ss')

Using the timestamp, we can choose the most suited function.

from pyspark.sql import functions as F
df = spark.createDataFrame([('[01/Jul/1995:00:00:01 -0400]',)], ['value'])

ts_string = F.regexp_extract('value', r'^.*\[(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}) -\d{4}]', 1)
ts_parsed = F.to_timestamp(ts_string, 'dd/MMM/yyyy:HH:mm:ss')

df = df.withColumns({
    'v1': F.weekday(ts_parsed),
    'v2': F.dayofweek(ts_parsed),
    'v3': F.extract(F.lit('dow_iso'), ts_parsed),
    'v4': F.date_format(ts_parsed, 'F'),
    'v5': F.dayname(ts_parsed),
    'v6': F.date_format(ts_parsed, 'E'),
    'v7': F.date_format(ts_parsed, 'EEEE'),
    'v8': F.to_csv(F.struct(F.to_date(ts_parsed)), {'dateFormat': 'E', 'locale': 'zh'}),
})

df.show()
# +--------------------+---+---+---+---+---+---+--------+----+
# |               value| v1| v2| v3| v4| v5| v6|      v7|  v8|
# +--------------------+---+---+---+---+---+---+--------+----+
# |[01/Jul/1995:00:0...|  5|  7|  6|  1|Sat|Sat|Saturday|周六|
# +--------------------+---+---+---+---+---+---+--------+----+

Comments


I finally resolved the question myself, here is the complete solution:

  1. import date_format, datetime, DataType
  2. first, modify the regexp to extract 01/Jul/1995
  3. convert 01/Jul/1995 to DateType using func
  4. create a udf dayOfWeek to get the week day in brief format (Mon, Tue,...)
  5. use the udf to convert the DateType 01/Jul/1995 to the weekday, which is Sat

I am not satisfied with my solution as it seems so zig-zag; I would appreciate it if anyone could come up with a more elegant solution. Thank you in advance.
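For what it's worth, the core of steps 2-5 can be sketched in plain Python (the regex and sample line below are assumptions based on the question's log format; in step 4 this function would be registered as a udf returning a string):

```python
import re
from datetime import datetime

def day_of_week(log_line):
    # Step 2: extract the date portion, e.g. '01/Jul/1995'.
    m = re.search(r'\[(\d{2}/\w{3}/\d{4}):', log_line)
    # Step 3: convert the string to a date.
    d = datetime.strptime(m.group(1), '%d/%b/%Y')
    # Steps 4-5: return the weekday in brief format ('Mon', 'Tue', ...).
    return d.strftime('%a')

print(day_of_week('[01/Jul/1995:00:00:01 -0400] "GET / HTTP/1.0" 200'))  # Sat
```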

Comments
