I'm scraping data from a news site and want to store the time and date these articles were posted. The good thing is that I can pull these timestamps right from the page of the articles.

When the articles I scrape were posted today, the output looks like this:

17:22 ET
02:41 ET
06:14 ET

When the articles were posted earlier than today, the output looks like this:

Mar 10, 2021, 16:05 ET
Mar 08, 2021, 08:00 ET
Feb 26, 2021, 11:23 ET

Current problem: I can't order my database by the time the articles were posted, because whenever I run the program, articles that were posted today are stored only with a time. Over multiple days, this will create a lot of articles with a stamp that looks as if they were posted on the day you look at the database - since there is only a time.

What I want: Add the current month/day/year in front of the time stamp on the basis of the already given format.

My idea: I have a hard time to understand how regex works. My idea would be to check the length of the imported string. If it is exactly 8, I want to add the Month, Date and Year in front. But I don't know whether this is a) the most efficient approach and b) most importantly, how to code this seemingly easy idea.

I would gladly appreciate it if someone could help me code this. The current line which grabs the time looks like this:

article_time = item.select_one('h3 small').text
4
  • I don't mind your basic idea. If the string doesn't contain a comma, then it is just a time, and you need to prepend the date. Commented Mar 12, 2021 at 22:13
  • No, not at all. The problem is I don't get the syntax in datetime to add the month (in short form), date and year yet :( Commented Mar 12, 2021 at 22:15
  • Risking getting slightly off-topic for SO, you can try to use more forgiving date-time parsing utilities, such as arrow or dateparser (which was specifically developed to handle formats commonly used on websites, like "x minutes ago"). If I remember correctly, if the date is missing they will use current date as default
    – DeepSpace
    Commented Mar 12, 2021 at 22:15
  • @Niklas: You can use the replace() method to add things to a datetime object.
    – martineau
    Commented Mar 13, 2021 at 11:53
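
The comma check suggested in the comments above could be sketched roughly as follows, keeping the site's own string format as the question asks. The helper name `add_missing_date` and its `today` parameter are made up for illustration. Note that `%b` depends on the locale, and that the machine's local date is only assumed to match the date in ET.

```python
from datetime import date

def add_missing_date(stamp, today=None):
    """Prepend a date to time-only stamps such as '17:22 ET'."""
    stamp = stamp.strip()
    if ',' in stamp:
        # Already has a date, e.g. 'Mar 10, 2021, 16:05 ET'.
        return stamp
    if today is None:
        # Assumption: the local date is close enough to the date in ET.
        today = date.today()
    return f"{today.strftime('%b %d, %Y')}, {stamp}"

print(add_missing_date('17:22 ET', today=date(2021, 3, 12)))
print(add_missing_date('Mar 10, 2021, 16:05 ET'))
```

Passing `today` explicitly makes the result reproducible; in the scraper it would normally be left out.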

3 Answers

1

Try this out, and others can correct me if I overlooked something:

from datetime import datetime, timedelta

def get_datetime_from_time(time):
    time, timezone = time.rsplit(' ', 1)
    if ',' in time:
        article_time = datetime.strptime(time, r"%b %d, %Y, %H:%M")
    else:
        article_time = datetime.strptime(time, r"%H:%M")
        hour, minute = article_time.hour, article_time.minute
        if timezone == 'ET':
            hours = -4
        else:
            hours = -5
        article_time = (datetime.utcnow() + timedelta(hours=hours)).replace(hour=hour, minute=minute, second=0, microsecond=0)  # Adjust for timezone; zero out seconds
    return article_time
        

article_time = item.select_one('h3 small').text
article_time = get_datetime_from_time(article_time)

Here I check whether the time string contains a comma. If it does, the string includes a date; otherwise it is time-only. I then check the timezone, since daylight time differs from standard time, and pick an offset of 4 or 5 hours accordingly. Finally I take the current UTC time (regardless of your machine's timezone), apply that offset, and replace the hour and minute with the parsed values. strptime is a function that parses a time string according to the format you give it.

Note that this does not take into account an empty time string.
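
One way to cover that case might be a small wrapper that returns None for empty or malformed input. This is only a sketch: `parse_or_none` is a hypothetical name, it assumes every stamp ends in a zone code such as "ET" and simply drops it, and it takes the fallback date as a parameter rather than computing it.

```python
from datetime import date, datetime

def parse_or_none(text, today):
    """Return a naive datetime, or None if text is empty or malformed."""
    text = (text or '').strip()
    if not text:
        return None
    # Drop the trailing zone code, e.g. 'ET' (assumed to always be present).
    stamp = text.rsplit(' ', 1)[0]
    try:
        if ',' in stamp:
            return datetime.strptime(stamp, '%b %d, %Y, %H:%M')
        clock = datetime.strptime(stamp, '%H:%M').time()
    except ValueError:
        return None
    # Time-only stamp: attach the supplied date.
    return datetime.combine(today, clock)

print(parse_or_none('', date(2021, 3, 12)))
print(parse_or_none('17:22 ET', date(2021, 3, 12)))
```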

4
  • This is so far the best answer, but it doesn't produce output that can be ordered chronologically. The tricky part here is the "ET" short timezone code, otherwise, it's pretty simple to convert your output into a datetime object. Commented Mar 12, 2021 at 22:42
  • @MichaelRuth But I thought I handled it, right? By stripping out the timezone and handling for it in -4/-5.
    – thethiny
    Commented Mar 13, 2021 at 4:48
  • What if there are entries with PT, MT, CT, AT, or other timezones? I realize the OP only posted data with ET timezone, but it's pretty reasonable to assume that other timezones may exist in the complete dataset, hopefully none with the Eastern European Timezone (EET) since I'm uncertain how it would be differentiated from the Eastern Timezone with only a two character code. Commented Mar 14, 2021 at 4:04
  • @MichaelRuth Yes I am with you on that. That's why I added an if statement there. I'm not going to be solving OP's every edge case, he needs to do that on his own, I only provide the template. But you're right.
    – thethiny
    Commented Mar 14, 2021 at 5:45
0

Handling timezones properly can get fairly involved since the standard library barely supports them (and recommends using the third-party pytz module to do so). This would be especially true if you need it

So, one "quick and dirty" way to deal with them would be to just ignore that information and add the current day, month, and year to any timestamps encountered that don't include that. The code below demonstrates how to do that.

from datetime import datetime


scrapped = '''
17:22 ET
02:41 ET
06:14 ET
Mar 10, 2021, 16:05 ET
Mar 08, 2021, 08:00 ET
Feb 26, 2021, 11:23 ET
'''

def get_datetime(string):
    string = string[:-3]  # Remove timezone.
    try:
        r = datetime.strptime(string, "%b %d, %Y, %H:%M")
    except ValueError:
        try:
            today = datetime.today()
            daytime = datetime.strptime(string, "%H:%M")
            r = today.replace(hour=daytime.hour, minute=daytime.minute, second=0, microsecond=0)
        except ValueError:
            r = None
    return r

for line in scrapped.splitlines():
    if line:
        r = get_datetime(line)
        print(f'{line=}, {r=}')
7
  • "the standard library barely supports them" - Python 3.9 has zoneinfo, I'd consider this rather good support of time zones ;-) Commented Mar 13, 2021 at 9:25
  • @MrFuppes: In that case feel free to post an answer of your own.
    – martineau
    Commented Mar 13, 2021 at 9:30
  • tempting... don't get me wrong, I'm not criticizing the code you suggest, just that people new to Python should be pointed to up-to-date possibilities / libraries. It's easy to get tz handling wrong with pytz; I think zoneinfo does a better job there. Commented Mar 13, 2021 at 10:00
  • @MrFuppes: It's not clear to me how zoneinfo is going to help parse time strings with (non-standard) time zone abbreviations — which was part of the reason why I invited you to show everyone how to apply it to this problem.
    – martineau
    Commented Mar 13, 2021 at 10:14
  • again, that was not my point. zoneinfo is for handling time zones, not strings. You claim that the standard lib barely supports the first. And I think that claim is misleading since Python 3.9. Commented Mar 13, 2021 at 10:17
0

"I can't order my database" - to be able to do so, you'll either have to convert the strings to datetime objects or to an ordered format (low to high resolution, so year-month-day- etc.) which would allow you to sort strings correctly.

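To illustrate the second option: strings written year-first with zero padding sort chronologically as plain text. A minimal sketch, assuming the zone suffix has already been stripped and all stamps share one time zone; `to_sortable` is a name invented here.

```python
from datetime import datetime

def to_sortable(stamp):
    """Convert 'Mar 10, 2021, 16:05' to '2021-03-10 16:05'."""
    parsed = datetime.strptime(stamp, '%b %d, %Y, %H:%M')
    return parsed.strftime('%Y-%m-%d %H:%M')

stamps = ['Mar 10, 2021, 16:05', 'Mar 08, 2021, 08:00', 'Feb 26, 2021, 11:23']
# Plain string sorting now orders the stamps oldest first.
for s in sorted(to_sortable(s) for s in stamps):
    print(s)
```
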
"I have a hard time to understand how regex works" - while you can use regular expressions here to somehow parse and modify the strings you have, you don't need to.

#1 If you want a convenient option that leaves you with datetime objects, here's one using dateutil:

import dateutil.parser  # submodules must be imported explicitly in older dateutil versions
import dateutil.tz

times = ["17:22 ET", "02:41 ET", "06:14 ET",
         "Mar 10, 2021, 16:05 ET", "Mar 08, 2021, 08:00 ET", "Feb 26, 2021, 11:23 ET"]

tzmapping = {'ET': dateutil.tz.gettz('US/Eastern')}

for t in times:
    print(f"{t:>22} -> {dateutil.parser.parse(t, tzinfos=tzmapping)}")
              17:22 ET -> 2021-03-13 17:22:00-05:00
              02:41 ET -> 2021-03-13 02:41:00-05:00
              06:14 ET -> 2021-03-13 06:14:00-05:00
Mar 10, 2021, 16:05 ET -> 2021-03-10 16:05:00-05:00
Mar 08, 2021, 08:00 ET -> 2021-03-08 08:00:00-05:00
Feb 26, 2021, 11:23 ET -> 2021-02-26 11:23:00-05:00

Note that you can easily tell dateutil's parser to use a certain time zone (e.g. to convert 'ET' to US/Eastern) and it also automatically adds today's date if the date is not present in the input.

#2 If you want to do more of the parsing yourself (probably more efficient), you can do so by extracting the time zone first, then parsing the rest and adding a date where needed:

from datetime import datetime
from zoneinfo import ZoneInfo # Python < 3.9: you can use backports.zoneinfo

# add more entries if you have zones other than ET...
tzmapping = {'ET': ZoneInfo('US/Eastern')}

# get tuples of the input string with tz stripped off and timezone object
times_zones = [(t[:t.rfind(' ')], tzmapping[t.split(' ')[-1]]) for t in times]

# parse to datetime
dt = []
for t, z in times_zones:
    if len(t)>5: # time and date...
        dt.append(datetime.strptime(t, '%b %d, %Y, %H:%M').replace(tzinfo=z))
    else: # time only...
        dt.append(datetime.combine(datetime.now(z).date(), 
                                   datetime.strptime(t, '%H:%M').time()).replace(tzinfo=z))
        
for t, dtobj in zip(times, dt):
    print(f"{t:>22} -> {dtobj}")
4
  • Congrats on showing how to make use of zoneinfo to solve the problem. Regardless, I think both your #1 and #2 exemplify my assertion that "…the standard library barely supports them".
    – martineau
    Commented Mar 13, 2021 at 11:45
  • @martineau: well, I'd say this depends on how you define "support" or what you're used to from other languages - but given the fact that before Python 3.9, even IANA time zone names weren't handled by the standard lib, I'm with you - I wouldn't consider this good support. Regarding the question, ET is unambiguous and a good parser should support the mapping to the IANA tz in my opinion. Commented Mar 13, 2021 at 12:00
  • I think ET was just an example the OP used. I would consider good support being able to parse most standard timezone abbreviations. Since the data is being scraped from an unspecified "news site" the timezones involved could conceivably be worldwide, which makes handling them all a very onerous task — hence the need for some third-party library to handle it.
    – martineau
    Commented Mar 13, 2021 at 12:18
  • @martineau: although widespread, these time zone abbreviations can be a pain, even for a "good" parser - just think about the "BS" time zones, there are three times BST. Which one should the parser choose? That seems error-prone to me as well. Ok, getting off-topic here ;-) Commented Mar 13, 2021 at 12:24
