I'm scraping data from a news site and want to store the time and date these articles were posted. The good thing is that I can pull these timestamps right from the page of the articles.
When the articles I scrape were posted today, the output looks like this:
17:22 ET
02:41 ET
06:14 ET
When the articles were posted earlier than today, the output looks like this:
Mar 10, 2021, 16:05 ET
Mar 08, 2021, 08:00 ET
Feb 26, 2021, 11:23 ET
Current problem: I can't order my database by posting time, because articles that were posted on the day I run the program are stored with only a time. Over multiple days, this leaves many articles whose stamp looks as if they were posted on whatever day you look at the database, since there is only a time.
What I want: Add the current month, day, and year in front of the timestamp, matching the format already used for older articles.
My idea: I have a hard time understanding how regex works. My idea would be to check the length of the imported string. If it is exactly 8, I want to add the month, day, and year in front. But I don't know a) whether this is the most efficient approach and b) most importantly, how to code this seemingly easy idea.
I would gladly appreciate it if someone could show me how to code this. The current line which grabs the time looks like this:
article_time = item.select_one('h3 small').text
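A minimal sketch of the length-check idea, with no regex needed. The function name `add_missing_date` and the `today` parameter are made up for illustration. It assumes that time-only stamps are always exactly 8 characters (`HH:MM ET`), that "ET" means US Eastern time (so "today" is taken in the `America/New_York` zone), and that month abbreviations are formatted in an English locale.

```python
from datetime import date, datetime
from typing import Optional
from zoneinfo import ZoneInfo  # standard library, Python 3.9+


def add_missing_date(raw: str, today: Optional[date] = None) -> str:
    """Prefix a time-only stamp such as '17:22 ET' with a date.

    Hypothetical helper: stamps that already carry a date are returned
    unchanged (apart from stripped whitespace).
    """
    raw = raw.strip()
    if len(raw) != 8:
        # Longer strings like 'Mar 10, 2021, 16:05 ET' already have a date.
        return raw
    if today is None:
        # Assumption: the site's "ET" follows US Eastern time.
        today = datetime.now(ZoneInfo("America/New_York")).date()
    # '%b %d, %Y' gives e.g. 'Mar 11, 2021' in an English locale.
    return f"{today.strftime('%b %d, %Y')}, {raw}"
```

It could be applied directly to the scraped text, e.g. `article_time = add_missing_date(item.select_one('h3 small').text)`. Passing `today` explicitly is only there to make the function easy to test.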
You could use arrow or dateparser (which was specifically developed to handle formats commonly used on websites, like "x minutes ago"). If I remember correctly, if the date is missing they will use the current date as default. You can also use the replace() method to add things to a datetime object.
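As a rough sketch of the `replace()` route, here is a version that uses only the standard library rather than arrow or dateparser. The name `parse_article_time` and the `today` parameter are hypothetical. It assumes the two formats shown in the question, an English locale for month abbreviations, and that "ET" is US Eastern time. The result is a naive `datetime` in Eastern wall-clock time, which sorts correctly, unlike the raw strings.

```python
from datetime import date, datetime
from typing import Optional
from zoneinfo import ZoneInfo  # standard library, Python 3.9+


def parse_article_time(raw: str, today: Optional[date] = None) -> datetime:
    """Turn '17:22 ET' or 'Mar 10, 2021, 16:05 ET' into a datetime."""
    text = raw.strip()
    if text.endswith(" ET"):
        text = text[:-3]  # drop the zone label before parsing
    try:
        # Full stamp, e.g. 'Mar 10, 2021, 16:05'
        return datetime.strptime(text, "%b %d, %Y, %H:%M")
    except ValueError:
        pass
    # Time-only stamp: strptime fills in 1900-01-01 as the date.
    parsed = datetime.strptime(text, "%H:%M")
    if today is None:
        # Assumption: the site's "ET" follows US Eastern time.
        today = datetime.now(ZoneInfo("America/New_York")).date()
    # replace() swaps in the real date while keeping hour and minute.
    return parsed.replace(year=today.year, month=today.month, day=today.day)
```

Storing `parse_article_time(article_time).isoformat()` (or the `datetime` itself, if the database has a timestamp type) should let the table be ordered by posting time.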