I have a CSV with a few hundred thousand lines and I'm trying to change the date format in the second field. I should also add that the second field is sometimes not populated at all.
The deplorable input format is DayofWeek MonthofYear DayofMonth Hour:Minute:Second Timezone Year
Example:
Mon Jul 03 14:48:54 EDT 2023
My desired output format is YYYY-MM-DD HH:MM:SS
Example:
2023-07-03 14:48:54
I am familiar with sed, so I got this sed regex replace line to get it in almost the right format, but the month not being a number is an issue.
sed -E "s/[A-Za-z]{3}\s([A-Za-z]{3})\s([0-9]{2})\s([0-9]{2}:[0-9]{2}:[0-9]{2})\s[A-Z]+\s([0-9]{4})/\4-\1-\2 \3/"
I don't think it's possible to run the date command inside the sed replacement section using capture group 1 (but please correct me if I'm wrong).
I don't know how to go about referencing the month and parsing it with the date command once the sed command finishes, and I think it would be better to do the processing without piping the entire output to another command. This command is just one in a long line of piped commands for formatting the rest of the data.
It seems that maybe awk can do the entire formatting all at once, but I don't really know how to use awk that well.
What's the most efficient way to get the timestamp into the correct format?
Just to address some of the comments with more background info:
This data is generated by an app that outputs CSV log data to a file. It is not my app and there is no configuration control over how the app logs. The CSV is unquoted (even if data in the field contains spaces) and empty fields contain nothing.
I am loading the csv data directly into a mysql database. While timezone would be a good idea generally, this data is always timestamped with the local time and when visualizing the data (grafana), I have no need to store it in UTC then convert to EDT just for viewing (why convert the time to UTC just to convert it back to EDT). Plus, each csv line contains longitude and latitude (so if I wanted to go back and change the timestamp to UTC, it wouldn't be impossible to figure out what local time was).
The additional formatting I am doing is not much, and probably could be done with awk (again, I am not too familiar with the syntax there). It doesn't help that the original data needs an ID column added and quotes put around some fields, and that there are two date-time fields in TWO different formats. So my long and terrible pipeline generally looks like this:
cat file | add ID column | format timestamp in second csv field | format timestamp in third csv field | quote any field with spaces | replace empty fields with \N > output file
I had some trouble with mysql and empty fields, so I added the explicit null character. There are definitely better ways to do this; once I get the whole process working, I'll go back through and simplify.
I really do appreciate everyone's responses.
perl script.perl will split on spaces, can use an associative array to translate the months ($mnumber{"Jul"}="07"), easy.
csv file I/O, ... but there is a significant learning curve.
date, because it happens to be designed for that purpose. If the timestamp is at a given place inside a line and you have GNU sed, I suggest using the e (execute) flag of the s (substitute) command (if you are not scared by the security implications).
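The month-lookup idea mentioned above (an associative array mapping "Jul" to "07") also works in awk, which avoids spawning date once per line. What follows is only a minimal sketch, not a tested solution for the real data: it uses awk rather than perl, the function name fix_timestamp is made up for illustration, and it assumes the file is comma-separated with no commas inside fields and that the timestamp is always in field 2 in the form shown in the question. The timezone is dropped, as the question asks, and empty or unrecognised values are left untouched.

```shell
#!/bin/sh
# Hypothetical helper (name is illustrative): reads unquoted CSV on stdin and
# rewrites field 2 from "Mon Jul 03 14:48:54 EDT 2023" to "2023-07-03 14:48:54".
# Assumption: fields are separated by plain commas and never contain commas.
fix_timestamp() {
  awk -F',' -v OFS=',' '
    BEGIN {
      # Build the lookup table: mon["Jan"] = "01" ... mon["Dec"] = "12"
      n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
      for (i = 1; i <= n; i++) mon[names[i]] = sprintf("%02d", i)
    }
    {
      # Split field 2 on whitespace: weekday, month, day, time, zone, year.
      # Empty or unexpected values fail this check and pass through unchanged.
      if (split($2, p, " ") == 6 && (p[2] in mon))
        $2 = p[6] "-" mon[p[2]] "-" sprintf("%02d", p[3]) " " p[4]
      print
    }
  '
}
```

In the pipeline from the question it would take the place of the "format timestamp in second csv field" stage, for example: cat file | fix_timestamp > output. Since awk already splits the line into fields, the other stages (ID column, quoting, \N for empty fields) could probably be folded into the same script later.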