2

I have a CSV with a few hundred thousand lines and I'm trying to change the date format in the second field. I should add that the second field is sometimes not populated at all. The deplorable input format is DayofWeek MonthofYear DayofMonth Hour:Minute:Second Timezone Year

Example:

Mon Jul 03 14:48:54 EDT 2023

My desired output format is YYYY-MM-DD HH:MM:SS. Example:

2023-07-03 14:48:54

I am familiar with sed, so I got this sed regex replace line to get it in almost the right format, but the month not being a number is an issue.

sed -E "s/[A-Za-z]{3}\s([A-Za-z]{3})\s([0-9]{2})\s([0-9]{2}:[0-9]{2}:[0-9]{2})\s[A-Z]+\s([0-9]{4})/\4-\1-\2 \3/"

I don't think it's possible to run the date command inside the sed replace section using capture group 1 (but please correct me if I'm wrong).

I don't know how to go about referencing the month and parsing it with the date command once the sed command finishes, and I think it would be better to do the processing without piping the entire output to another command. This command is just one in a long line of piped commands for formatting the rest of the data.

It seems that maybe awk can do the entire formatting all at once, but I don't really know how to use awk that well.

What's the most efficient way to get the timestamp into the correct format?

Just to address some of the comments with more background info:

This data is generated by an app that outputs CSV log data to a file. It is not my app and there is no configuration control over how the app logs. The CSV is unquoted (even if data in the field contains spaces) and empty fields contain nothing.

I am loading the csv data directly into a mysql database. While timezone would be a good idea generally, this data is always timestamped with the local time and when visualizing the data (grafana), I have no need to store it in UTC then convert to EDT just for viewing (why convert the time to UTC just to convert it back to EDT). Plus, each csv line contains longitude and latitude (so if I wanted to go back and change the timestamp to UTC, it wouldn't be impossible to figure out what local time was).

The additional formatting I am doing is not much, and probably could be done with awk (again, I am not too familiar with the syntax there). It doesn't help that the original data needs an ID column added and quotes put around some fields, and that there are two date-time fields in TWO different formats. So my long and terrible pipeline generally looks like this:

cat file | add ID column | format timestamp in second csv field | format timestamp in third csv field | quote any field with spaces | replace empty fields with \N > output file

I had some trouble with mysql and empty fields, so I added the explicit null character. There are definitely better ways to do this; once I get the whole process working I'll go back through and simplify.

I very much appreciate everyone's responses.

9
  • 1
    I'd use a perl script. perl will split on spaces, can use an associative array to translate the months ($mnumber{"Jul"}="07"), easy .csv file I/O, ... but there is a significant learning curve.
    – waltinator
    Commented Jul 4, 2023 at 4:27
  • is the time zone required in the output?
    – memchr
    Commented Jul 4, 2023 at 4:39
  • 1
    The most efficient way is to pipe it through date, because it happens to be designed for that purpose. If the timestamp is at a given place inside a line and you have GNU sed, I suggest using the execute flag of the substitute command (if you are not scared by the security implications).
    – Philippos
    Commented Jul 4, 2023 at 5:16
  • 2
    What should the output be in the case you mention where "the second field is sometimes not populated at all"? Are your fields quoted or not? Can any field include commas, escaped quotes, or newlines? Please edit your question to answer those questions and include concise, testable sample input and expected output.
    – Ed Morton
    Commented Jul 4, 2023 at 10:41

4 Answers

8

With GNU sed, you can use the s///e modifier to execute the resultant string:

s/.*/date -d "&" +"%F %T"/e

Better than that, though, is to use GNU date's -f flag, which itself can process lines of input, rather than spawning a new process for each line:

$ TZ=UTC0 date -f /dev/stdin +'%F %T' <<<$'Mon Jul 03 14:48:54 EDT 2023\nTue, 04 Jul 2023 11:30:45 +0100'
2023-07-03 18:48:54
2023-07-04 10:30:45

This is also much safer if you're not in a position to trust the input.

5
  • The sed one is as dangerous and as inefficient as the awk + |getline version though and also needs GNU date (see my comments under there). Commented Jul 4, 2023 at 12:04
  • Yes, @Stéphane; that's why I only mentioned it as an introducer to the real answer. Commented Jul 4, 2023 at 12:06
  • 1
    Note that GNU sed's e like awk's |getline runs sh to interpret the shell code resulting from the substitution, so that's both sh and date being executed, possibly in 2 processes with those sh implementations that don't optimise out the fork for the last command. Commented Jul 4, 2023 at 12:09
  • The second method doesn't seem to be able to process the whole CSV file. It says invalid date.
    – memchr
    Commented Jul 4, 2023 at 16:08
  • @memchr - feeding it the entire CSV file is obviously the wrong thing to do. The example input and desired output shown in the question implies that the column had already been extracted. If not, it's easy to apply the CSV equivalent of cut and paste to do so. Commented Jul 5, 2023 at 6:27
7

You could do something like:

LC_ALL=C sed '
  s/$/;Jan01Feb02Mar03Apr04May05Jun06Jul07Aug08Sep09Oct10Nov11Dec12/
  s/[A-Z][a-z][a-z] \([A-Z][a-z][a-z]\) \([0-9][0-9]\) \([0-2][0-9]:[0-5][0-9]:[0-5][0-9]\) [A-Z]\{3,\} \([0-9]\{4\}\)\(.*;.*\1\([01][0-9]\)[^;]*\)$/\4-\6-\2 \3\5/
  s/;[^;]*$//'

Here we first append a month-name-to-number translation table to the end of the line (separated by a ;), then look up the number for a given month name with a back-reference (for which we need BREs, not EREs¹): in a regexp of the form ...\([A-Z][a-z][a-z]\)...;.*\1\([01][0-9]\)..., \1 refers back to the month name captured in the text, and the two digits that follow it in the table end up in \6.

Then we remove our translation table.
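
For readers more at home in Python, the same lookup-table trick can be sketched with the re module, whose patterns also support back-references. This is only an illustration of the technique, not part of the sed answer: the names TABLE, PATTERN and convert are made up here, and each input is assumed to be a single line with its trailing newline already removed.

```python
import re

# Month-name-to-number table, appended to each line after a ";" separator.
TABLE = ";Jan01Feb02Mar03Apr04May05Jun06Jul07Aug08Sep09Oct10Nov11Dec12"

# Groups: 1 = month name, 2 = day, 3 = time, 4 = year,
# 5 = rest of the line including the table, 6 = month number found via \1.
PATTERN = re.compile(
    r"[A-Z][a-z]{2} ([A-Z][a-z]{2}) (\d{2}) (\d{2}:\d{2}:\d{2}) [A-Z]{3,} (\d{4})"
    r"(.*;.*\1([01]\d)[^;]*)$"
)


def convert(line):
    """Reformat the first timestamp in line, leaving other text untouched."""
    work = line + TABLE                                  # 1. append the table
    work = PATTERN.sub(r"\4-\6-\2 \3\5", work, count=1)  # 2. look up the month
    return re.sub(r";[^;]*$", "", work)                  # 3. remove the table


print(convert("1,Mon Jul 03 14:48:54 EDT 2023,foo bar"))
```

A line whose timestamp field is empty does not match the pattern, so it comes back unchanged once the table has been stripped again.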

If there may be more than one timestamp to be converted per line, change it to:

LC_ALL=C sed '
  s/$/;Jan01Feb02Mar03Apr04May05Jun06Jul07Aug08Sep09Oct10Nov11Dec12/
  :1
    s/[A-Z][a-z][a-z] \([A-Z][a-z][a-z]\) \([0-9][0-9]\) \([0-2][0-9]:[0-5][0-9]:[0-5][0-9]\) [A-Z]\{3,\} \([0-9]\{4\}\)\(.*;.*\1\([01][0-9]\)[^;]*\)$/\4-\6-\2 \3\5/
  t1
  s/;[^;]*$//'

where t1 branches to the :1 label only if a substitution has succeeded, which is a common way to implement loops in sed.
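
As a rough Python analogue of that :1 ... t1 loop (again only a sketch, with the hypothetical names TABLE, PATTERN and convert_all), re's subn() reports how many substitutions it made, so the loop can stop as soon as a pass changes nothing:

```python
import re

# Same lookup table and pattern as in the single-timestamp case.
TABLE = ";Jan01Feb02Mar03Apr04May05Jun06Jul07Aug08Sep09Oct10Nov11Dec12"
PATTERN = re.compile(
    r"[A-Z][a-z]{2} ([A-Z][a-z]{2}) (\d{2}) (\d{2}:\d{2}:\d{2}) [A-Z]{3,} (\d{4})"
    r"(.*;.*\1([01]\d)[^;]*)$"
)


def convert_all(line):
    """Reformat every timestamp in line, one substitution per pass."""
    work = line + TABLE
    while True:
        # Replace the leftmost remaining timestamp, like s/// without the g flag.
        work, replaced = PATTERN.subn(r"\4-\6-\2 \3\5", work, count=1)
        if replaced == 0:  # like "t1" not branching: nothing was substituted
            break
    return re.sub(r";[^;]*$", "", work)


print(convert_all("Mon Jul 03 14:48:54 EDT 2023,Tue Aug 01 01:02:03 EDT 2023"))
```

The loop terminates because a converted timestamp no longer contains a month name followed by a space, so it cannot match again.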

For arbitrary header-less CSVs, only reformatting the first field:

mlr --csv -N put '$1 = strftime(strptime($1, "%a %b %d %H:%M:%S %Z %Y"), "%F %T")'

(adapted from @Kusalananda's answer to How can I transform dates expressed with month names to numerical month designations?).

Miller's strptime() complains if the timestamp cannot be decoded but apparently not in the case where the field is empty.

%Z is not among the directives recognised by standard strptime(), but the GNU implementation at least recognises and ignores it (and consumes \s*\S* from the input; there's nothing it could do with it, given that abbreviations such as EDT mean different things at different times and to different people).


¹ though some sed implementations, including GNU sed which you may be using as you're using the \s GNUism, support them for EREs as well as an extension over the standard.

4
  • 1
    Can you explain the sed command a little more? I have never seen a multi expression reference a substitution in a previous line, how does the first substitution get the month capture group passed to it? I am thinking I could format both timestamps in my csv, then go back and regex find the 3 letter month name and then do the lookup for the number.
    – user432564
    Commented Jul 4, 2023 at 15:39
  • @ehammer, see edit. Commented Jul 4, 2023 at 16:32
  • See also pass arguments to the date command in LHS of sed. This is a common technique in sed, not limited to month name translation. Commented Jul 4, 2023 at 16:36
  • 1
    After understanding the sed, that's so genius!
    – annahri
    Commented Jul 4, 2023 at 16:40
3

You did mention:

I'm trying to change the date format in the second field. I should also add the second field is sometimes not populated at all.

The following awk script will satisfy the requirement. Save it as date.awk (thanks to @EdMorton for the nitpicks):

BEGIN {
  FS = OFS = ","
  months = "JanFebMarAprMayJunJulAugSepOctNovDec" 
}

$2 != "" {
  split($2, date, / /)
  month = sprintf("%02d", (index(months, date[2]) + 2) / 3)
  $2 = sprintf("%04d-%02d-%02d %s", date[6], month, date[3], date[4])
}

1

then execute awk using the script:

awk -f date.awk input.csv
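
The month lookup in that script is plain arithmetic on string positions: awk's index() is 1-based, so "Jul" is found at position 19 and (19 + 2) / 3 = 7. A minimal Python sketch of the same idea follows, assuming an unquoted CSV whose fields contain no commas; the function names month_number and reformat_second_field are made up for illustration. Python's str.find() is 0-based, hence the slightly different formula.

```python
MONTHS = "JanFebMarAprMayJunJulAugSepOctNovDec"


def month_number(name):
    """Map a three-letter month name to 1..12 using its offset in MONTHS."""
    pos = MONTHS.find(name)
    if len(name) != 3 or pos < 0 or pos % 3 != 0:
        raise ValueError(f"not a month abbreviation: {name!r}")
    return pos // 3 + 1  # "Jul" is at offset 18, and 18 // 3 + 1 == 7


def reformat_second_field(line):
    """Rewrite field 2 as YYYY-MM-DD HH:MM:SS; leave an empty field alone."""
    fields = line.split(",")
    if len(fields) > 1 and fields[1] != "":
        _weekday, month, day, clock, _zone, year = fields[1].split(" ")
        fields[1] = f"{int(year):04d}-{month_number(month):02d}-{int(day):02d} {clock}"
    return ",".join(fields)


print(reformat_second_field("1,Mon Jul 03 14:48:54 EDT 2023,x"))
print(reformat_second_field("2,,y"))
```

Like the awk script with its trailing 1, this returns every line, converted or not, so lines with an empty second field are passed through rather than dropped.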

Original Answer

You can easily change date formats using the date command. For instance:

$ date -d "Mon Jul 03 14:48:54 EDT 2023" +"%Y-%m-%d %H:%M:%S"
2023-07-03 14:48:54

Then, you can use awk to convert only a specific column (in this example, $1):

awk 'BEGIN {FS=OFS=","} {"date -d \"" $1 "\" +\"%Y-%m-%d %H:%M:%S\"" | getline res; $1=res; print}' file.csv

The result will be in your local time zone. To get it in a different zone, set TZ before date; use a full zone name such as TZ=America/New_York rather than a bare abbreviation such as EDT, which typically isn't a valid TZ value on its own.

However, as @StéphaneChazelas mentioned in the comments, this is vulnerable to command injection if a field in some line contains a malicious command, and it will run slowly because it needs to execute sh and date for each line.

6
  • 1
    Note that this assumes GNU date and that it is a command injection vulnerability if the second field is not guaranteed to be a date. It also means forking one, possibly two, processes per line and executing sh and date for each line of the file, so it is going to be very inefficient. Commented Jul 4, 2023 at 6:47
  • 1
    nitpick - move months = "JanFebMarAprMayJunJulAugSepOctNovDec" into the BEGIN section so you do it once instead of once per input line.
    – Ed Morton
    Commented Jul 4, 2023 at 10:38
  • 1
    Other nitpick - you could make the main part a bit more concise as {split($1, d, / /); month = (index(months, d[2]) + 2) / 3; $1 = sprintf("%04d-%02d-%02d %s", d[6], month, d[3], d[4]); print} (not suggesting cramming it all on 1 line).
    – Ed Morton
    Commented Jul 4, 2023 at 10:49
  • Thank you for the awk script. That did the job well.
    – user432564
    Commented Jul 4, 2023 at 22:34
    I've noticed a small issue with the awk script. If the input file never contains a timestamp in the second field, then the awk command doesn't produce any output. The $2 != "" condition doesn't seem to act fully like an if statement: when it doesn't match, the original line just isn't output.
    – user432564
    Commented Jul 11, 2023 at 20:49
2

If efficiency is a concern, it is better to use a scripting language, since that avoids spawning an external command for every line.

Here is an example Python script, for reference only:

from datetime import datetime
import re
import csv


def convert_datetime(dt):
    # `EDT` isn't in zoneinfo, so the timezone abbreviation is removed before parsing
    date_string = re.sub(r"(\w+ \w+ \d+ \d+:\d+:\d+) \w+ (\w+)", r"\1 \2", dt)
    date_obj = datetime.strptime(date_string, "%a %b %d %H:%M:%S %Y")
    return date_obj.strftime("%Y-%m-%d %H:%M:%S")


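# newline="" lets the csv module handle line endings itself, as its documentation recommends
with open("original.csv", "r", newline="") as infile, open("processed.csv", "w", newline="") as outfile: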
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    header = next(reader, None)
    if header:
        writer.writerow(header)
    for row in reader:
        # convert datetime in the second field
        try:
            row[1] = convert_datetime(row[1])
        except ValueError:
            pass
        writer.writerow(row)
