I am trying to convert a specific log file into a CSV file using sed, awk, and paste on Linux so that I can plot it with gnuplot or MS Excel. However, I have not been able to get the output I want. Here is a sample of the log file:

Feb 15 13:57:08 Program1: The pool size: 100 [High: 80 Norm: 20 Low: 0]
Feb 15 13:58:53 Program1: The pool size: 100 [High: 0 Norm: 100 Low: 0]
Feb 15 13:58:54 Program3: The pool size: 200 [High: 0 Norm: 200 Low: 0]
Feb 15 13:58:56 Program4: The pool size: 100 [High: 0 Norm: 100 Low: 0]
Feb 15 13:58:58 Program1: The pool size: 200 [High: 0 Norm: 200 Low: 0]
Feb 15 13:58:59 Program5: The pool size: 300 [High: 100 Norm: 200 Low: 0]
Feb 15 13:59:05 Program1: The pool size: 100 [High: 0 Norm: 100 Low: 0]
Feb 15 14:00:11 Program2: The pool size: 100 [High: 0 Norm: 100 Low: 0]
Feb 15 14:00:12 Program2: The pool size: 100 [High: 0 Norm: 100 Low: 0]
Feb 15 14:00:13 Program1: The pool size: 200 [High: 0 Norm: 200 Low: 0]
Feb 15 14:00:16 Program4: The pool size: 100 [High: 0 Norm: 100 Low: 0]
Feb 15 14:00:17 Program2: The pool size: 100 [High: 50 Norm: 50 Low: 0]
Feb 15 14:02:28 Program5: The pool size: 100 [High: 0 Norm: 100 Low: 0]
Feb 15 14:02:31 Program1: The pool size: 100 [High: 0 Norm: 100 Low: 0]
Feb 15 14:11:01 Program1: The pool size: 100 [High: 0 Norm: 100 Low: 0]

I want to convert the above data into a CSV file that shows the data for each specific point in time. The output CSV should be in the following format:

TimeStamp,Program1_Total,Program1_High,Program1_Norm,Program1_Low,Program2_Total,Program2_High,Program2_Norm,Program2_Low,Program3_Total,Program3_High,Program3_Norm,Program3_Low,Program4_Total,Program4_High,Program4_Norm,Program4_Low

Feb 15 13:57:08,100,80,20,0,0,0,0,0,0,0,0,0,0,0,0,0
Feb 15 13:58:53,100,0,100,0,0,0,0,0,0,0,0,0,0,0,0,0
...
...

What did I try?

I tried grepping for each program and creating a separate, smaller file per program, like this:

grep "Program1" sample.log > Program1.log
grep "Program2" sample.log > Program2.log

I then tried using the paste command to join them. What I cannot figure out is how to handle the timestamps properly.

Any help will be highly appreciated. Thanks in advance.

  • What have you tried so far? Do you have a sample script that you are working on? Commented Feb 16, 2018 at 7:41
  • You need to be more specific about your output. It looks like you want different lines to be joined into one line, but the "TimeStamp" is different for all these lines. What do you expect? Commented Feb 16, 2018 at 7:43
  • @StefanM I want to have a unique row for every unique timestamp that is present in the log file. If you do not have any data for that program at that point of time, put zeros against that program's data. Commented Feb 16, 2018 at 7:45
  • Please change the input data, then. Because you have no duplicate timestamps and the result won't be satisfactory. And still: What have you tried so far? Commented Feb 16, 2018 at 7:47
  • @StefanM There may or may not be duplicate timestamps. Assuming there can be, how can I achieve this? Commented Feb 16, 2018 at 7:50

3 Answers

I think I found a one-liner solution for your task that uses only the shell and awk. Be advised that it's not pretty at all, and you need to add the header to your output file beforehand:

echo "TimeStamp,P1_Total,P1_High,P1_Norm,P1_Low,P2_Total,P2_High,P2_Norm,P2_Low,P3_Total,P3_High,P3_Norm,P3_Low,P4_Total,P4_High,P4_Norm,P4_Low,P5_Total,P5_High,P5_Norm,P5_Low" > final_output.txt

for i in $(seq 1 5)
do
    l=$((i-1))
    r=$((5-i))
    awk -v left_padd="$l" -v right_padd="$r" -v nb="$i" '
    { gsub(/\]/, "", $14) }                 # strip the closing bracket from the Low value
    $4 == "Program" nb ":" {                # exact match, so Program1 does not match Program10
        printf "%s %s %s", $1, $2, $3       # timestamp
        for (a = 0; a < left_padd; a++) printf ",\t 0,\t 0,\t 0,\t 0"
        printf ",\t %s,\t %s,\t %s,\t %s", $8, $10, $12, $14
        for (b = 0; b < right_padd; b++) printf ",\t 0,\t 0,\t 0,\t 0"
        print ""                            # end the row with a single newline
    }' sample.log
done >> final_output.txt

*** Please note that you must change the 5 in seq 1 5 to the number of Program# entries you want in your output file; I used 5 because that is how many appear in your example. You also need to change the 5 in r=$((5-i)) to the same value.

Explanation:

  • The for loop passes over the file once per Program# entry, and awk searches for that entry on each pass.
  • The l variable counts how many 0 values to add to the left of the program's own values in the table.
  • The r variable does the same as l, except that it counts the 0 values to add on the right.
  • The nb variable stores the Program number so the awk part knows which lines to look for in the input file.
  • The awk command prints the values you asked for from each matching Program# line, together with the preceding and trailing 0 values (four 0s for each Program#) for the other entries in the table.

Edit:

I used \t to delimit the values in awk so the output is easier to read, but you may remove it so that you have only comma-separated values. I also changed the header convention from your question, from Program#_Total to P#_Total, for the same reason.

*I do realize this is not optimal at all, as the file gets parsed once for each Program# entry and you also need to add the header to the output file yourself, but it's the best I could come up with.
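
As a possible alternative to the multi-pass loop, here is a minimal single-pass sketch. It is not the code from this answer: log_to_csv is a hypothetical helper name, and the sketch assumes every log line has exactly the layout shown in the question. It writes the header itself, emits one row per unique timestamp in first-seen order, and fills in 0,0,0,0 for programs that have no entry at that timestamp. Programs numbered above the given count are silently ignored.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original answer).
# Usage: log_to_csv NUM_PROGRAMS < sample.log > final_output.csv
log_to_csv() {
    awk -v n="$1" '
    BEGIN {
        # Header: TimeStamp, then four columns per program.
        printf "TimeStamp"
        for (p = 1; p <= n; p++)
            printf ",Program%d_Total,Program%d_High,Program%d_Norm,Program%d_Low", p, p, p, p
        print ""
    }
    $4 ~ /^Program[0-9]+:$/ {
        ts = $1 " " $2 " " $3
        if (!(ts in seen)) { seen[ts] = 1; order[++count] = ts }  # remember first-seen order
        p = $4
        gsub(/[^0-9]/, "", p)       # "Program3:" -> "3"
        low = $14
        sub(/\]$/, "", low)         # "0]" -> "0"
        # If a program logs twice at the same timestamp, the last line wins.
        val[ts, p] = $8 "," $10 "," $12 "," low
    }
    END {
        for (i = 1; i <= count; i++) {
            line = order[i]
            for (p = 1; p <= n; p++) {
                key = order[i] SUBSEP p
                line = line "," ((key in val) ? val[key] : "0,0,0,0")
            }
            print line
        }
    }'
}
```

With the sample log from the question, `log_to_csv 5 < sample.log > final_output.csv` would produce the header plus one row per timestamp, with no tab padding.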

1 Comment

I had to make a few changes here and there to get this working for me, but it helped me a lot in formulating the solution. Hence, I am marking this answer as accepted.

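Use cut with a space as the delimiter and keep only the fields you need. Then pipe the result to sed to strip the trailing ] and replace the spaces with commas. Note that this also turns the spaces inside the timestamp into commas.

cut -d ' ' -f 1,2,3,8,10,12,14 sample.log | sed 's/]$//; s/ /,/g'

If you need to handle each line individually, you can do the same work inside a while ... read loop.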
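
The while ... read loop mentioned above could look roughly like the sketch below. It is only an illustration: log_lines_to_csv is a hypothetical helper name, and it assumes each line has the 14 space-separated fields shown in the question. Unlike the cut and sed pipeline, it keeps the timestamp in a single column and includes the program name, but it still emits one row per log line rather than one row per timestamp.

```shell
#!/bin/sh
# Hypothetical helper: reads log lines on stdin, writes one CSV row per line.
log_lines_to_csv() {
    # read splits each line on whitespace; "skip" soaks up the words we do
    # not need ("The", "pool", "size:", "[High:", "Norm:", "Low:").
    while read -r mon day time prog skip skip skip total skip high skip norm skip low
    do
        prog=${prog%:}      # "Program1:" -> "Program1"
        low=${low%\]}       # "0]" -> "0"
        printf '%s %s %s,%s,%s,%s,%s,%s\n' \
            "$mon" "$day" "$time" "$prog" "$total" "$high" "$norm" "$low"
    done
}
```

It would be called as `log_lines_to_csv < sample.log`.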

If Perl is an option, how about:

#!/bin/bash

perl -e '
while (<>) {
    if (/^(.{15}) Program(\d+): The pool size: (\d+) \[High: (\d+) Norm: (\d+) Low: (\d+)\]$/) {
        $timestamp = $1;
        $program = $2;
        $size = $3;
        $high = $4;
        $norm = $5;
        $low = $6;
        if (! defined $array{$timestamp}) {
            # it takes care of duplicate timestamps
            push(@timestamps, $timestamp);
        }
        $i = ($program - 1) * 4;
        @{$array{$timestamp}}[$i .. $i + 3] = ($size, $high, $norm, $low);
    }
}
foreach (@timestamps) {
    print "$_,", join(",", map {$_ + 0} @{$array{$_}}[0 .. 15]), "\n";
}' logfile

BTW, it looks like Program5 is excluded from your desired result. If you want to include it, just change the 15 in the second-to-last line to 19.
