Manipulating text using shell script: How can I fill in "missing" lines?

Question

I have a list of data, like a CSV but some lines are missing a value. I'd like to generate a value for the missing line based on the lines before and after using linux shell script.
Take this table for instance.

line	person	age
1	Adam	45
2	Bob	50
3	Cindy	47
4	*	#
5	Ed	49

What I'd like to do is fill in the "*" in line 4 with "Cindy:Ed" (a concatenation of the nearest, valid data in each direction in column B with a ":" delimiter) and the "#" with 48 (the average of 47 and 49, the nearest valid data points in each direction from column C).

Output:

line	person	age
1	Adam	45
2	Bob	50
3	Cindy	47
4	Cindy:Ed	48
5	Ed	49

My data is formatted as a space-delimited text file of arbitrary row count. All rows are three columns.

While I know my way around a For loop and grep etc., I'm at a loss as to how I'd handle this in vanilla linux shell script.

My guess is to make an initial pass to find the lines that have asterisks and hashes. Then make a second pass to replace the asterisks with (awk '{print $2}'):(awk '{print $2}') of the lines before and after, respectively.

If the missing data is on the first or last line, I'm happy to leave it as is. If missing data is on consecutive lines, I'm ok with setting all missing lines to the same "Cindy:Ed" and same average. It'd be even cooler if I could set "Cindy:Ed:1" and Cindy:Ed:2" etc.

An accurate example of worst case scenario raw input: (it's a traceroute with added "#" for the missing latency)


1 192.168.200.2 1
2 192.168.200.1 1
3 10.10.10.1 1
4 11.22.33.44 2
5 11.22.33.55 5
6 * #
7 11.22.44.66 9
8 * #
9 * #
10 8.8.8.0 25
11 * #
12 * #
13 * #

What I'd like:

1 192.168.200.2 1
2 192.168.200.1 1
3 10.10.10.1 1
4 11.22.33.44 2
5 11.22.33.55 5
6 11.22.33.55:11.22.44.66 7
7 11.22.44.66 9
8 11.22.44.66:8.8.8.0 17
9 11.22.44.66:8.8.8.0 17
10 8.8.8.0 25
11 * #
12 * #
13 * #

@KamilMaciorowski Sorry. Good catch! When I got to the second part I changed it to a 9 to make the math cleaner and didn't update the original. fixed now. FWIW, I don't care if the "average" is an integer or floating. — Darius Dauer, Commented Jan 26, 2022 at 17:52
@FelixJN I updated post to say consecutively missing values can search outward to the nearest valid data points. Also row count shouldn't go above 30 if that matters. These are traceroutes after all. — Darius Dauer, Commented Jan 26, 2022 at 17:55
What should the output be if the first line has a * for the 2nd field? — Ed Morton, Commented Jan 26, 2022 at 20:36

FelixJN · Accepted Answer · 2022-01-26 20:46:21Z

With awk:

#if a previous line with proper IP has been read
oldip != "" {
#i is counter for consecutive invalid lines
    i=0
#if IP is set, just print and go to next record
    if ($2!="*") {
        print ; oldip=$2 ; oldlat=$3 ; next
    }
#otherwise get following line and increase counter
    else {
#getline exit status => fails for the last line
        while (getline > 0) {i++
#check if new line has IP, if so
#set IPold:IPnew and average latency value
            if ($2!="*") {
                ipfill=oldip":"$2 ; latfill=(oldlat+$3)/2
#print filler lines for all consecutive records without value
                for (j=1 ; j<=i ; j++) {
                    print NR-i+j-1,ipfill,latfill
#alternative printing with oldIP:newIP:counter
#                   print NR-i+j-1,ipfill":"j,latfill
                }
#save current IP+lat and print "good" line
                oldp=$2; oldlat=$3
                print ; next
            }
        }
    }
#in case getline failed => all previous lines had no value
#just fill them with N/A data as in input
    for (j=0 ; j<=i ; j++) {
        print NR-i+j,"*","#"
    }
}

#If leading lines have no IP value, print them until IP is found
oldip == "" { if ($2=="*") {print ; next} ; oldip=$2 ; oldlat=$3 ; print }

Input:

1 * #
2 * #
3 10.10.10.1 1
4 11.22.33.44 2
5 11.22.33.55 5
6 * #
7 11.22.44.66 10
8 * #
9 * #
10 8.8.8.0 25
11 * #
12 * #
13 * #

Output:

1 * #
2 * #
3 10.10.10.1 1
4 11.22.33.44 2
5 11.22.33.55 5
6 11.22.33.55:11.22.44.66 7.5
7 11.22.44.66 10
8 11.22.33.55:8.8.8.0 17.5
9 11.22.33.55:8.8.8.0 17.5
10 8.8.8.0 25
11 * #
12 * #
13 * #

Alternative output with counter for calculated lines:

1 * #
2 * #
3 10.10.10.1 1
4 11.22.33.44 2
5 11.22.33.55 5
6 11.22.33.55:11.22.44.66:1 7.5
7 11.22.44.66 10
8 11.22.33.55:8.8.8.0:1 17.5
9 11.22.33.55:8.8.8.0:2 17.5
10 8.8.8.0 25
11 * #
12 * #
13 * #

It'll take me a minute to wrap my brain around this but it worked as soon as I figured out how to use it. Thanks! — Darius Dauer, Commented Jan 26, 2022 at 18:58
@EdMorton I see, corrected to while (getline > 0) - I missed this in the manuals: "If there is some error in getting a record, such as a file that cannot be opened, then getline returns -1". The problem regarding output is a bit confusing regarding OPs requirement of "average", but surely can be handled with printf "%s %s %d\n" as alternative. — FelixJN, Commented Jan 26, 2022 at 20:49
@EdMorton it is working. I just "changed the goalpost" When he copied my original input data it had line 6 ending in 10. but I edited line 6 later on to show 9 to make the math prettier. His script works right with the data he used. — Darius Dauer, Commented Jan 26, 2022 at 20:49
@FelixJN yeah, getline is a slippery one which is why I wrote awk.freeshell.org/AllAboutGetline. It's one of those things that looks easy to use but requires a lot of thought and defensive coding and makes future enhancements harder. It's usually simpler to just not use it and let awk's normal processing loop handle the input. — Ed Morton, Commented Jan 26, 2022 at 20:52

Ed Morton · Accepted Answer · 2022-01-27 12:12:54Z

3

$ cat tst.awk
$2 == "*" {
    buf[++bufSz] = $0
    next
}
bufSz > 0 {
    split(prev,p)
    rng = p[2] ":" $2
    val = ($3 + p[3]) / 2
    for (i=1; i<=bufSz; i++) {
        split(buf[i],flds)
        print (prev == "" ? buf[i] : flds[1] OFS rng OFS val)
    }
    bufSz = 0
}
{
    print
    prev = $0
}
END {
    for (i=1; i<=bufSz; i++) {
        print buf[i]
    }
}

$ awk -f tst.awk file
1 192.168.200.2 1
2 192.168.200.1 1
3 10.10.10.1 1
4 11.22.33.44 2
5 11.22.33.55 5
6 11.22.33.55:11.22.44.66 7
7 11.22.44.66 9
8 11.22.44.66:8.8.8.0 17
9 11.22.44.66:8.8.8.0 17
10 8.8.8.0 25
11 * #
12 * #
13 * #

edited Jan 27, 2022 at 12:12

answered Jan 26, 2022 at 20:27

Ed Morton

34.7k6 gold badges24 silver badges55 bronze badges

This fails if the file starts with empty data (i.e. 1 * # as first line). Expected is to see unchanged lines as from the input in this case.
– FelixJN
Commented Jan 26, 2022 at 21:02
@FelixJN that depends on the OPs currently requirements for that case, hence my question. They may simply never have that situation, idk, but obviously it's not part of the sample input/output they provided for us to test with.
– Ed Morton
Commented Jan 26, 2022 at 22:04
1

@FelixJN actually, I see now the OP did say in the text If the missing data is on the first or last line, I'm happy to leave it as is so I updated my answer to accommodate that.
– Ed Morton
Commented Jan 26, 2022 at 22:48

Add a comment |

guest_7 · Accepted Answer · 2022-01-30 08:21:49Z

With GNU sed using it's extended regex mode (-E)

S='(\S+)'; _re="$S $S"
re="^$_re\\n$_re\$"
_avg='1k\4 \2+2/f'
avg='"$(echo '"'$_avg'"'|dc)"'

sed -E '
  s/^(\S+ )[*] #(\n.*\n(.*))/\1\3\2/
  ta
  s/\n.*//

  /[*] #$/b
  $!N;//!ba

  :loop
  ${//q;bb}
  N;//bloop

  :b;h
  s/\n.*\n/\n/
  s/^\S+ //Mg'"
  s#$re#echo '\1:\3' $avg#e
  x;G

  :a
  P;D
" file

Perl using the range operator

perl -lane 'print,next
  unless my $e = /\d$/ ... /\d$/;
  push @A,[@F]; next
  unless $e =~ /E0/ || eof;
  if (@A>2&&$A[-1][-1] =~ /\d/) {
    my($str,$avg);
    for (0,-1) {
      $avg += $A[$_][2] / 2.0;
      $str .= $A[$_][1] . ":";
    }
    $str =~ s/.$//;
    @{$A[$_]}[1,2] = ($str,$avg)
      for 1..$#A-1;
  }
  print "@$_" for splice @A,0,@A-(eof?0:1);
  @A=(); redo if ! eof;
' file

python3 -c 'import sys, itertools as it
prev = ""
p = lambda x: print(*x,sep="",end="")
q = lambda x: x.split()
g = lambda x: x.endswith("* #\n")
with open(sys.argv[1]) as f:
  for t in it.groupby(f,g):
    G = list(t[1])
    if prev == "":
      p(G)
      if not t[0]: prev = G[-1]
    else:
      if t[0]: M = G
      else:
        a,b = map(q,[prev,G[0]])
        x = f"{a[1]}:{b[1]}"
        y = sum(map(int,[a[2],b[2]]))/2.0
        for l in M:
          for e in q(l)[0]:
            print(e,x,y)
        p(G); prev = G[-1]
  p(M)
' file

$ cat file
1 * #
2 * #
3 10.10.10.1 1
4 11.22.33.44 2
5 11.22.33.55 5
6 * #
7 11.22.44.66 10
8 * #
9 * #
10 8.8.8.0 25
11 * #
12 * #
13 * #

Output;

1 * #
2 * #
3 10.10.10.1 1
4 11.22.33.44 2
5 11.22.33.55 5
6 11.22.33.55:11.22.44.66 7.5
7 11.22.44.66 10
8 11.22.44.66:8.8.8.0 17.5
9 11.22.44.66:8.8.8.0 17.5
10 8.8.8.0 25
11 * #
12 * #
13 * #

guest_7 · Accepted Answer · 2022-01-31 08:24:22Z

0

If the file is not huge, we can use the slurp mode (-0777) to solve this:

perl -MList::Util=sum -0777 -pe '
  s{(.*\d\n)
    ((?-x:.*\* \#\n)+)
    (?=(.*\d\n))
  }[$1 . do{my($s,@A);
    push @A,(split)[1,2] for $1,$3;
    $s = join ":", @A[0,2];
    @A = ("", $s, sum(@A[1,3])/2);
    $2 =~ s/^\S+\K.*/@A/gmr;
  }]xge;
' file

answered Jan 31, 2022 at 8:24

guest_7

5,7681 gold badge8 silver badges13 bronze badges

Add a comment |

Praveen Kumar BS · Accepted Answer · 2022-01-28 04:24:35Z

-2

var1=$(awk '{a[++i]=$0}/#/{for(x=NR-1;x<NR;x++)print a[x]}' file.txt | awk '{print $NF}')
var2=$(awk '/#/{x=NR+1}(NR==x){print $NF}' file.txt)

sed -i "s/#/$var3/g" file.txt
sed -i "s/\*/Cindy:Ed/g" file.txt


output

cat file.txt
line    person  age
1       Adam    45
2       Bob     50
3       Cindy   47
4       Cindy:Ed       48
5       Ed      49

answered Jan 28, 2022 at 4:24

Praveen Kumar BS

5,2952 gold badges11 silver badges15 bronze badges

Read the question again, you can’t hard-code “Cindy:Ed”. Where’s var3 set?
– Stephen Kitt
Commented Jan 28, 2022 at 7:41

Add a comment |

Stack Exchange Network

Manipulating text using shell script: How can I fill in "missing" lines?

5 Answers 5

You must log in to answer this question.

Hot Network Questions

Manipulating text using shell script: How can I fill in "missing" lines?

5 Answers 5

You must log in to answer this question.

Related

Hot Network Questions