4

I have a list of data, like a CSV but some lines are missing a value. I'd like to generate a value for the missing line based on the lines before and after using linux shell script.
Take this table for instance.

line person age
1 Adam 45
2 Bob 50
3 Cindy 47
4 * #
5 Ed 49

What I'd like to do is fill in the "*" in line 4 with "Cindy:Ed" (a concatenation of the nearest, valid data in each direction in column B with a ":" delimiter) and the "#" with 48 (the average of 47 and 49, the nearest valid data points in each direction from column C).

Output:

line person age
1 Adam 45
2 Bob 50
3 Cindy 47
4 Cindy:Ed 48
5 Ed 49

My data is formatted as a space-delimited text file of arbitrary row count. All rows are three columns.

While I know my way around a For loop and grep etc., I'm at a loss as to how I'd handle this in vanilla linux shell script.

My guess is to make an initial pass to find the lines that have asterisks and hashes. Then make a second pass to replace the asterisks with (awk '{print $2}'):(awk '{print $2}') of the lines before and after, respectively.

If the missing data is on the first or last line, I'm happy to leave it as is. If missing data is on consecutive lines, I'm ok with setting all missing lines to the same "Cindy:Ed" and same average. It'd be even cooler if I could set "Cindy:Ed:1" and Cindy:Ed:2" etc.

An accurate example of worst case scenario raw input: (it's a traceroute with added "#" for the missing latency)


1 192.168.200.2 1
2 192.168.200.1 1
3 10.10.10.1 1
4 11.22.33.44 2
5 11.22.33.55 5
6 * #
7 11.22.44.66 9
8 * #
9 * #
10 8.8.8.0 25
11 * #
12 * #
13 * #

What I'd like:

1 192.168.200.2 1
2 192.168.200.1 1
3 10.10.10.1 1
4 11.22.33.44 2
5 11.22.33.55 5
6 11.22.33.55:11.22.44.66 7
7 11.22.44.66 9
8 11.22.44.66:8.8.8.0 17
9 11.22.44.66:8.8.8.0 17
10 8.8.8.0 25
11 * #
12 * #
13 * #
4
  • 1
    And for consecutively missing values?
    – FelixJN
    Commented Jan 26, 2022 at 17:31
  • @KamilMaciorowski Sorry. Good catch! When I got to the second part I changed it to a 9 to make the math cleaner and didn't update the original. fixed now. FWIW, I don't care if the "average" is an integer or floating. Commented Jan 26, 2022 at 17:52
  • 1
    @FelixJN I updated post to say consecutively missing values can search outward to the nearest valid data points. Also row count shouldn't go above 30 if that matters. These are traceroutes after all. Commented Jan 26, 2022 at 17:55
  • What should the output be if the first line has a * for the 2nd field?
    – Ed Morton
    Commented Jan 26, 2022 at 20:36

5 Answers 5

3

With awk:

#if a previous line with proper IP has been read
oldip != "" {
#i is counter for consecutive invalid lines
    i=0
#if IP is set, just print and go to next record
    if ($2!="*") {
        print ; oldip=$2 ; oldlat=$3 ; next
    }
#otherwise get following line and increase counter
    else {
#getline exit status => fails for the last line
        while (getline > 0) {i++
#check if new line has IP, if so
#set IPold:IPnew and average latency value
            if ($2!="*") {
                ipfill=oldip":"$2 ; latfill=(oldlat+$3)/2
#print filler lines for all consecutive records without value
                for (j=1 ; j<=i ; j++) {
                    print NR-i+j-1,ipfill,latfill
#alternative printing with oldIP:newIP:counter
#                   print NR-i+j-1,ipfill":"j,latfill
                }
#save current IP+lat and print "good" line
                oldp=$2; oldlat=$3
                print ; next
            }
        }
    }
#in case getline failed => all previous lines had no value
#just fill them with N/A data as in input
    for (j=0 ; j<=i ; j++) {
        print NR-i+j,"*","#"
    }
}

#If leading lines have no IP value, print them until IP is found
oldip == "" { if ($2=="*") {print ; next} ; oldip=$2 ; oldlat=$3 ; print }

Input:

1 * #
2 * #
3 10.10.10.1 1
4 11.22.33.44 2
5 11.22.33.55 5
6 * #
7 11.22.44.66 10
8 * #
9 * #
10 8.8.8.0 25
11 * #
12 * #
13 * #

Output:

1 * #
2 * #
3 10.10.10.1 1
4 11.22.33.44 2
5 11.22.33.55 5
6 11.22.33.55:11.22.44.66 7.5
7 11.22.44.66 10
8 11.22.33.55:8.8.8.0 17.5
9 11.22.33.55:8.8.8.0 17.5
10 8.8.8.0 25
11 * #
12 * #
13 * #

Alternative output with counter for calculated lines:

1 * #
2 * #
3 10.10.10.1 1
4 11.22.33.44 2
5 11.22.33.55 5
6 11.22.33.55:11.22.44.66:1 7.5
7 11.22.44.66 10
8 11.22.33.55:8.8.8.0:1 17.5
9 11.22.33.55:8.8.8.0:2 17.5
10 8.8.8.0 25
11 * #
12 * #
13 * #
5
  • It'll take me a minute to wrap my brain around this but it worked as soon as I figured out how to use it. Thanks! Commented Jan 26, 2022 at 18:58
  • @DariusDauer feel free to ask if stuck.
    – FelixJN
    Commented Jan 26, 2022 at 19:03
  • 1
    @EdMorton I see, corrected to while (getline > 0) - I missed this in the manuals: "If there is some error in getting a record, such as a file that cannot be opened, then getline returns -1". The problem regarding output is a bit confusing regarding OPs requirement of "average", but surely can be handled with printf "%s %s %d\n" as alternative.
    – FelixJN
    Commented Jan 26, 2022 at 20:49
  • 1
    @EdMorton it is working. I just "changed the goalpost" When he copied my original input data it had line 6 ending in 10. but I edited line 6 later on to show 9 to make the math prettier. His script works right with the data he used. Commented Jan 26, 2022 at 20:49
  • 1
    @FelixJN yeah, getline is a slippery one which is why I wrote awk.freeshell.org/AllAboutGetline. It's one of those things that looks easy to use but requires a lot of thought and defensive coding and makes future enhancements harder. It's usually simpler to just not use it and let awk's normal processing loop handle the input.
    – Ed Morton
    Commented Jan 26, 2022 at 20:52
3
$ cat tst.awk
$2 == "*" {
    buf[++bufSz] = $0
    next
}
bufSz > 0 {
    split(prev,p)
    rng = p[2] ":" $2
    val = ($3 + p[3]) / 2
    for (i=1; i<=bufSz; i++) {
        split(buf[i],flds)
        print (prev == "" ? buf[i] : flds[1] OFS rng OFS val)
    }
    bufSz = 0
}
{
    print
    prev = $0
}
END {
    for (i=1; i<=bufSz; i++) {
        print buf[i]
    }
}

$ awk -f tst.awk file
1 192.168.200.2 1
2 192.168.200.1 1
3 10.10.10.1 1
4 11.22.33.44 2
5 11.22.33.55 5
6 11.22.33.55:11.22.44.66 7
7 11.22.44.66 9
8 11.22.44.66:8.8.8.0 17
9 11.22.44.66:8.8.8.0 17
10 8.8.8.0 25
11 * #
12 * #
13 * #
3
  • This fails if the file starts with empty data (i.e. 1 * # as first line). Expected is to see unchanged lines as from the input in this case.
    – FelixJN
    Commented Jan 26, 2022 at 21:02
  • @FelixJN that depends on the OPs currently requirements for that case, hence my question. They may simply never have that situation, idk, but obviously it's not part of the sample input/output they provided for us to test with.
    – Ed Morton
    Commented Jan 26, 2022 at 22:04
  • 1
    @FelixJN actually, I see now the OP did say in the text If the missing data is on the first or last line, I'm happy to leave it as is so I updated my answer to accommodate that.
    – Ed Morton
    Commented Jan 26, 2022 at 22:48
0

With GNU sed using it's extended regex mode (-E)

S='(\S+)'; _re="$S $S"
re="^$_re\\n$_re\$"
_avg='1k\4 \2+2/f'
avg='"$(echo '"'$_avg'"'|dc)"'

sed -E '
  s/^(\S+ )[*] #(\n.*\n(.*))/\1\3\2/
  ta
  s/\n.*//

  /[*] #$/b
  $!N;//!ba

  :loop
  ${//q;bb}
  N;//bloop

  :b;h
  s/\n.*\n/\n/
  s/^\S+ //Mg'"
  s#$re#echo '\1:\3' $avg#e
  x;G

  :a
  P;D
" file

Perl using the range operator

perl -lane 'print,next
  unless my $e = /\d$/ ... /\d$/;
  push @A,[@F]; next
  unless $e =~ /E0/ || eof;
  if (@A>2&&$A[-1][-1] =~ /\d/) {
    my($str,$avg);
    for (0,-1) {
      $avg += $A[$_][2] / 2.0;
      $str .= $A[$_][1] . ":";
    }
    $str =~ s/.$//;
    @{$A[$_]}[1,2] = ($str,$avg)
      for 1..$#A-1;
  }
  print "@$_" for splice @A,0,@A-(eof?0:1);
  @A=(); redo if ! eof;
' file

python3 -c 'import sys, itertools as it
prev = ""
p = lambda x: print(*x,sep="",end="")
q = lambda x: x.split()
g = lambda x: x.endswith("* #\n")
with open(sys.argv[1]) as f:
  for t in it.groupby(f,g):
    G = list(t[1])
    if prev == "":
      p(G)
      if not t[0]: prev = G[-1]
    else:
      if t[0]: M = G
      else:
        a,b = map(q,[prev,G[0]])
        x = f"{a[1]}:{b[1]}"
        y = sum(map(int,[a[2],b[2]]))/2.0
        for l in M:
          for e in q(l)[0]:
            print(e,x,y)
        p(G); prev = G[-1]
  p(M)
' file

$ cat file
1 * #
2 * #
3 10.10.10.1 1
4 11.22.33.44 2
5 11.22.33.55 5
6 * #
7 11.22.44.66 10
8 * #
9 * #
10 8.8.8.0 25
11 * #
12 * #
13 * #

Output;

1 * #
2 * #
3 10.10.10.1 1
4 11.22.33.44 2
5 11.22.33.55 5
6 11.22.33.55:11.22.44.66 7.5
7 11.22.44.66 10
8 11.22.44.66:8.8.8.0 17.5
9 11.22.44.66:8.8.8.0 17.5
10 8.8.8.0 25
11 * #
12 * #
13 * #
0

If the file is not huge, we can use the slurp mode (-0777) to solve this:

perl -MList::Util=sum -0777 -pe '
  s{(.*\d\n)
    ((?-x:.*\* \#\n)+)
    (?=(.*\d\n))
  }[$1 . do{my($s,@A);
    push @A,(split)[1,2] for $1,$3;
    $s = join ":", @A[0,2];
    @A = ("", $s, sum(@A[1,3])/2);
    $2 =~ s/^\S+\K.*/@A/gmr;
  }]xge;
' file
-2
var1=$(awk '{a[++i]=$0}/#/{for(x=NR-1;x<NR;x++)print a[x]}' file.txt | awk '{print $NF}')
var2=$(awk '/#/{x=NR+1}(NR==x){print $NF}' file.txt)

sed -i "s/#/$var3/g" file.txt
sed -i "s/\*/Cindy:Ed/g" file.txt


output

cat file.txt
line    person  age
1       Adam    45
2       Bob     50
3       Cindy   47
4       Cindy:Ed       48
5       Ed      49
1
  • Read the question again, you can’t hard-code “Cindy:Ed”. Where’s var3 set? Commented Jan 28, 2022 at 7:41

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.