Given the files below:

file1:

7997,1
7997,2
7997,3
5114,1
5114,2

file2:

7997,52,
5114,12,
4221,52,

How can I create an array from the 1st file that has the first column as indices and the second as values to be compared with data in file2 in awk?

Something like this:

cat file1 file2 | awk -F, '{if(NF==2){arr[$1]=$2}else{if(arr[$1]){print arr[$1]","$0}}}'

Where the desired output would be:

1,2,3,7997,52
1,2,5114,12

3 Answers

2

Here's one way:

$ awk -F, -vOFS=, 'NR==FNR{a[$1]=a[$1]","$2; next} 
                   ($1 in a){print a[$1],$0}' file1 file2 | 
    sed 's/^,\(.*\),$/\1/'
1,2,3,7997,52
1,2,5114,12

Explanation

  • -F, -vOFS=, : these set the input field separator (-F) and the output field separator (OFS, the string that print inserts between its comma-separated arguments, as in print $1,$2) to a comma.

  • NR==FNR{a[$1]=a[$1]","$2; next} : FNR is the record (line) number within the current file, and NR is the total number of records read so far across all input files. When awk is given two files to read, these variables are equal only while it reads the first file (assuming that file is not empty). So the first block, NR==FNR{}, is executed only while the 1st file is being read.

    The code in this block creates the array a with the 1st field as an index. Each time the block is executed, it appends a comma and the value of the second field to whatever is stored in the array at the index $1. The next statement jumps to the next input line without running the rest of the script, so the second block is never executed for the first file.

    The first time the block runs for a given index, a[$1] is empty, so the stored value starts with an extra comma. We remove that with the sed at the end.

  • ($1 in a){print a[$1],$0} : we are now in the second file. If the 1st field of this line is an index in the array a, then print the value associated with that index in a and the current line ($0).

  • sed 's/^,\(.*\),$/\1/' : this matches the first comma of the line (^,), then uses parentheses to capture everything up to, but not including, the last comma (\(.*\),$). The whole line is then replaced with the captured text (\1). The result is that the first and last comma are removed from each line. This is needed to remove the extra comma added at the beginning of the line by the awk script and the trailing comma at the end of each line in file2. I am removing the latter because your desired output doesn't show it either.
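To see why NR==FNR is true only for the first file, here is a small sketch that prints both counters for every record. It assumes the two sample files from the question, recreated in a temporary directory; the variable names tmp and out are just for illustration.

```shell
# Recreate the sample files from the question in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' 7997,1 7997,2 7997,3 5114,1 5114,2 > "$tmp/file1"
printf '%s\n' 7997,52, 5114,12, 4221,52, > "$tmp/file2"

# Print the file name, NR and FNR for every record read.
out=$(cd "$tmp" && awk '{ print FILENAME, NR, FNR }' file1 file2)
printf '%s\n' "$out"

rm -rf "$tmp"
```

NR keeps counting up to 8 across both files, while FNR restarts at 1 on the first line of file2 (the sixth record overall prints as "file2 6 1"), so the two are equal only for the five lines of file1.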

4
  • Can a multidimensional array be useful here, without using sed?
    – Eng7
    Commented Nov 26, 2015 at 11:42
  • @Eng7 yes, they could but is the trailing comma actually in file2? Sed was just one of many ways of dealing with this.
    – terdon
    Commented Nov 26, 2015 at 12:45
  • We can ignore the comma.
    – Eng7
    Commented Nov 26, 2015 at 12:50
  • @Eng7 on second thought, multidimensional arrays won't really help here since awk doesn't really support them as such. It concatenates indices to mimic them. They'd be more trouble than they're worth here. To avoid the issue with the comma, you can use the approach in JijinP's answer, which checks whether an array element has already been defined.
    – terdon
    Commented Nov 26, 2015 at 13:30
2

You can use FNR and NR variables to achieve this.

awk -F "," '{
  if (FNR == NR) {
    if (a[$1] != "") {
      a[$1] = a[$1] "," $2
    } else {
      a[$1] = $2
    }
  } else {
    if (a[$1] != "") {
      print a[$1] "," $1 "," $2
    }
  }
}' file1 file2
1

This starts from Jijin P's perfectly good answer and tightens up the logic a bit. It was originally going to be a comment on that answer, but it became too long (and it is a valid answer in itself), so here goes:

awk 'BEGIN {
  FS = ","
  OFS = ","
}

FNR == NR {
  if ($1 in a) {
    a[$1] = a[$1] OFS $2
  } else {
    a[$1] = $2
  }
  next
}

$1 in a {
  print a[$1], $1, $2
}' file1 file2

In general it is better to use if ($x in myarray) instead of if (myarray[$x] != ""), unless you have a specific reason not to. If you just want to check whether an element of the array has been created, use the first version. If you know it has been created and want to ensure it isn't an empty string, use the second. The catch with the second is that merely referring to myarray[$x], even just to check its value, silently creates the element. This can trip you up later when you loop over the array with for (i in myarray).
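A quick way to see the difference between the two tests is to count the elements of an empty array after each one. This is only a sketch; the array name a, the key "x" and the shell variables out_in and out_ref are made up for the demonstration.

```shell
# Testing membership with "in" does not create the element.
out_in=$(awk 'BEGIN {
  if ("x" in a) print "found"
  n = 0
  for (k in a) n++
  print n
}')

# Comparing a["x"] to "" references the element, which creates it.
out_ref=$(awk 'BEGIN {
  if (a["x"] != "") print "found"
  n = 0
  for (k in a) n++
  print n
}')

printf 'in: %s, reference: %s\n' "$out_in" "$out_ref"
```

This prints "in: 0, reference: 1": after the value comparison the array holds one (empty) element, even though nothing was ever assigned to it.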

Also, joining values by hand with print var1 "," var2 "," var3 is the exact use case for which OFS (output field separator) exists: write print var1, var2, var3 instead. Setting OFS in the BEGIN block makes it quick and easy to change the output format for the whole script.
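As a small illustration of that point, the sketch below runs the same print statement twice and changes only OFS. The input line and the literal "1,2,3" stand in for a line of file2 and a stored array value; out_comma and out_semi are illustrative names.

```shell
# One sample record, shaped like a line of file2 without the trailing comma.
line='7997,52'

# Same awk program both times; only the OFS assignment differs.
out_comma=$(printf '%s\n' "$line" | awk -F, -v OFS=',' '{ print "1,2,3", $1, $2 }')
out_semi=$(printf '%s\n' "$line" | awk -F, -v OFS=';' '{ print "1,2,3", $1, $2 }')

printf '%s\n%s\n' "$out_comma" "$out_semi"
```

The first line of output is 1,2,3,7997,52 and the second is 1,2,3;7997;52. Note that OFS only goes between the arguments of print, so the commas inside the first argument are left alone.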

Lastly, when performing one action for the first file and a different action for the second and any later files, a pattern block with FNR == NR that ends in a next statement is, in my opinion, cleaner than an if/else block.
