index of an array is not recognized in awk

Question

I have two tab-separated files, each with two columns. I want to create a file containing the entries whose column 1 value appears in both files. To do so, I first loaded file 1 into an array, then checked each line of file 2 against that array. However, the array index is somehow not recognized. The details are below.

The first 3 lines of the files look like this:

File 1:

90001   raw acceleration data
2634    Heavy DIY
1011    Light DIY

File 2:

2634    218263
25680   44313
25681   44313

To show that there are overlaps in column 1 of the two files:

user@cluster:~> grep 90001 file2
90001   103662
user@cluster:~> grep 2634 file2
2634    218263

To create file 3, I tried this first, which yielded an empty file.

awk 'BEGIN {FS = "\t"; OFS= "\t"} 
 NR==FNR {a[$1]=$2; next}
 { if($1 in a) print $1, a[$1]}' file1 file2 > file3

The following code confirmed that the issue is that the array index was not recognized: with the else branch added, file3 ends up containing file2 unchanged.

awk 'BEGIN {FS = "\t"; OFS= "\t"} 
 NR==FNR {a[$1]=$2; next}
 {if($1 in a) 
      print $1, a[$1]
   else 
      print $1, $2}' file1 file2 > file3

I am quite puzzled. I wonder what might have caused the issue and how I can fix it? Thanks in advance.

What's the output of LC_ALL=C sed -n l file1 file2? (l being lowercase L, not the digit 1) — Stéphane Chazelas, Commented Mar 29, 2023 at 9:14
Double check that there is actually a single literal tab character between the two fields on every line in the two files. I can not reproduce the issue locally with the files that you show (with tab as the delimiter). — Kusalananda, Commented Mar 29, 2023 at 9:15
@StéphaneChazelas Here is the first line of the output: 90001\r\traw acceleration data$ I wonder what the \r\t is doing there. Sorry, I am not familiar with sed. Thanks. — Xuan, Commented Mar 29, 2023 at 11:27
@Kusalananda Sorry. I just edited the original post by copying and pasting the actual first 3 lines of the two files. Before the edit, I just typed in the entries to show what they look like... — Xuan, Commented Mar 29, 2023 at 11:30
@EdMorton Oh thanks! I did not realize \r is part of the field. — Xuan, Commented Mar 29, 2023 at 11:38

Ed Morton · Accepted Answer · 2023-03-29 11:51:15Z

From your comment:

Here is the first line of the output 90001\r\traw acceleration data$

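your first field is 90001\r, not 90001. A key such as 90001\r stored from file1 never equals the 90001 read from file2, so the `in` test always fails. Either change FS = "\t" to FS = "\r?\t" to accommodate that \r in the input, or add { sub(/\r/,"") } or similar to the start of your script to remove it.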
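A minimal sketch of the first option (the FS change), run on made-up sample files. It assumes every line of file1 has the stray \r before the tab, which the question only confirms for the first line; the temporary directory and the 90001 row placed in file2 are just for illustration.

```shell
# Build small sample inputs in a scratch directory (hypothetical data layout).
tmp=$(mktemp -d)
# file1: a carriage return sits just before the tab (assumed on every line).
printf '90001\r\traw acceleration data\n2634\r\tHeavy DIY\n1011\r\tLight DIY\n' > "$tmp/file1"
# file2: plain tab-separated lines.
printf '2634\t218263\n25680\t44313\n90001\t103662\n' > "$tmp/file2"

# A multi-character FS is treated as a regexp: an optional \r followed by a tab.
awk 'BEGIN { FS = "\r?\t"; OFS = "\t" }
     NR == FNR { a[$1] = $2; next }      # first file: remember column 2 by column 1
     $1 in a   { print $1, a[$1] }       # second file: print only the overlaps
    ' "$tmp/file1" "$tmp/file2" > "$tmp/file3"

cat "$tmp/file3"
# Prints (fields separated by a tab):
# 2634    Heavy DIY
# 90001   raw acceleration data
```

Because the \r is consumed as part of the separator, the keys stored from file1 are the bare numbers and the lookup from file2 succeeds.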

See why-does-my-tool-output-overwrite-itself-and-how-do-i-fix-it for more info on \rs (Carriage Returns) in input files. They're usually at the end of lines rather than mid-line, though. Your current problem is probably the result of some earlier step that re-ordered the fields, or appended strings to each line of a previous version of the file, without stripping the \rs.
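As a rough illustration, one way to check whether a file contains carriage returns anywhere on a line, and to write a cleaned copy, is sketched here. The file names sample and clean are hypothetical, and gsub is used instead of sub on the assumption that a line might hold more than one \r.

```shell
tmp=$(mktemp -d)
# One line with a mid-line \r, one line without (made-up sample).
printf '90001\r\traw acceleration data\n2634\tHeavy DIY\n' > "$tmp/sample"

# Count the lines that contain a carriage return anywhere.
awk '/\r/ { n++ } END { print n + 0 }' "$tmp/sample"    # prints 1

# Write a copy with every carriage return removed.
awk '{ gsub(/\r/, "") } 1' "$tmp/sample" > "$tmp/clean"

# The cleaned copy should report no affected lines.
awk '/\r/ { n++ } END { print n + 0 }' "$tmp/clean"     # prints 0
```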

By the way, consider writing:

if($1 in a) 
      print $1, a[$1]
   else 
      print $1, $2

as a ternary expression:

print $1, ($1 in a ? a[$1] : $2)

to avoid writing quite so much code and duplicating print $1,. Also consider changing this:

FS = "\t"; OFS= "\t"

to this:

FS=OFS="\t"

for the same reason - less duplication and more concise code.
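Putting these suggestions together with the sub() fix, the second script from the question might look like the sketch below. The sample files are made up for illustration, and it assumes at most one \r per line (otherwise gsub would be needed).

```shell
tmp=$(mktemp -d)
printf '90001\r\traw acceleration data\n2634\r\tHeavy DIY\n1011\r\tLight DIY\n' > "$tmp/file1"
printf '2634\t218263\n25680\t44313\n90001\t103662\n' > "$tmp/file2"

awk 'BEGIN { FS = OFS = "\t" }
     { sub(/\r/, "") }                   # drop the stray carriage return, if any
     NR == FNR { a[$1] = $2; next }      # first file: remember column 2 by column 1
     { print $1, (($1 in a) ? a[$1] : $2) }   # overlap: text from file1, else column 2 of file2
    ' "$tmp/file1" "$tmp/file2" > "$tmp/file3"

cat "$tmp/file3"
# Prints (fields separated by a tab):
# 2634    Heavy DIY
# 25680   44313
# 90001   raw acceleration data
```

Modifying $0 with sub() makes awk re-split the record, so $1 no longer carries the \r by the time it is used as an array key.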
