I have two tab-separated files, each with two columns. I want to create a third file containing the entries whose column 1 value appears in both files. To do so, I first load file 1 into an array and then check each line of file 2 against that array. However, the array index lookup never matches. The details are below.

The first 3 lines of the files look like this:

File 1:

90001   raw acceleration data
2634    Heavy DIY
1011    Light DIY

File 2:

2634    218263
25680   44313
25681   44313

To show that there are overlaps in column 1 of the two files:

user@cluster:~> grep 90001 file2
90001   103662
user@cluster:~> grep 2634 file2
2634    218263

To create file 3, I first tried this, which produced an empty file:

awk 'BEGIN {FS = "\t"; OFS= "\t"} 
 NR==FNR {a[$1]=$2; next}
 { if($1 in a) print $1, a[$1]}' file1 file2 > file3

The following code confirmed that the problem is the array index lookup failing: with the else branch added, file 3 is simply a copy of file 2.

awk 'BEGIN {FS = "\t"; OFS= "\t"} 
 NR==FNR {a[$1]=$2; next}
 {if($1 in a) 
      print $1, a[$1]
   else 
      print $1, $2}' file1 file2 > file3

I am quite puzzled. What might have caused this, and how can I fix it? Thanks in advance.

  • What's the output of LC_ALL=C sed -n l file1 file2? (l being lowercase L, not the digit 1) Commented Mar 29, 2023 at 9:14
  • Double check that there is actually a single literal tab character between the two fields on every line in the two files. I cannot reproduce the issue locally with the files that you show (with tab as the delimiter).
    – Kusalananda
    Commented Mar 29, 2023 at 9:15
  • @StéphaneChazelas Here is the first line of the output: 90001\r\traw acceleration data$ What is \r\t doing there? Sorry, I am not familiar with sed. Thanks.
    – Xuan
    Commented Mar 29, 2023 at 11:27
  • @Kusalananda Sorry. I just edited the original post by copying and pasting the actual first 3 lines of the two files. Before the edit, I just typed in the entries to show what they look like...
    – Xuan
    Commented Mar 29, 2023 at 11:30
  • @EdMorton Oh thanks! I did not realize \r is part of the field.
    – Xuan
    Commented Mar 29, 2023 at 11:38

1 Answer

From your comment:

Here is the first line of the output 90001\r\traw acceleration data$

your first field is 90001\r, not 90001, so the index stored in the array is not the string you later look up. Either change FS = "\t" to FS = "\r?\t" so that the \r becomes part of the field separator, or add { sub(/\r/,"") } or similar to the start of your script to remove it.
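
As a minimal sketch of the first option, the snippet below rebuilds small versions of the two files and runs the original script with only FS changed. The sample rows are taken from the question; that file 1 has a \r before each tab and file 2 has none is an assumption based on the sed output in the comment, and the temporary directory is only there to keep the example self-contained.

```shell
# Scratch directory so the example does not touch real files.
tmp=$(mktemp -d)

# Assumed layout: file1 carries a carriage return before each tab, file2 does not.
printf '90001\r\traw acceleration data\n2634\r\tHeavy DIY\n1011\r\tLight DIY\n' > "$tmp/file1"
printf '2634\t218263\n25680\t44313\n90001\t103662\n' > "$tmp/file2"

# A multi-character FS is treated as a regular expression:
# an optional \r followed by a tab. OFS stays a plain tab.
awk 'BEGIN { FS = "\r?\t"; OFS = "\t" }
     NR == FNR { a[$1] = $2; next }
     $1 in a   { print $1, a[$1] }' "$tmp/file1" "$tmp/file2" > "$tmp/file3"

cat "$tmp/file3"
```

With this sample data, file3 should contain the 2634 and 90001 rows, each paired with its description from file1.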

See why-does-my-tool-output-overwrite-itself-and-how-do-i-fix-it for more info on \rs (carriage returns) in input files. They are usually at the end of lines rather than mid-line, though. Your current problem is probably the result of some earlier processing step that re-ordered the fields, or appended strings to each line of a previous version of the file, without stripping the \rs.
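
If you would rather clean the file once than work around the \r in every script, something along these lines may help. It is only a sketch: the one-line sample file and the names with_cr and clean are made up for illustration, and tr -d '\r' removes every carriage return in the file, wherever it sits on the line.

```shell
# Scratch directory with a one-line sample that has \r before the tab (assumed layout).
tmp2=$(mktemp -d)
printf '90001\r\traw acceleration data\n' > "$tmp2/with_cr"

# sed -n l prints each line unambiguously: \r and \t appear as escapes, $ marks the end.
LC_ALL=C sed -n l "$tmp2/with_cr"

# Delete every carriage return and inspect the result again.
tr -d '\r' < "$tmp2/with_cr" > "$tmp2/clean"
LC_ALL=C sed -n l "$tmp2/clean"
```

The first sed call should show the \r\t sequence from the comment, and the second should show only \t between the two fields.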

By the way, consider writing:

if($1 in a) 
      print $1, a[$1]
   else 
      print $1, $2

as a ternary expression:

print $1, ($1 in a ? a[$1] : $2)

to avoid writing quite so much code and duplicating print $1,. Also consider changing this:

FS = "\t"; OFS= "\t"

to this:

FS=OFS="\t"

for the same reason - less duplication and more concise code.
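
Putting the suggestions together, here is a rough sketch of the second script from the question using the sub() fix, the combined FS/OFS assignment and the ternary. The sample rows come from the question, the \r placement in file 1 is assumed from the sed output, and the scratch directory is hypothetical. Note that the combined assignment only suits the sub() approach: with FS = "\r?\t", OFS has to stay a plain "\t", otherwise the regular expression text would be printed literally between the output fields.

```shell
# Scratch directory with small sample files (file1 has \r before the tab, assumed).
work=$(mktemp -d)
printf '2634\r\tHeavy DIY\n1011\r\tLight DIY\n' > "$work/file1"
printf '2634\t218263\n25680\t44313\n' > "$work/file2"

awk 'BEGIN { FS = OFS = "\t" }
     { sub(/\r/, "") }                      # drop the first \r on the line; $0 is re-split
     NR == FNR { a[$1] = $2; next }         # first file: remember column 2 by column 1
     { print $1, ($1 in a ? a[$1] : $2) }   # second file: use the stored value when present
    ' "$work/file1" "$work/file2" > "$work/file3"

cat "$work/file3"
```

Here 2634 is found in the array and gets its description from file1, while 25680 falls through to its own second column.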
