2

I have input data which I want to parse and extract values using awk/grep/sed:

group-2 9 10 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-3 0 1 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-2 28 29 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM1,toid=114,use=1,z=1
group-5 0 1 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM1,toid=114,use=1,z=1
group-2 29 30 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM2,toid=115,use=1,z=1
group-5 1 2 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM2,toid=115,use=1,z=1
group-2 10 11 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=DRD2,toid=158,use=1,z=1
group-3 1 2 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=DRD2,toid=158,use=1,z=1
group-2 11 12 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=EGF,toid=164,use=1,z=1
group-3 2 3 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=EGF,toid=164,use=1,z=1
group-2 21 22 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-3 12 13 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-2 0 1 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1A,toid=21,use=1,z=1
group-1 0 1 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1A,toid=21,use=1,z=1
group-2 1 2 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1B,toid=22,use=1,z=1
group-1 1 2 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1B,toid=22,use=1,z=1
group-2 2 3 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1D,toid=23,use=1,z=1
group-1 2 3 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1D,toid=23,use=1,z=1

Basically, I want to take the distinct values of "from=" (with its "fromid=") and of "to=" (with its "toid=").

The desired output has to be the values of "from=" and "to=" joined row-wise. from=ABCB11 is present many times, but I want it only once; likewise, each value of "to=" has to appear only once in the output.

Whether a value comes from fromid or toid, I want every row to be labelled fromid, after taking the distinct values from both. The format can be seen in the output below:

ABCB11 = fromid=4,from=ABCB11
ABCC8 = fromid=5,from=ABCC8
ACE = fromid=11,from=ACE
CHRM1 = fromid=114,from=CHRM1
CHRM2 = fromid=115,from=CHRM2
DRD2 = fromid=158,from=DRD2
EGF = fromid=164,from=EGF
ADRA1A = fromid=21,from=ADRA1A
ADRA1B = fromid=22,from=ADRA1B
ADRA1D = fromid=23,from=ADRA1D

I want to have exactly the same output as above, but I now have a new input file, shown below:

ABCB11  4   ACE 11
ABCB11  4   CHRM1   114
ABCB11  4   CHRM2   115
ABCB11  4   DRD2    158
ABCB11  4   EGF 164
ABCC8   5   ACE 11
ABCC8   5   ADRA1A  21
ABCC8   5   ADRA1B  22
ABCC8   5   ADRA1D  23
ABCC8   5   CHRM1   114

The output should again be built by taking all the unique genes.

3
  • Why is the to= in your output different from the to= in your input? Commented Apr 23, 2014 at 15:58
  • It's just the unique values that I want in my output, taking the unique values of "from" and "to" and joining them row-wise. Commented Apr 23, 2014 at 16:02
  • 1
    So can you correct your output to fit the input? Commented Apr 23, 2014 at 16:07

2 Answers

4

You could use an awk associative array indexed by the field whose values you want to be unique, e.g. for the unique values of the to= field (field $6 when split on commas):

$ awk -F, '{split($6,s,"="); arr[s[2]]=s[2]" = "$7","$6;} END{for (id in arr) print arr[id]}' data.txt
EGF = toid=164,to=EGF
ADRA1A = toid=21,to=ADRA1A
ACE = toid=11,to=ACE
ADRA1B = toid=22,to=ADRA1B
ADRA1D = toid=23,to=ADRA1D
DRD2 = toid=158,to=DRD2
CHRM1 = toid=114,to=CHRM1
CHRM2 = toid=115,to=CHRM2

The expression for the unique fromid entries is the same but replacing fields $6 and $7 with $2 and $3:

$ awk -F, '{split($2,s,"="); arr[s[2]]=s[2]" = "$3","$2;} END{for (id in arr) print arr[id]}' data.txt
ABCC8 = fromid=5,from=ABCC8
ABCB11 = fromid=4,from=ABCB11


If you want the output to contain both toid and fromid data, you can combine the expressions i.e.

awk -F, '{
split($2,s,"="); arr[s[2]]=s[2]" = "$3","$2;
split($6,s,"="); arr[s[2]]=s[2]" = "$7","$6;
} END{for (id in arr) print arr[id]}' data.txt

To change the labels (i.e. label all the fields in one table as toid, even if they come from the from= and fromid= fields), probably the most natural way is to pipe the output through sed, e.g.

$ awk -F, '{
split($2,s,"="); arr[s[2]]=s[2]" = "$3","$2;
split($6,s,"="); arr[s[2]]=s[2]" = "$7","$6;
} END{for (id in arr) print arr[id]}' data.txt | sed 's/from/to/g'
ABCC8 = toid=5,to=ABCC8
EGF = toid=164,to=EGF
ADRA1A = toid=21,to=ADRA1A
ACE = toid=11,to=ACE
ABCB11 = toid=4,to=ABCB11
ADRA1B = toid=22,to=ADRA1B
ADRA1D = toid=23,to=ADRA1D
DRD2 = toid=158,to=DRD2
CHRM1 = toid=114,to=CHRM1
CHRM2 = toid=115,to=CHRM2

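You could make the fromid <--> toid substitutions inside awk, but I think this method makes the intent clearer. The other table can then be made just by changing the final sed expression to sed 's/to/from/g' instead. Both substitutions are plain string replacements, so they assume that no gene name contains the string being replaced.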
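For completeness, here is a minimal sketch of doing the labelling inside awk instead of sed, and of applying the same idea to the four-column input from the question. The helper names to_columns and unique_genes are made up for illustration, and the sketch assumes the field positions ($2, $3, $6, $7) used above never change. Because the order of a `for (id in arr)` loop is unspecified in awk, it records the order of first appearance in separate indexed arrays so the listing comes out in a predictable order.

```sh
#!/bin/sh
# Sketch only: to_columns and unique_genes are hypothetical helper names.

# Reduce the key=value rows to "from fromid to toid", relying on the same
# comma-separated field positions ($2, $3, $6, $7) as the commands above.
to_columns() {
    awk -F, '{
        split($2, a, "="); split($3, b, "=")
        split($6, c, "="); split($7, d, "=")
        print a[2], b[2], c[2], d[2]
    }'
}

# Print each gene once, always labelled fromid/from: first the "from" genes
# in order of first appearance, then the "to" genes not already printed.
unique_genes() {
    awk '
    !($1 in from_id) { from_id[$1] = $2; from_names[++nf] = $1 }
    !($3 in to_id)   { to_id[$3]   = $4; to_names[++nt]   = $3 }
    END {
        for (i = 1; i <= nf; i++) {
            g = from_names[i]
            printf "%s = fromid=%s,from=%s\n", g, from_id[g], g
        }
        for (i = 1; i <= nt; i++) {
            g = to_names[i]
            if (!(g in from_id))
                printf "%s = fromid=%s,from=%s\n", g, to_id[g], g
        }
    }'
}

# A few rows in the original key=value format (subset of the question's data).
kv_input='group-2 9 10 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-2 28 29 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM1,toid=114,use=1,z=1
group-2 21 22 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-2 0 1 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1A,toid=21,use=1,z=1'

# The whitespace-separated four-column input from the question.
col_input='ABCB11  4   ACE 11
ABCB11  4   CHRM1   114
ABCB11  4   CHRM2   115
ABCB11  4   DRD2    158
ABCB11  4   EGF 164
ABCC8   5   ACE 11
ABCC8   5   ADRA1A  21
ABCC8   5   ADRA1B  22
ABCC8   5   ADRA1D  23
ABCC8   5   CHRM1   114'

kv_out=$(printf '%s\n' "$kv_input" | to_columns | unique_genes)
col_out=$(printf '%s\n' "$col_input" | unique_genes)

echo "From key=value rows:"
printf '%s\n' "$kv_out"
echo "From four-column rows:"
printf '%s\n' "$col_out"
```

With the four-column input, this should print the ten lines of the desired listing in the question, in the same order (ABCB11 and ABCC8 first, then ACE through ADRA1D). On a real file you would run something like `unique_genes < newfile.txt`, or `to_columns < data.txt | unique_genes` for the original format.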

10
  • It works great, but I am not getting these two values: ABCB11 = toid=4,to=ABCB11 and ABCC8 = toid=5,to=ABCC8. Commented Apr 23, 2014 at 16:24
  • Can you clarify which field (or combination of fields) you want to test, and which you want to output? The expression I gave finds unique values of field $6 which is the to= field - it can be modified but I don't understand your requirements. Commented Apr 23, 2014 at 16:41
  • You do not have toid=4 and toid=5 in your question. Commented Apr 23, 2014 at 16:41
  • @steeldriver If you look at the two outputs: irrespective of where "from" appears in the input table, I also want the distinct values present in "from" in my output. I have printed the exact two outputs that I want! I can do a find and replace to get the other output, so even if I get only one of them, it works for me. Commented Apr 23, 2014 at 16:50
  • @Ramesh I do not have toid=4 and toid=5 in my question, but I want my output like that. Please see the required output! Commented Apr 23, 2014 at 16:52
1

Assuming that the names are in a file called "filename.txt", you can try the following for the first table:

cat filename.txt | awk -F "," '{ print $2 " = " $7 "," $6}' | sed -r 's/^.{5}//'

For the second table:

cat filename.txt | awk -F "," '{ print $2 " = " $3 "," $6}' | sed -r 's/^.{5}//'

Good luck!

EDIT: For the second table:

cat filename.txt | awk -F "," '{ print $2 " = " $7 "," $6}' | sed -r 's/^.{5}//' | sed 's/toid/fromid/'

EDIT 2:

cat filename.txt | awk -F "," '{ print $2 " = " $7 "," $6}' | sed 's/^.....//' | sed 's/toid/fromid/'

The pattern contains five dots, which strips the first five characters without needing the -r option.

8
  • The OP wants unique values. So, you can use uniq at the end of the commands to get the exact output as the OP needs. Commented Apr 23, 2014 at 15:30
  • 1
    UUOC alert Commented Apr 23, 2014 at 15:38
  • The sed command after the pipe throws an error: sed: illegal option -- r Commented Apr 23, 2014 at 15:41
  • It doesn't produce the output that the OP shows. Commented Apr 23, 2014 at 15:50
  • Sorry, I thought you were filtering for different columns. How about: cat filename.txt | awk -F "," '{ print $2 " = " $7 "," $6}' | sed -r 's/^.{5}//' | sed 's/toid/fromid/' Commented Apr 23, 2014 at 15:59
