I'm working on wrangling some data into a matrix format for a PCA analysis. I have attempted a more straightforward version with R, but pivot_wider() fails due to the massive size of this dataset (~4.5M rows).
Anyway, I do have ~450 files organized as such
HG00097 sample, haplotype 1 chromosome 1 and X
HG00097#1_chr1 D1000 15
HG00097#1_chr1 D1001 196
HG00097#1_chr1 D1002 48
HG00097#1_chr1 D1003 55
HG00097#1_chr1 D1012 38
HG00097#1_chr1 D1018 2
HG00097#1_chrX D966856 1
HG00097#1_chrX D974 1
HG00099 sample, haplotype 2 chromosome 2 and X
HG00099#2_chr2 D1000 6
HG00099#2_chr2 D1001 36
HG00099#2_chr2 D1002 5
HG00099#2_chr2 D1003 22
HG00099#2_chr2 D1012 1
HG00099#2_chr2 D1013 1
HG00099#2_chrX D963841 1
HG00099#2_chrX D974 1
HG00126 sample, haplotype 1 chromosome 3 and Y
HG00126#1_chr3 D1000 10
HG00126#1_chr3 D1001 45
HG00126#1_chr3 D1002 6
HG00126#1_chr3 D1003 18
HG00126#1_chr3 D1004 20
HG00126#1_chr3 D1005 64
HG00126#1_chrY D998 1
HG00126#1_chrY D999 1
where the first column is a unique identifier for [sample]#[hap (either 1 or 2)]_[chr (from 1 to 22 followed by either X or Y)], the second is a value, and the third how many times that value appears in a specific chromosome.
What I need to do is to organize all those files in one matrix with all individuals stacked in the first row as rownames, and those D values as headers for the cells containing the corresponding occurrences. The idea is that a D value has to be considered only once globally and for those where one sample is missing it, that has to be replaced with 0.
Based on the two samples above I should get the following:
| | D1000 | D1001 | D1002 | D1003 | D1004 | D1005 | D1012 | D1013 | D1018 | D974 | D998 | D999 | D963841 | D966856 |
|----------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|------|------|------|---------|---------|
| HG00097#1_chr1 | 15 | 196 | 48 | 55 | 0 | 0 | 38 | 0 | 2 | 0 | 0 | 0 | 0 | 0 |
| HG00099#2_chr2 | 6 | 36 | 5 | 22 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| HG00126#1_chr3 | 10 | 45 | 6 | 18 | 20 | 64 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| HG00097#1_chrX | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| HG00099#2_chrX | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| HG00126#1_chrY | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
Additional info:
- files for the two haplotypes (#1 and #2) are separated and sorted according to the
Dvalue from smallest to largest within each chromosome, and chromosomes in both files are sorted from 1 to 22 with X or Y at the end - samples can either have an XX or XY, if the first is true both haplotype 1 and 2 (#1 and #2) have an X at the end, otherwise #1 has a Y and #2 has an X
I was considering to use AWK for the task but I'm not even sure from where to start... any help is appreciated. Let me know whether additional info is needed, thanks!
awkis unlikely to fare any better.#1or#2, right? You never have bothHG...#1andHG...#2in the same file? And you don't actually want the|and-in the output, right? What should the field separators be? Tabs perhaps? Oh, and you have nochrMorMTthen?followed by either X or Y, c) any sorting requirements for the final result and d) the actualR/pivot_wider()error message(s)