3

I'm working on wrangling some data into a matrix format for a PCA analysis. I attempted a more straightforward approach in R, but pivot_wider() fails due to the massive size of this dataset (~4.5M rows).

Anyway, I have ~450 files organized as follows.

HG00097 sample, haplotype 1, chromosomes 1 and X

HG00097#1_chr1  D1000   15
HG00097#1_chr1  D1001   196
HG00097#1_chr1  D1002   48
HG00097#1_chr1  D1003   55
HG00097#1_chr1  D1012   38
HG00097#1_chr1  D1018   2
HG00097#1_chrX  D966856 1
HG00097#1_chrX  D974    1

HG00099 sample, haplotype 2, chromosomes 2 and X

HG00099#2_chr2  D1000   6
HG00099#2_chr2  D1001   36
HG00099#2_chr2  D1002   5
HG00099#2_chr2  D1003   22
HG00099#2_chr2  D1012   1
HG00099#2_chr2  D1013   1
HG00099#2_chrX  D963841 1
HG00099#2_chrX  D974    1

HG00126 sample, haplotype 1, chromosomes 3 and Y

HG00126#1_chr3  D1000   10
HG00126#1_chr3  D1001   45
HG00126#1_chr3  D1002   6
HG00126#1_chr3  D1003   18
HG00126#1_chr3  D1004   20
HG00126#1_chr3  D1005   64
HG00126#1_chrY  D998    1
HG00126#1_chrY  D999    1

where the first column is a unique identifier of the form [sample]#[hap (either 1 or 2)]_[chr (from 1 to 22, followed by either X or Y)], the second is a value, and the third is how many times that value appears on a specific chromosome.

What I need to do is combine all those files into one matrix, with the identifiers stacked in the first column as row names and the D values as column headers, each cell holding the corresponding count. Each D value has to be considered only once globally, and wherever a sample is missing a given D value the cell has to be filled with 0.

Based on the three samples above I should get the following:

|                | D1000 | D1001 | D1002 | D1003 | D1004 | D1005 | D1012 | D1013 | D1018 | D974 | D998 | D999 | D963841 | D966856 |
|----------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|------|------|------|---------|---------|
| HG00097#1_chr1 | 15    | 196   | 48    | 55    | 0     | 0     | 38    | 0     | 2     | 0    | 0    | 0    | 0       | 0       |
| HG00099#2_chr2 | 6     | 36    | 5     | 22    | 0     | 0     | 1     | 1     | 0     | 0    | 0    | 0    | 0       | 0       |
| HG00126#1_chr3 | 10    | 45    | 6     | 18    | 20    | 64    | 0     | 0     | 0     | 0    | 0    | 0    | 0       | 0       |
| HG00097#1_chrX | 0     | 0     | 0     | 0     | 0     | 0     | 0     | 0     | 0     | 1    | 0    | 0    | 0       | 1       |
| HG00099#2_chrX | 0     | 0     | 0     | 0     | 0     | 0     | 0     | 0     | 0     | 1    | 0    | 0    | 1       | 0       |
| HG00126#1_chrY | 0     | 0     | 0     | 0     | 0     | 0     | 0     | 0     | 0     | 0    | 1    | 1    | 0       | 0       |

Additional info:

  • files for the two haplotypes (#1 and #2) are separate; within each file, rows are sorted by D value from smallest to largest within each chromosome, and chromosomes are ordered from 1 to 22 with X or Y at the end
  • samples can be either XX or XY; if XX, both haplotypes (#1 and #2) have an X at the end, otherwise #1 has a Y and #2 has an X

I was considering using AWK for the task but I'm not even sure where to start... any help is appreciated. Let me know whether additional info is needed, thanks!

8
  • if R fails, awk is unlikely to fare any better. Commented Nov 24 at 20:30
  • So each file will only have either #1 or #2, right? You never have both HG...#1 and HG...#2 in the same file? And you don't actually want the | and - in the output, right? What should the field separators be? Tabs perhaps? Oh, and you have no chrM or MT then? Commented Nov 24 at 20:57
  • 1
    please update the question with a) actual file names for the samples, b) an example of an identifier that demonstrates "followed by either X or Y", c) any sorting requirements for the final result and d) the actual R/pivot_wider() error message(s) Commented Nov 25 at 16:19
  • 1
    What do you mean by 4.5M rows with 450 files? Does that mean about 10,000 lines per file so at least 10000 columns in the resulting output? Commented Nov 25 at 20:32
  • 2
    seems better to have a sample that matches your requirements. Please edit the question to add the missing requirements. Commented Nov 26 at 8:09

4 Answers

5

I'd use perl instead:

$ perl -lane '
  $f{$F[0]}->{$F[1]} = $F[2];
  $d{$F[1]} = 0;
  END {
    $, = "\t";
    @d = sort keys %d;
    print "", @d;
    for $f (sort keys %f) {
      print $f, map {$f{$f}->{$_} // 0} @d;
    }
  }' -- file1 file2 | mlr --t2p --barred cat
+----------------+-------+-------+-------+-------+-------+-------+-------+
|                | D1000 | D1001 | D1002 | D1003 | D1012 | D1013 | D1018 |
+----------------+-------+-------+-------+-------+-------+-------+-------+
| HG00097#1_chr1 | 15    | 196   | 48    | 55    | 38    | 0     | 2     |
| HG00099#2_chr2 | 6     | 36    | 5     | 22    | 1     | 1     | 0     |
+----------------+-------+-------+-------+-------+-------+-------+-------+

Or an English version:

perl -MEnglish -lane '
  # for each line of input
  ($sample, $value, $count) = @F;
  $count{$sample}->{$value} = $count;
  $all_known_values{$value} = undef;

  END {
    $OUTPUT_FIELD_SEPARATOR = "\N{CHARACTER TABULATION}";
    @all_known_values = sort keys %all_known_values;
    print "Sample", @all_known_values;
    foreach $sample (sort keys %count) {
      print $sample, map {$count{$sample}->{$ARG} // 0} @all_known_values;
    }
  }' -- file1 file2

perl generates tab-separated output; mlr is just there to pretty-print it. Change "\t" to "," and t2p to c2p for CSV instead.

file1 and file2 here are two files with the examples you gave. You'd give the full list instead.

Note that the D<n> values and sample IDs are sorted lexically, so HG00126#1_chr22 would come before HG00126#1_chr3 and D999 after D1000.

To sort alpha-numerically, you could change to:

perl -lane '
  ($sample, $value, $count) = ($F[0], substr($F[1], 1), $F[2]);
  $count{$sample}->{$value} = $count;
  $all_known_values{$value} = undef;
  END {
    $, = "\t";
    @all_known_values = sort {$a <=> $b} keys %all_known_values;
    print "Sample", map {"D$_"} @all_known_values;
    sub padchr {$_[0] =~ s/chr\K(?=\d\z)/0/r}
    foreach $sample (sort {padchr($a) cmp padchr($b)} keys %count) {
      print $sample, map {$count{$sample}->{$_} // 0} @all_known_values;
    }
  }' -- file1 file2

Here we zero-pad the chromosome number (prefixing it with 0 when it is a single digit) for the sort comparison, store the D<n> values without the D prefix, and compare them numerically with <=> instead of the default cmp.

6
  • thanks for this perl version, I'm not too familiar with it but it's quite concise; it's a bit slower than others. To answer your previous question, the 4.5M refers to the size of the concatenated input I needed for the R pivot_wider() function; each file has ~5000 or so rows. Sorry, it was an inaccuracy on my side. Commented Nov 26 at 8:00
  • @Matteo. Is that with or without the mlr post-processing? On my test, without mlr, with about 500k rows total, it's faster than the awk one (25s vs 90s), while the mlr one crashes for me. I'd have expected the awk one (which makes some assumptions on how the values are distributed between files and doesn't use up much memory) to be faster though. Commented Nov 26 at 8:34
  • without the post-processing, as I will need a conventional TAB-separated matrix. However, following Kusalananda's suggestion I'm testing mlr myself and I must say it is taking pretty much the same time as perl... P.S. I will add the X and Y cases shortly! Commented Nov 26 at 8:38
  • What's your version of perl? Mine's 5.40.1 (and gawk 5.3.1 and mawk 1.3.4 20250131 (47s with mawk)) Commented Nov 26 at 8:41
  • perl: v5.34.0, GNU awk: v5.1.0, and for mawk I cannot see the version with the --version command... Commented Nov 26 at 8:43
4

Using Miller (mlr) to read the input as whitespace-delimited fields, reshape it by using the values found in the 2nd field as column labels, with the corresponding values taken from the 3rd field:

$ cat file* | mlr --n2p --barred reshape -s 2,3 then unsparsify --fill-with 0 then label id
+----------------+-------+-------+-------+-------+-------+-------+-------+
| id             | D1000 | D1001 | D1002 | D1003 | D1012 | D1018 | D1013 |
+----------------+-------+-------+-------+-------+-------+-------+-------+
| HG00097#1_chr1 | 15    | 196   | 48    | 55    | 38    | 2     | 0     |
| HG00099#2_chr2 | 6     | 36    | 5     | 22    | 1     | 0     | 1     |
+----------------+-------+-------+-------+-------+-------+-------+-------+

Or, for CSV output:

$ cat file* | mlr --n2c reshape -s 2,3 then unsparsify --fill-with 0 then label id
id,D1000,D1001,D1002,D1003,D1012,D1018,D1013
HG00097#1_chr1,15,196,48,55,38,2,0
HG00099#2_chr2,6,36,5,22,1,0,1

Since CSV (and the pretty-print format used in the first example) requires that every record has the same number of fields, we use the unsparsify operation to add zeros to the fields that don't have values. We also use the label operation to set the first field's label to id (its label would otherwise default to 1).

If you need to sort the fields lexicographically, then do so with sort-within-records before relabeling the first field (doing it after would make the first field be sorted to the end, in this case):

mlr --n2c reshape -s 2,3 then unsparsify --fill-with 0 then sort-within-records then label id

Testing on 4.5 million lines of input (9000 unique D values), this runs in well under a minute (about 26s on my system, using about 4.6 GB of RAM).
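
As noted in the comments below, those timings were measured on simulated data. If you want to reproduce a similar benchmark, a small Python generator along these lines produces input of the same shape (this is only a sketch; the sample IDs, D-value range, and per-chromosome row counts are made-up assumptions, not the exact script used for the timing):

#!/usr/bin/env python3
# Sketch of a test-data generator: one file per sample/haplotype, rows of the
# form "<sample>#<hap>_chr<n>  D<value>  <count>".  All parameters below are
# made-up assumptions, not the ones used for the timing above.
import random

N_FILES = 450          # sample/haplotype files
N_DVALUES = 9000       # distinct D values overall
ROWS_PER_CHROM = 400   # rows per chromosome, roughly 4M rows in total
CHROMS = [str(n) for n in range(1, 23)] + ["X"]

for i in range(N_FILES):
    hap = 1 + i % 2
    with open(f"HG{i // 2:05d}_hap{hap}.txt", "w") as fh:
        for chrom in CHROMS:
            ident = f"HG{i // 2:05d}#{hap}_chr{chrom}"
            # D values are unique within each chromosome, as in the real data
            for d in sorted(random.sample(range(N_DVALUES), ROWS_PER_CHROM)):
                print(ident, f"D{d}", random.randint(1, 200), sep="\t", file=fh)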

13
  • I guess it depends on what those 4.5M rows are. If there's only a handful of distinct DXXXX values between them, then that would be fine. But if it spans the whole of D0000 to D9999, then I find it crashes with not enough memory with much fewer than 4.5M lines. Commented Nov 26 at 7:06
  • @Kusalananda thanks a lot! Unfortunately, I couldn't test since on the machine where I work there is no mlr; however, I would be really interested in testing the speed of it! Commented Nov 26 at 7:31
  • @StéphaneChazelas I just now tested with 9000 unique D-values and memory consumption rose to about 4.6 GB. Runtime at about 26s (tested with hyperfine). Commented Nov 26 at 7:33
  • 1
    @Matteo Note that I'm running on simulated data (I wrote a short Python script to generate it). Memory consumption and runtime will depend heavily on the number of unique D values and samples (HG values). Commented Nov 26 at 7:45
  • 1
    @Matteo regarding "perl settles on ~4.6GB of RAM, haven't checked awk yet but I doubt it will change drastically." - assuming the awk script you're referring to is the awk+sort solution I posted, I'd be interested to know the outcome of you checking it with the same data you used for perl and mlr. Commented Nov 26 at 13:31
3

Using any POSIX awk plus sort:

$ cat tst.sh
#!/usr/bin/env bash

awk '
    BEGIN {
        OFS = "\t"
        while ( (getline < ARGV[1]) > 0 ) {
            haps[++numHaps] = $2
            vals[$2] = $2
        }
        ARGV[1] = ""
        sample = "Sample"
    }
    sample != $1 {
        prt()
        sample = $1
    }
    { vals[$2] = $3 }
    END { prt() }

    function prt(       hapNr) {
        printf "%s", sample
        for ( hapNr=1; hapNr<=numHaps; hapNr++ ) {
            printf "%s%s", OFS, vals[haps[hapNr]]
            vals[haps[hapNr]] = 0
        }
        print ""
    }
' <(sort -u -k2,2 "$@") "$@"

$ ./tst.sh file{1..3}
Sample  D1000   D1001   D1002   D1003   D1004   D1005   D1012   D1013   D1018   D963841 D966856 D974    D998    D999
HG00097#1_chr1  15      196     48      55      0       0       38      0       2       0       0       0       0       0
HG00097#1_chrX  0       0       0       0       0       0       0       0       0       0       1       1       0       0
HG00099#2_chr2  6       36      5       22      0       0       1       1       0       0       0       0       0       0
HG00099#2_chrX  0       0       0       0       0       0       0       0       0       1       0       1       0       0
HG00126#1_chr3  10      45      6       18      20      64      0       0       0       0       0       0       0       0
HG00126#1_chrY  0       0       0       0       0       0       0       0       0       0       0       0       1       1

That reads all the unique haplotypes, plus the haplotypes+counts from just one sample at a time, into memory in awk (as opposed to reading all of the values from all of the files into memory at once), and uses sort just to sort the haplotypes uniquely (to print the header line), so it should have no problem with the amount of input you describe.

You can pipe the above tab-separated output to whatever you like to format it differently, e.g.:

$ ./tst.sh file{1..3} | column -t -s$'\t' -o' | ' -R0
        Sample | D1000 | D1001 | D1002 | D1003 | D1004 | D1005 | D1012 | D1013 | D1018 | D963841 | D966856 | D974 | D998 | D999
HG00097#1_chr1 |    15 |   196 |    48 |    55 |     0 |     0 |    38 |     0 |     2 |       0 |       0 |    0 |    0 |    0
HG00097#1_chrX |     0 |     0 |     0 |     0 |     0 |     0 |     0 |     0 |     0 |       0 |       1 |    1 |    0 |    0
HG00099#2_chr2 |     6 |    36 |     5 |    22 |     0 |     0 |     1 |     1 |     0 |       0 |       0 |    0 |    0 |    0
HG00099#2_chrX |     0 |     0 |     0 |     0 |     0 |     0 |     0 |     0 |     0 |       1 |       0 |    1 |    0 |    0
HG00126#1_chr3 |    10 |    45 |     6 |    18 |    20 |    64 |     0 |     0 |     0 |       0 |       0 |    0 |    0 |    0
HG00126#1_chrY |     0 |     0 |     0 |     0 |     0 |     0 |     0 |     0 |     0 |       0 |       0 |    0 |    1 |    1

or you could add some formatting code to the awk script if you prefer.

The above was tested using these 3 input files from the question:

$ head file{1..3}
==> file1 <==
HG00097#1_chr1  D1000   15
HG00097#1_chr1  D1001   196
HG00097#1_chr1  D1002   48
HG00097#1_chr1  D1003   55
HG00097#1_chr1  D1012   38
HG00097#1_chr1  D1018   2
HG00097#1_chrX  D966856 1
HG00097#1_chrX  D974    1

==> file2 <==
HG00099#2_chr2  D1000   6
HG00099#2_chr2  D1001   36
HG00099#2_chr2  D1002   5
HG00099#2_chr2  D1003   22
HG00099#2_chr2  D1012   1
HG00099#2_chr2  D1013   1
HG00099#2_chrX  D963841 1
HG00099#2_chrX  D974    1

==> file3 <==
HG00126#1_chr3  D1000   10
HG00126#1_chr3  D1001   45
HG00126#1_chr3  D1002   6
HG00126#1_chr3  D1003   18
HG00126#1_chr3  D1004   20
HG00126#1_chr3  D1005   64
HG00126#1_chrY  D998    1
HG00126#1_chrY  D999    1

I used getline above; if you're ever considering using it, please make sure to read http://awk.freeshell.org/AllAboutGetline first.

3
  • Mmm, that's interesting: the perl version generates an output of 10580 lines, whereas the awk one only 465. They both agree on the number of header cols (the D pattern), 330303; however, the awk script seems to capture only everything up to chr1, even though I have chromosomes in all files going up to 22 plus either X or Y. It became clear to me because I have 464 input files, and 464+1=465 is the number of lines in the awk output, which is what I saw upon visually inspecting it. Commented yesterday
  • 1
    Your original sample input only had 1 value in the first column of each input file so I wrote my script under the assumption that that was how your data was organized. In your new sample input you have multiple values in the first column in each file so I've tweaked my script to accommodate that now, please try it again. Commented yesterday
  • 1
    thanks a lot. Now both the awk and perl solutions have the same output. I also had a look at getline to better understand what the script was doing. In terms of memory consumption awk performs better than perl while keeping a comparable execution time! Commented 8 hours ago
2

"if R fails, awk is unlikely to fare any better." Your problem is that you need to pivot multiple large matrices.

"the massive size of this dataset (~4.5M rows)"

That does not sound massive to me; that's just 4.5 million integers and their position in a table? One could argue that's actually not very much data, and could fit on a good computer screen as pixels. Your computer computes dozens of screens per second, so that's not too much for it :)

Also, applying regexes through 4.5 million lines of text is not taking much time on a modern computer. So, I think you'll be fine!

It does become a lot of data, however, if you're not handling it compactly.

I'd do this as a preliminary step (because 4.5 million rows are not that much, we can afford to go through all the files in a preliminary processing pass):

  • Determine the number of columns you need. It seems to me that they are all Dxxxx with xxxx being an integer smaller than 10,000? Either you can directly infer the unique values, or you need to count them.
  • Determine the number of rows you need. That probably means putting every occurring value of your first column in a (reverse indexable) set data structure, and checking the size of that set in the end.

Then,

  • allocate a matrix with that number of rows and columns (and have it be zerofilled, which is usually almost free)
  • for each row in your input, look up the row index in your row set, and derive the column index from the D-value. Put the number from your last column at that position in your table.

This matrix is what you seem to do PCA on. Go wild!


I'd implement that in a few lines of C++ or Python, but I'd bet R also has functionality to read files row-wise, split them on a delimiter, store the first value in a set-like data structure, and so on.
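
A minimal Python sketch of that two-pass approach (an illustration of the steps above, not tested on the real data; it assumes whitespace-separated three-column files passed on the command line and uses numpy for the dense matrix):

#!/usr/bin/env python3
# Two-pass approach: first collect the unique row identifiers and D values,
# then fill a zero-initialised dense matrix with the counts.
# Assumes whitespace-separated files with exactly three columns.
import sys
import numpy as np

files = sys.argv[1:]

# Pass 1: collect the unique row identifiers (column 1) and D values (column 2).
row_ids, col_ids = set(), set()
for path in files:
    with open(path) as fh:
        for line in fh:
            ident, dval, _count = line.split()
            row_ids.add(ident)
            col_ids.add(dval)

# Map each identifier / D value to a matrix index (sorted for a stable layout).
rows = {r: i for i, r in enumerate(sorted(row_ids))}
cols = {c: j for j, c in enumerate(sorted(col_ids))}

# Pass 2: allocate the zero-filled matrix and drop each count into place.
mat = np.zeros((len(rows), len(cols)), dtype=np.int32)
for path in files:
    with open(path) as fh:
        for line in fh:
            ident, dval, count = line.split()
            mat[rows[ident], cols[dval]] = int(count)

# Write a TAB-separated matrix, ready for PCA.
print("Sample", *sorted(cols), sep="\t")
for ident, i in sorted(rows.items()):
    print(ident, *mat[i], sep="\t")

Whether the dense matrix fits in memory depends entirely on how many unique identifiers and D values there are; if the number of distinct D values is very large, a sparse matrix (e.g. scipy.sparse) would be the compact alternative.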
