Total noob here. I have what I think is a simple enough problem, but it has completely stumped me.

I have a tab delimited data set:

NS500418:110:H2VY7BGXX:4:21601:20699:7042  chrV    8256382 True    CATCTAAATTTTGTTAGGATG   chrV    8256540 True    GAATAATAGAAGAGGTACAGA   CATCTAAATTTTGTTAGGATGTTCTTCCTCGCCTTTTCTTTCTTAATTTAAGACGTCAAAAAGCAGCATATGACAGGGATTCTGGTATTCCAATGAGATCATTTTACCAATGACGAAAAAATACGTGAGGTGTTGCAAAATGACACAAAA  GAATAATAGAAGAGGTACAGAAAACGTTTGTGACGTGAAAAATGCTAAAAGCTCAAGCAATGGGTGGTCTTCTAGAACTCTGAAGAAACTGTGTTTTGTTTTCATGATCTCGGGATGCTTCAAAAACTGAAATGGGTGTCAAAGCAGGCC  CATCTAAATTTTGTTAGGATGTTCTTCCTCGCCTTTT   GAATAATAGAAGAGGTACAGAAAACGTTTGTGACGTGA  chrV    8256416 chrV    8256566
M03109:43:000000000-ACGWU:1:1102:11826:4015 chrIII  7513608 False   TCGTTTTTTGTTCTCTAACAC   chrX    15229802    False   TTTTAAGTACTACCTAAGAACC  TTCGCATGGATGTTTGATCCGAGAATTGGAGCTATTCTTATGCCAGTTAGTTTTTTTTCGTTTTTTGTTCTCTAACAC  ATTTTGTGAAGCAATTTGGCCTTTTTTTAGTTGATCTAATTATGCGTAAACACAATTTTTAAGTACTACCTAAGAACC  GTGTTAGAGAACAAAAAACGAAAAAAAACTAACTGGCATAAGAATAGCTCCAATTCTCGGATCGATCTAATTATGCGT  GGTTCTTAGGTAGTACTTAAAAATTGTGTTTACGCATAATTAGATCGATCCGAGAATTGGAGCTATTCTTATGCCAGT  chrIII  7513540 chrX    15229776
NS500418:110:H2VY7BGXX:4:11407:17860:12911  chrX    4775576 True    GGATAGTTTTAATTTTCTTGG   chrX    16142498    True    GAGTACTGCCGCGCGATCGAT   GGATAGTTTTAATTTTCTTGGATATTTTTAAATTCCGCTTAAAAACAACATTGTTAAGTCCGTTTTCACAGTTTGGAACTTTCTGTAAAATTGAGACTGGGAAAACTTAATGAAATAAAAGAATAGGTGCTCTTTACAAATTAAAAACAA  GAGTACTGCCGCGCGATCAATGATCTCCTTTTTGTTGGAGAAAAGATTGGAGATGACGTCTAGCGCAAGCTTTTGGCTTTCCGATTCAAGTTCTTGATCTGATAGTCTGGGAGCCTTGATTGGAGCAGCTGGGACTTTTGCAGGTTGGGA  GGATAGTTTTAATTTTCTTGGATATTTTTAAATTCCG   GAGTACTGCCGCGCGATCGATCTTAGAAATTAGTTAAA  chrX    4775610 chrX    16142526
NS500418:110:H2VY7BGXX:4:13612:12507:3869   chrX    11052325    False   GGTCCAGCAAAACGCAGTAAAC  chrI    14497739    True    GTGGTGGAGGAGGAACGAATG   TACTTAACCTTTGCTCCGCGGCAAAACATGATCATTTGTTCAAATAGACAATTTCGTTTTTTCTTTGACGATCAGAGTCAATGAAGTTATCTAAGGCAATCACAAAACATTTTTGAAAAGCAGCAACAGGTCCAGCAAAACGCAGTAAAC  GTGGTGGAGGAGGAACGAATGGTTGTGGTCCGGCGAGTGGGGCCACTTGTGGCACAAAAGCTTGATGTCGGAGCAGATTTGGGGCGATCCCGTCTCGATGCTCGCCCACTCGGCAAAGGCGTTGATTCGGCTGGAACAACAAGCGTCTTC  GTTTACTGCGTTTTGCTGGACCTGTTGCAGCTTTTCAA  GTGGTGGAGGAGGAACGAATGGTTGTGGTCCGGCGAGT  chrX    11052290    chrI    14497765
NS500418:110:H2VY7BGXX:3:11604:7974:16095   chrX    7483102 False   CTAGTTCAATGAGGTATGTCAT  chrX    5875247 False   AAAAAACTGATGGTCTTATAT   CTTGGCTCAAATAAAACTGAAATCGAAAATAAAGTTTTGCATGTAAATACATTTTCAGAGTGCCTACGACTATTACCATCGAGATCGACGCGAATATAGTGTACCCTGCTTTCCTCGTTCTCGCCAACCTAGTTCAATGAGGTATGTCAT  TCACAGCCACCGGATATTCTGAGATGCTTCTTTTTTTGTTGTTGTCGTTAGATGTACAGTGCCATTCCGCATATCATTGATGTTAGGATCATCTAGCATCTACCAGAATTTTTCCTTTCTCTGAATTCTAAAAAACTGATGGTCTTATAT  ATGACATACCTCATTGAACTAGGTTGGCGAGAACGAGG  ATATAAGACCATCAGTTTTTTAGAATTCAGAGAAAGGA  chrX    7483067 chrX    5875222
NS500418:110:H2VY7BGXX:1:12207:12144:18475  chrI    11267978    True    TTTTTAGGCAGTATTCTGTGAA  chrI    7633132 True    GTTTTTAAGGTTTTCATCGAT   TTTTTAGGCAGTATTCTGTGAACTTTCCTGCATAGTTTCCACTATGATCACCATTTTTCTAGCTCTCCTGGTTCTCACTACAAGTCCTGGACAAGTCGAGGTAAGGCTGTTTAGCCTAACCGGCCCAATGGGCCCTGCTAGGCCTCACAG  GTTTTTAAGGTTTTCATCGATTTTAATTAAATTTTTATTCCAGGATGCACCAGGAAGTGAATTCAATATGCAACAGATGACATCAATGCACGACGATTCGACAACATTCACGAATCCAGTGTATGAATTAGAAGATGTTGATATGTCATC  TTTTTAGGCAGTATTCTGTGAACTTTCCTGCATAGTTT  GTTTTTAAGGTTTTCATCGATTTTAATTAAATTTTTAT  chrI    11268013    chrI    7633159
NS500418:152:H25C7AFXX:3:11408:4830:8603    chrIV   2481023 False   TGAATCATATCAGGGCAGCTG   chrIV   2542156 False   CGTTGCTTGCAGTGTTCCCTT   GAATTTAAATTTCCTAGTGAAAAATGACAAAAAATTATGTTTTTGTAAAAAATATCTCGAAAAAATGTTTTTTTTTTCTTTTTTTCACCTAAAATTTTTTTGTTTCAGAATTTTGTGGGTGTTGATCTATGAATCATATCAGGTCAGCTG  TGAAAAAAAAAATTTGCCAAAAAAGATCAAAGAGGCGCCGCCGACAGAGAAGTGCACATGAATTATATTCAGCTGGAAATTGGAAACTGAGAGAAATCTGAATAAAACATAATTTTTTTCTCTTATTTCCGTTGCTTGCAGTGTTCCCTT  CAGCTGCCCTGATATGATTCATAGAGATCAAAGAGGCGCCGCCGACAGAGAAGTGCACATGAATTATATTCAGCTGGAAATTGGAAACTGAGAGAAATCTGAATACAACATAATTTTTTTCTCTTATTTCCGTTGCTTGCAGTGTTCCCTT chrIV   2480995 chrIV   2542026

which I pass through:

gc GSM2041038_n2_adults_dpn.TSV |
    sls -Pattern '(chrIV.*chrIV.*chrIV.*chrIV)' |
    Export-Csv OnlyChrIV.tsv -Delimiter "`t"

And get (what I assume is) a tab delimited file with headers and these results:

#TYPE Selected.System.Management.Automation.PSCustomObject
"IgnoreCase"    "LineNumber"    "Line"  "Filename"  "Path"  "Pattern"   "Context"   "Matches"
"True"  "32"    "NS500418:152:H25C7AFXX:3:11408:4830:8603   chrIV   2481023  False  TGAATCATATCAGGGCAGCTG   chrIV   2542156 False   CGTTGCTTGCAGTGTTCCCTT   GAATTTAAATTTCCTAGTGAAAAATGACAAAAAATTATGTTTTTGTAAAAAATATCTCGAAAAAATGTTTTTTTTTTCTTTTTTTCACCTAAAATTTTTTTGTTTCAGAATTTTGTGGGTGTTGATCTATGAATCATATCAGGTCAGCTG  TGAAAAAAAAAATTTGCCAAAAAAGATCAAAGAGGCGCCGCCGACAGAGAAGTGCACATGAATTATATTCAGCTGGAAATTGGAAACTGAGAGAAATCTGAATAAAACATAATTTTTTTCTCTTATTTCCGTTGCTTGCAGTGTTCCCTT  CAGCTGCCCTGATATGATTCATAGAGATCAAAGAGGCGCCGCCGACAGAGAAGTGCACATGAATTATATTCAGCTGGAAATTGGAAACTGAGAGAAATCTGAATACAACATAATTTTTTTCTCTTATTTCCGTTGCTTGCAGTGTTCCCTT chrIV   2480995 chrIV   2542026"    "InputStream"   "InputStream"   "(chrIV.*chrIV.*chrIV.*chrIV)"  ""  "System.Text.RegularExpressions.Match[]"

The data which I want is in the "Line" column. So I then pass this file through this:

Import-Csv OnlyChrIV.tsv -Delimiter "`t" |
    select "line" |
    Export-Csv OnlyChrIV_OnlyLine.tsv -Delimiter "`t"

And I will get this:

#TYPE Selected.System.Management.Automation.PSCustomObject
"Line"
"NS500418:152:H25C7AFXX:3:11408:4830:8603   chrIV   2481023 False   TGAATCATATCAGGGCAGCTG   chrIV   2542156 False   CGTTGCTTGCAGTGTTCCCTT   GAATTTAAATTTCCTAGTGAAAAATGACAAAAAATTATGTTTTTGTAAAAAATATCTCGAAAAAATGTTTTTTTTTTCTTTTTTTCACCTAAAATTTTTTTGTTTCAGAATTTTGTGGGTGTTGATCTATGAATCATATCAGGTCAGCTG  TGAAAAAAAAAATTTGCCAAAAAAGATCAAAGAGGCGCCGCCGACAGAGAAGTGCACATGAATTATATTCAGCTGGAAATTGGAAACTGAGAGAAATCTGAATAAAACATAATTTTTTTCTCTTATTTCCGTTGCTTGCAGTGTTCCCTT  CAGCTGCCCTGATATGATTCATAGAGATCAAAGAGGCGCCGCCGACAGAGAAGTGCACATGAATTATATTCAGCTGGAAATTGGAAACTGAGAGAAATCTGAATACAACATAATTTTTTTCTCTTATTTCCGTTGCTTGCAGTGTTCCCTT chrIV   2480995 chrIV   2542026"

My issue is that I now can't break the string back into its original columns, which I need to do so that I can add headers and process the data further from there.

I want (which is how the data was originally formatted):

"NS500418:152:H25C7AFXX:3:11408:4830:8603" "chrIV" "2481023" "False"   "TGAATCATATCAGGGCAGCTG" "chrIV" "2542156"

Not:

"NS500418:152:H25C7AFXX:3:11408:4830:8603" 
"chrIV"
"2481023"
"False"
"TGAATCATATCAGGGCAGCTG"
"chrIV"
"2542156"

I've tried split, but this outputs a new line for every tab, as in the example above. I also don't know whether the import and export methods I'm using are the right ones here.

This also needs to be done for a series of lines. I used only one line as an example here for clarity.

1 Answer

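Don't use Select-String for filtering the data: it emits MatchInfo objects, so Export-Csv writes their properties (Line, Pattern, and so on) rather than your original columns. Use Import-Csv to import the file instead. If your file doesn't have a header line, you can specify your own headers via the -Header parameter: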

$inFile  = 'GSM2041038_n2_adults_dpn.TSV'
$outFile = 'OnlyChrIV.tsv'

$headers = 'H1', 'H2', ...

Import-Csv $inFile -Delimiter "`t" -Header $headers | Where-Object {
    $_.H2 -eq 'chrIV' -and
    $_.H6 -eq 'chrIV' -and
    $_.H14 -eq 'chrIV' -and
    $_.H16 -eq 'chrIV'
} | Export-Csv $outFile -Delimiter "`t" -NoTypeInformation
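
The header list above is abbreviated. The sample rows in the question appear to have 17 tab-separated fields, so one possible way to fill it in is to generate placeholder names H1 through H17. The sketch below does that on a tiny made-up input. The row values r1, r2, r3 and the short sequences are hypothetical stand-ins, not real reads, and ConvertFrom-Csv is used in place of Import-Csv only so the example runs without a file:

```powershell
# Placeholder column names H1..H17 (an assumption: the question's rows look 17 fields wide).
$headers = 1..17 | ForEach-Object { "H$_" }

# Hypothetical miniature input with the same column layout as the question's data.
$sample = @(
    "r1`tchrV`t100`tTrue`tAAA`tchrV`t200`tTrue`tCCC`tS1`tS2`tS3`tS4`tchrV`t110`tchrV`t210"
    "r2`tchrIV`t300`tFalse`tGGG`tchrIV`t400`tFalse`tTTT`tS1`tS2`tS3`tS4`tchrIV`t310`tchrIV`t410"
    "r3`tchrIV`t500`tTrue`tAAA`tchrX`t600`tTrue`tCCC`tS1`tS2`tS3`tS4`tchrIV`t510`tchrX`t610"
)

# ConvertFrom-Csv parses strings the way Import-Csv parses a file,
# so each row becomes an object with one property per column.
$chrIV = $sample |
    ConvertFrom-Csv -Delimiter "`t" -Header $headers |
    Where-Object {
        $_.H2  -eq 'chrIV' -and
        $_.H6  -eq 'chrIV' -and
        $_.H14 -eq 'chrIV' -and
        $_.H16 -eq 'chrIV'
    }

# Columns stay separate, so individual fields can be read by name.
$chrIV | ForEach-Object { '{0} {1} {2}' -f $_.H1, $_.H2, $_.H3 }
```

Only the row with chrIV in all four chromosome columns survives the filter. Comparing whole fields this way also avoids a regex match landing somewhere unintended in the line. If the real file has a different number of columns, the range 1..17 would need adjusting.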
  • Ah yes, I knew there was a better way to do this. This works great. Thanks so much!
    – Steve
    Commented May 23, 2016 at 16:59
