
I'm parsing a file using PowerShell 5.1 and I've almost reached my goal, but I need help with this part. The input file looks like this:

CODE=MMH4
description=beg  somedata - dduik
CODE=PPH2
description=beg Area1 - ABC=704&&DEF=03||ABC=706&&DEF=04
END
CODE=LGT6
description=beg  somedata - yyu
END
CODE=KK7
description=beg Area4 - ABC=334&&DEF=030
END

This is my current script

(Get-Content file.txt) `
      -replace '\|\|', "`r`n" `
      -replace '.*description=.*- ', "" `
      -replace '\&\&', "`t"

And I'm getting this output

CODE=MMH4
dduik
CODE=PPH2
ABC=704 DEF=03
ABC=706 DEF=04
END
CODE=LGT6
yyu
END
CODE=KK7
ABC=334 DEF=030
END

I'm new to PowerShell and I'd like to get the CODE value for each block and put that code next to each internal line of the block, like this:

CODE=MMH4
dduik  MMH4
CODE=PPH2
ABC=704 DEF=03  PPH2
ABC=706 DEF=04  PPH2
END
CODE=LGT6
yyu  LGT6
END
CODE=KK7
ABC=334 DEF=030  KK7
END

Then I want to filter the lines that contain "ABC", like this:

ABC=704 DEF=03  PPH2
ABC=706 DEF=04  PPH2
ABC=334 DEF=030  KK7

and finally get the output with ABC= and DEF= removed so that only the numbers remain, like this:

ABC DEF CODE
704 03  PPH2
706 04  PPH2
334 030  KK7

I know that to filter the desired lines I need to pipe to | Select-String -Pattern "ABC", but I don't know how to do the previous step, which is adding the code to the internal lines of each block. I hope this makes sense.

Thanks in advance

  • Are you interested in the intermediate output or just the final one? It might be easier to approach this differently if you only care about the final output.
  • Hi, I'm interested in the final output, if possible by adding a pipe after the replace commands, since this is a small example and in my actual file I have more replace commands to transform the file to look like my current output. Thanks for any help.
  • Another question: do appearances of ABC=... always come with && and then DEF=...?
  • And are there always 2 groups, ABC=...&&DEF=...||ABC=...&&DEF=..., or could there be more "OR" conditions?
  • There could be many OR conditions concatenated in a single line. I replace each one with a newline, which turns them into many lines inside each group. Each ABC is always followed by DEF. I replace the && with a tab since my final goal is to view the content in MS Excel in different columns. Thanks.

1 Answer

Perhaps the following can help you get your desired final output. It is a more manual and classic approach: it doesn't intend to solve everything with regex, but it does help with the logical conditions:

Get-Content file.txt | ForEach-Object {
    # if the line starts with CODE= followed by a code
    if ($_ -match '^CODE=(.+)') {
        # capture the code (e.g. MMH4, PPH2, etc)
        $code = $Matches[1]
        # and go to next line
        return
    }

    # if we have captured a code in previous lines
    if ($code) {
        # loop thru each match of a line containing a combination of `ABC=...&&DEF=...`
        # we use [Regex]::Matches(..) here as there could be multiple `||` conditions
        foreach ($match in [regex]::Matches($_, '(?:ABC=\d+&&DEF=\d+)+')) {
            # replace `&&` with a TAB, then append a TAB and the captured code
            "$($match.Value.Replace('&&', "`t"))`t$code"
        }

        # then we clear this variable to look for a new line starting with CODE=
        # NOTE: See the end of the answer to understand the usage of this
        $code = $null
    }
}

The above, using your sample text would produce the following output:

ABC=704 DEF=03  PPH2
ABC=706 DEF=04  PPH2
ABC=334 DEF=030 KK7

You might notice the use of $code = $null. This is a tiny optimization so that the if ($code) condition isn't entered more than once for each line starting with CODE=. It can be removed, but removing it can change the result depending on the input. Take this small sample:

CODE=KK7
description=beg Area4 - ABC=334&&DEF=030
description=beg Area4 - ABC=335&&DEF=031
END

If left as-is, the result for the sample would be:

ABC=334 DEF=030 KK7

Whereas if that line of code is removed, the logic would match both inner lines:

ABC=334 DEF=030 KK7
ABC=335 DEF=031 KK7
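
For reference, a minimal sketch of that second variant, with the reset after the first inner line removed. The sample lines are held in an array so it runs without a file. Clearing the code when the END line is reached is my own assumption about where a block closes; the original code above does not do this.

```powershell
# in-memory copy of the two-line sample above (no file needed)
$sample = @(
    'CODE=KK7'
    'description=beg Area4 - ABC=334&&DEF=030'
    'description=beg Area4 - ABC=335&&DEF=031'
    'END'
)

# start clean in case $code is left over from an earlier run in this session
$code = $null

$result = $sample | ForEach-Object {
    if ($_ -match '^CODE=(.+)') {
        # remember the code for every following line of the block
        $code = $Matches[1]
        return
    }

    if ($_ -eq 'END') {
        # assumption: END closes the block, so forget the code here
        $code = $null
        return
    }

    if ($code) {
        foreach ($match in [regex]::Matches($_, 'ABC=\d+&&DEF=\d+')) {
            "$($match.Value.Replace('&&', "`t"))`t$code"
        }
        # no `$code = $null` here, so later inner lines are matched too
    }
}

$result
```

With this input, `$result` holds two tab-separated lines, one per inner line of the block.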

EDIT

Looking at the latest edit on your question, it seems you want to convert your text into a TSV. For this I'd recommend using a function that outputs the headers in its begin block; you could technically use ForEach-Object -Begin {...} too if you like.

The code doesn't change much from what was provided above. The regex pattern now uses 2 named capturing groups to get the values of ABC= and DEF=; see here for details: https://regex101.com/r/qvGBZy/1.

function ConvertTo-Tsv {
    param(
        [Parameter(ValueFromPipeline, Mandatory)]
        [string] $Line,

        # using a parameter instead of hardcoding them in `begin {..}`
        # gives more flexibility, you can provide new headers when needed
        [Parameter(Position = 0)]
        [string[]] $Header = ('ABC', 'DEF', 'CODE')
    )

    begin {
        # output the header joined by TAB
        $Header -join "`t"
        $re = [regex]::new('(?:ABC=(?<abc>\d+)&&DEF=(?<def>\d+))+')
    }

    process {
        if ($Line -match '^CODE=(.+)') {
            $code = $Matches[1]
            return
        }

        if ($code) {
            foreach ($match in $re.Matches($Line)) {
                # output values of ABC= and DEF= with the value of CODE=
                "$($match.Groups['abc'])`t$($match.Groups['def'])`t$code"
            }

            $code = $null
        }
    }
}

Then the usage is pretty straightforward: read the file content, pipe it to our function and lastly pipe the result to a file:

Get-Content myFile.txt | ConvertTo-Tsv | Set-Content myOutput.tsv

The expected output using the function given the sample input text would be:

ABC     DEF     CODE
704     03      PPH2
706     04      PPH2
334     030     KK7

And we can tell this is a valid TSV given that a conversion from it produces the following objects:

PS> Get-Content sample.txt | ConvertTo-Tsv | ConvertFrom-Csv -Delimiter "`t"

ABC DEF CODE
--- --- ----
704 03  PPH2
706 04  PPH2
334 030 KK7
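
The ForEach-Object -Begin alternative mentioned earlier could look roughly like the sketch below. It is not a replacement for the function, just the same logic inlined. The `$lines` array is a stand-in for Get-Content output, and the header is hardcoded here, which is the flexibility you give up compared to the `-Header` parameter.

```powershell
# sample lines kept in memory; with a real file, use Get-Content instead
$lines = @(
    'CODE=MMH4'
    'description=beg  somedata - dduik'
    'CODE=PPH2'
    'description=beg Area1 - ABC=704&&DEF=03||ABC=706&&DEF=04'
    'END'
    'CODE=KK7'
    'description=beg Area4 - ABC=334&&DEF=030'
    'END'
)

$tsv = $lines | ForEach-Object -Begin {
    # runs once before the first line: emit the header and build the regex
    ('ABC', 'DEF', 'CODE') -join "`t"
    $re = [regex]::new('ABC=(?<abc>\d+)&&DEF=(?<def>\d+)')
    $code = $null
} -Process {
    if ($_ -match '^CODE=(.+)') {
        $code = $Matches[1]
        return
    }

    if ($code) {
        foreach ($match in $re.Matches($_)) {
            # values of ABC= and DEF= followed by the captured code
            "$($match.Groups['abc'].Value)`t$($match.Groups['def'].Value)`t$code"
        }
        # same reset as in the function: only the first line after CODE= is used
        $code = $null
    }
}

$tsv
```

For this input, `$tsv` contains the header line followed by three data lines, matching the function's output shown above.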

7 Comments

Thanks so much @santiago-squarzon. I was able to chain my previous script of several -replace commands to clean the file and then apply your script. I only want to know if it is possible to add 2 more replace commands after your script to remove ABC= and DEF= once everything has been processed, to get only the numbers. Something like this: (Get-Content input.txt) -replace ... | ForEach-Object {...} | -replace 'ABC=', "" -replace 'DEF=', "" | Out-File Output.txt where your script is the ForEach-Object {...} part and the part I want to add is | -replace 'ABC=', "" -replace 'DEF=', ""
Looks like it should be possible, but it's probably better to edit my current code to just remove ABC= and DEF=. Can you edit your question to show the final expected output? I'll edit my answer afterwards, when I'm near my PC.
I've edited the desired output to give you an idea. Thanks so much for your help. If adding the header is too complicated, only the numbers and codes would be fine.
Have a look at the latest edit on my answer, and let me know if you have any doubts or issues ;)
It works perfectly. I am extremely grateful for your help and time. The only thing is that it takes 21 seconds to complete, maybe because the file is larger. It doesn't matter, it does the job.
Regarding performance, this is a very easy fix: just change Get-Content to [System.IO.File]::ReadLines("C:\path\to\theFile.txt") (make sure you use an absolute path with it).
I've tried using [System.IO.File]::ReadLines("C:\path\to\theFile.txt") but it takes the same time, or 1 second more, to complete. While testing I found that the 20 seconds are spent in the first part of the script, where I use 5 replace commands and regexes to clean a very ugly file format, before calling your function. Thank you again Santiago.
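
A small sketch of the ReadLines suggestion from the comments, including the absolute-path step. The temp file name and its contents are made up for illustration; Convert-Path is one way to turn a PowerShell path into a full file system path before handing it to .NET.

```powershell
# write a tiny sample to a temp file so the sketch is self-contained
# (hypothetical file name, not from the original question)
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'readlines-sample.txt'
Set-Content -LiteralPath $path -Value 'CODE=KK7', 'description=beg Area4 - ABC=334&&DEF=030', 'END'

# .NET resolves relative paths against the process working directory,
# not against PowerShell's current location, so pass a full path
$fullPath = Convert-Path -LiteralPath $path

# ReadLines enumerates lazily, one line at a time
$codes = foreach ($line in [System.IO.File]::ReadLines($fullPath)) {
    if ($line -match '^CODE=(.+)') { $Matches[1] }
}

$codes

# clean up the temp file
Remove-Item -LiteralPath $path
```

The same enumerable can be piped straight into the function from the answer, as in `[System.IO.File]::ReadLines($fullPath) | ConvertTo-Tsv`. As the last comment notes, this only helps when reading the file is the slow part.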
