
I'm looking for a way to strip all comments from a file. There are various ways to do comments, but I'm only interested in the simple # form comments. Reason is that I only use <# #> for in-function .SYNOPSIS which is functional code as opposed to just a comment, so I want to keep those.

EDIT: I have updated this question using the helpful answers below.

So there are only a couple of scenarios that I need:

a) whole-line comments with # at the start of the line (possibly preceded by whitespace), i.e. a regex of ^\s*# seems to work.

b) with some code at the start of a line, then a comment at the end of the line. I want to avoid stripping lines that have e.g. Write-Host "#####" but I think this is covered in the code that I have.

I was able to remove end-of-line comments with a split, as I couldn't work out how to do it with regex. Does anyone know a way to achieve that with regex?

The split was not ideal, as a <# on a line would be removed by the -split, but I've fixed that by splitting on " #". This is not perfect but might be good enough. Maybe a more reliable way exists with regex?

When I run the code below against my 7,000-line script, it works(!) and strips a huge amount of comments, BUT the output file almost doubles in size(!?), from 400 KB to about 700 KB. Does anyone understand why that happens and how to prevent it? (Is it something to do with BOMs or Unicode or things like that? Out-File seems to really balloon the file size!)

$x = Get-Content ".\myscript.ps1"   # $x is an array, not a string
$out = ".\myscript.ps1"
$x = $x -split "[\r\n]+"               # Remove all consecutive line-breaks, in any format '-split "\r?\n|\r"' would just do line by line
$x = $x | ? { $_ -notmatch "^\s*$" }   # Remove empty lines
$x = $x | ? { $_ -notmatch "^\s*#" }   # Remove all lines starting with # including with whitespace before
$x = $x | % { ($_ -split " #")[0] }    # Remove end of line comments
$x = ($x -replace $regex).Trim()       # Remove whitespace only at start and end of line
$x | Out-File $out
# $x | more
  • In Windows PowerShell, I believe Out-File defaults to UTF-16LE encoding, probably with a BOM. You could try Set-Content instead, which defaults to ANSI encoding. With either command, you can use the -Encoding parameter. The UTF8 encoding will have a BOM with those commands. Commented Apr 3, 2020 at 18:13
  • This is great to know, thanks. As soon as I used Set-Content, the file size went down to 220k instead of 700k. I've never gotten my head around these various encodings and why some are so bloated ... Thanks. Commented Apr 3, 2020 at 20:31
  • Where's your $regex defined? Commented Dec 8, 2022 at 22:54

6 Answers

7

Honestly, the best approach to identify and process all comments is to use PowerShell's language parser or one of the Ast classes. I apologize that I don't know which Ast contains comments, so this is an uglier way that filters out block and line comments.

$code = Get-Content file.txt -Raw
$comments = [System.Management.Automation.PSParser]::Tokenize($code,[ref]$null) |
    Where Type -eq 'Comment' | Select -Expand Content
$regex = ( $comments |% { [regex]::Escape($_) } ) -join '|'

# Output to remove all empty lines
$code -replace $regex -split '\r?\n' -notmatch '^\s*$'

# Output that Removes only Beginning and Ending Blank Lines
($code -replace $regex).Trim()

1 Comment

This is amazing. As often with your answers, it's a level way above things that I knew existed. Ast classes are some kind of 'dark art magic' to me. I see that your method here would definitely be the very best approach for a comprehensive solution. I think my need currently is a bit simpler though, so I've updated my main question (and used your removal of empty lines and line trim beginning/end, that is very useful thanks), if you have some solutions on the revised points in there that would be appreciated.
1

Do the inverse of your example: Only emit lines that do NOT match:

## Output to console
Get-Content .\file.ps1 | Where-Object { $_ -notmatch '#' }

## Output to file
Get-Content .\file.ps1 | Where-Object { $_ -notmatch '#' } | Out-file .\newfile.ps1 -Append

1 Comment

Ideal, I've updated my question using your answer here as a basis, it's led me to a few other refinements that maybe you have an idea on how to solve?
1

Based on @AdminOfThings' helpful answer, using the PowerShell parser approach (the PSParser tokenizer rather than the AST classes) but avoiding any regular expressions:

$Code = $Code.ToString() # Prepare any ScriptBlock for the substring method
$Tokens = [System.Management.Automation.PSParser]::Tokenize($Code, [ref]$null)
-Join $Tokens.Where{ $_.Type -ne 'Comment' }.ForEach{ $Code.Substring($_.Start, $_.Length) }
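
As a rough illustration of what the tokenizer reports, the sketch below applies the same approach to a made-up one-liner. The sample string and the $types and $result variables are assumptions for demonstration only. Comment is one of the values of the PSTokenType enumeration, and it covers both # line comments and <# #> block comments.

```powershell
# Hypothetical sample input: one statement followed by an end-of-line comment.
$Code = '$x = 1 # note'
$Tokens = [System.Management.Automation.PSParser]::Tokenize($Code, [ref]$null)

# List the type that the tokenizer assigned to each token.
$types = $Tokens.ForEach{ $_.Type.ToString() }
$types -join ', '

# Re-join everything except comments, as in the snippet above.
$result = -join $Tokens.Where{ $_.Type -ne 'Comment' }.ForEach{ $Code.Substring($_.Start, $_.Length) }
$result
```

Whitespace between tokens is not part of any token, so the re-joined text contains no spaces. That is harmless for this sample, but it can matter for other input.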

1 Comment

It would be great if you could also provide an example + output using your code snippet. Especially since using this AST magic is much harder to read/interpret than a typical RE. (E.g. What is a type Comment and what's included in that?)
1

As for the incidental problem of the size of the output file being roughly double that of the input file:

  • As AdminOfThings points out, Out-File in Windows PowerShell defaults to UTF-16LE ("Unicode") encoding, where characters are represented by (at least) two bytes, whereas ANSI encoding, as used by Set-Content in Windows PowerShell by default, encodes all (supported) characters in a single byte. Similarly, UTF-8-encoded files use only one byte for characters in the ASCII range (note that PowerShell (Core) 7+ now consistently defaults to (BOM-less) UTF-8). Use the -Encoding parameter as needed.
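
To see the size difference for yourself, here is a minimal sketch that writes the same text with two explicit encodings and compares the byte counts. The sample text and the temporary file names are assumptions, and explicit -Encoding values are used so that the result does not depend on the edition's default.

```powershell
# Hypothetical sample content and temporary file names.
$text  = 'Write-Host "hello"'
$base  = Join-Path ([System.IO.Path]::GetTempPath()) ([System.IO.Path]::GetRandomFileName())
$utf16 = "$base.utf16.txt"
$ascii = "$base.ascii.txt"

# Write the same content with two explicit encodings.
$text | Out-File -FilePath $utf16 -Encoding Unicode   # UTF-16LE
$text | Out-File -FilePath $ascii -Encoding ascii

# Compare the resulting sizes in bytes.
$utf16Size = (Get-Item -LiteralPath $utf16).Length
$asciiSize = (Get-Item -LiteralPath $ascii).Length
"UTF-16LE: $utf16Size bytes; ASCII: $asciiSize bytes"

# Clean up the temporary files.
Remove-Item -LiteralPath $utf16, $ascii
```

For text in the ASCII range, the UTF-16LE file should come out at roughly twice the size of the single-byte one, which matches the growth described in the question.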

A regex-based solution to your problem is never fully robust, even if you try to limit the comment removal to single-line comments.

For full robustness, you must indeed use PowerShell's language parser, as noted in the other answers.
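
A small sketch of the difference, using a made-up line whose string literal happens to contain " #". The sample line and the variable names are assumptions; the tokenizer call is the same one used in the other answers.

```powershell
# Hypothetical line: the string literal contains " #", followed by a real comment.
$line = 'Write-Host "Issue #42" # print the issue number'

# Naive approach from the question: cut at the first " #".
$naive = ($line -split ' #')[0]

# Tokenizer approach: only genuine comment tokens are reported.
$tokens  = [System.Management.Automation.PSParser]::Tokenize($line, [ref]$null)
$comment = $tokens | Where-Object { $_.Type -eq 'Comment' }
$parsed  = $line.Remove($comment.Start, $comment.Length).TrimEnd()

$naive    # the string literal is cut in half
$parsed   # only the trailing comment is gone
```

The split cannot know that the first # sits inside a string, whereas the tokenizer reports it as part of the String token and flags only the trailing text as a Comment.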

However, care must be taken when reconstructing the original source code with the comments removed:

  • AdminOfThings's answer risks removing too much, given the subsequent global regex-based processing with -replace: while the scenario may be unlikely, if a comment is repeated inside a string, it would mistakenly be removed from there too.

  • iRon's answer risks syntax errors by joining the tokens without spaces, so that . .\foo.ps1 would turn into ..\foo.ps1, for instance. Blindly putting a space between tokens is not an option, because the property-access syntax would break (e.g. $host.Name would turn into $host . Name, but whitespace between a value and the . operator isn't allowed).
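
Both reconstruction pitfalls can be reproduced with a short sketch. The two sample strings are the ones mentioned above, and the variable names are assumptions for illustration.

```powershell
# Case 1: joining tokens without whitespace breaks dot-sourcing.
$code1   = '. .\foo.ps1'
$tokens1 = [System.Management.Automation.PSParser]::Tokenize($code1, [ref]$null)
$joined  = -join $tokens1.ForEach{ $code1.Substring($_.Start, $_.Length) }

# Case 2: blindly inserting a space between tokens breaks property access.
$code2   = '$host.Name'
$tokens2 = [System.Management.Automation.PSParser]::Tokenize($code2, [ref]$null)
$spaced  = ($tokens2.ForEach{ $code2.Substring($_.Start, $_.Length) }) -join ' '

$joined
$spaced
```

Neither joining rule works for all input, which is why a rebuild has to take the original token positions into account.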

The following solution avoids these problems, while trying to preserve the formatting of the original code as much as possible, but this has limitations, because intra-line whitespace isn't reported by the parser:

  • This means that you can't tell whether whitespace between tokens on a given line is made up of tabs, spaces, or a mix of both. The solution below replaces any tab characters with 2 spaces before processing; adjust as needed.

  • To somewhat compensate for the removal of comments occupying their own line(s), two or more consecutive blank or empty lines are folded into a single empty one. It is possible to remove blank/empty lines altogether, but that could hurt readability.

# Tokenize the file content.
# Note that tabs, if any, are replaced by 2 spaces first; adjust as needed.
$tokens = $null
$null = [System.Management.Automation.Language.Parser]::ParseInput(    
  ((Get-Content -Raw .\myscript.ps1) -replace '\t', '  '), 
  [ref] $tokens,
  [ref] $null
)  

# Loop over all tokens while omitting comments, and rebuild the source code 
# without them, trying to preserve the original formatting as much as possible.
$sb = [System.Text.StringBuilder]::new() 
$prevExtent = $null; $numConsecNewlines = 0
$tokens.
  Where({ $_.Kind -ne 'Comment' }).
  ForEach({ 
    $startColumn = if ($_.Extent.StartLineNumber -eq $prevExtent.StartLineNumber) { $prevExtent.EndColumnNumber }
                   else { 1 }
    if ($_.Kind -eq 'NewLine') {
      # Fold multiple blank or empty lines into a single empty one.
      if (++$numConsecNewlines -ge 3) { return }
    } else {
      $numConsecNewlines = 0
      $null = $sb.Append(' ' * ($_.Extent.StartColumnNumber - $startColumn))
    }
    $null = $sb.Append($_.Text)
    $prevExtent = $_.Extent
  })

# Output the result.
# Pipe to Set-Content as needed.
$sb.ToString()
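
To make the Kind values used above less opaque, here is a minimal sketch that prints the kind and text of each token for a made-up one-liner and then lists the names defined by the TokenKind enumeration. The sample input and the $sample, $errors, and $allKinds variables are assumptions for illustration.

```powershell
# Hypothetical sample input.
$sample = '$x = 1 # note'

$tokens = $null
$errors = $null
$null = [System.Management.Automation.Language.Parser]::ParseInput(
  $sample,
  [ref] $tokens,
  [ref] $errors
)

# Each token has a Kind (a TokenKind value) and an Extent describing its position.
$tokens | ForEach-Object { '{0,-12} {1}' -f $_.Kind, $_.Text }

# All names that the Kind property can take, including Comment and NewLine.
$allKinds = [enum]::GetNames([System.Management.Automation.Language.TokenKind])
$allKinds -match 'Comment|NewLine'
```

Both # line comments and <# #> block comments are reported with the Comment kind, which is why the filter above needs only a single comparison.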

2 Comments

This code is crazy to read/understand. I don't see how this could be better than using a RegEx, especially when dealing with simple EOL comments like blah bla code # my comment. (People still keep on saying that RE is not robust enough, when the thing that is not robust is people's failure to use it properly and to take its greediness into account.) It would have been great to see some more detailed comments for that code. And where are those type keywords, such as Comment, defined and listed?
@not2qubit, the reason this is better than a regex is stated in the answer. Yes, if you're only dealing with simple scenarios, the solution in this answer may not be worth the effort. No, it isn't just improper use of regexes; they're not sophisticated enough to parse languages such as PowerShell. If you follow the language-parser link in the answer, you should be able to find your way toward the TokenKind enumeration.
1

My take on the problem, minimizing use of regex (on comment matching/replacement part at least):

# $code = ...get string contents of file or script block...
$commentTokens = [System.Management.Automation.PSParser]::Tokenize($code, [ref]$null) |
    Where Type -eq 'Comment'

$newCode = $code

# any unique token that we will replace at the end (could generate guid if we want to)
$newComment = "<#~xRMx~#>"
# $newComment = "<#{0}#>" -f [Guid]::NewGuid()

$overlapSize = 0

# Normalize all comments to a known comment value `$newComment`
$commentTokens | foreach {
    # adjust starting position based on previous replacement overlap sizes
    $start = $_.Start + $overlapSize 

    $newCode = $newCode.Remove($start, $_.Length).Insert($start, $newComment)

    $overlapSize += ($newComment.Length - $_.Length) # calculate overlap sizes
}

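# Remove the placeholder; escape it so that -replace matches it literally
$newCode = $newCode -replace [regex]::Escape($newComment), ''

# Uncomment to also remove blank lines
# $newCode = ($newCode -split '\r?\n' -notmatch '^\s*$') -join "`n"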

$newCode

In my testing, this handles various comment configurations (single-line, multi-line, partial-line). Check out this replit project. You can fork it and play around with it.

Example input & output:

$ScriptBlock = {
    ## 1234
    <# idk #># hello
    # test
    $test = 1234; # testing abc
    <# line1
    line2
    line3
    #>
    Write-Host "Hello World"
}

...

# Output:
<#
    
    
    
    $test = 1234; 
    
    Write-Host "Hello World"

#>

# Output (w/ removing blank lines):
<#
    $test = 1234; 
    Write-Host "Hello World"
#>


0

I tried tweaking AdminOfThings' answer with regex anchors,

$regExp1 = '(?m)^(\s*)(' + ($comments.ForEach({ [regex]::Escape($_) }) -join '|') + ')\s*?'
$regExp2 = '(?m)\s*(' + ($comments.ForEach({ [regex]::Escape($_) }) -join '|') + ')$'

# Delete comments
$code = $code -creplace $regExp1, '$1'
$code = $code -creplace $regExp2

but inline block comments proved to be problematic.

This solution (inspired by mklement0) makes a copy of the code, replacing the comments with a distinguishable sequence that can then be removed:

$outStr = ''
$tokens = [System.Management.Automation.PSParser]::Tokenize($code, [ref]$null)
$tokens.
    ForEach({
        # Restore whitespace discarded by parser
        if ($outStr.Length -lt $_.Start) {
            $outStr += $code.Substring($outStr.Length, $_.Start - $outStr.Length)
        }
        # Replace comments with distinguishable sequence
        if ($_.Type -eq 'Comment') {
            $outStr += ([char]0x00).ToString() * $_.Length
        } else {
            $outStr += $code.Substring($_.Start, $_.Length)
        }
    })

# Delete replacement sequence
$outStr = $outStr -replace ' ?\x00+ ?', ' '

# Delete trailing whitespace and blank lines
$outStr = ($outStr -split '\r?\n' -notmatch '^\s*$').TrimEnd() -join "`r`n"

Example input:

$code = @'
<#
    .DESCRIPTION
    Demonstrates PowerShell's different comment styles.
#>
param (
    [string] $Param1, # End-of-line comment
    <# Inline block comment #> $Param2
)

$var = 1, <# Inline block comment #> 2, 2

# Single-line comment.
# Another single-line comment.
$var.Where(
    <# Arg1 note #> { $_ -eq 2 },
    <# Arg2 note #> 'First',
    <# Arg3 note #> 1
)

. .\foo.ps1
$host.Name
$str1 = "<# Inline block comment #>"
$str2 = '# Single-line comment.'
'@

Output:

param (
    [string] $Param1,
    $Param2
)
$var = 1, 2, 2
$var.Where(
    { $_ -eq 2 },
    'First',
    1
)
. .\foo.ps1
$host.Name
$str1 = "<# Inline block comment #>"
$str2 = '# Single-line comment.'

This slightly faster alternative uses a StringBuilder object instead:

$sb = [System.Text.StringBuilder]::new()
$tokens = [System.Management.Automation.PSParser]::Tokenize($code, [ref]$null)
$tokens.
    ForEach({
        # Restore whitespace discarded by parser
        if ($sb.Length -lt $_.Start) {
            $null = $sb.Append($code.Substring($sb.Length, $_.Start - $sb.Length))
        }
        # Replace comments with distinguishable sequence
        if ($_.Type -eq 'Comment') {
            $null = $sb.Append(([char]0x00).ToString() * $_.Length)
        } else {
            $null = $sb.Append($code.Substring($_.Start, $_.Length))
        }
    })
$outStr = $sb.ToString()

# Delete replacement sequence
$outStr = $outStr -replace ' ?\x00+ ?', ' '

# Delete trailing whitespace and blank lines
$outStr = ($outStr -split '\r?\n' -notmatch '^\s*$').TrimEnd() -join "`r`n"
