
My input file data:

"Per Sara Porras.|, LLC"|column2_data|column3_data
column1_data|"column2|data"|"column3|data"

Required output:

"Per Sara Porras.@@@, LLC"|column2_data|column3_data
column1_data|"column2@@@data"|"column3@@@data"

I need this so I can do the required data sorting using IFS='|'; currently the data is getting split incorrectly because of the | characters present inside double quotes. Once my data sorting is done, I need the @@@ inside double quotes to be replaced by | again.

After sorting the data alphabetically, my file should look like this:

column2_data|column3_data|"Per Sara Porras.|, LLC"
column1_data|"column2|data"|"column3|data"

I tried multiple awk and sed commands, but they didn't work out.

  • Do you have (or can you install) Miller? Commented Feb 19 at 21:28
  • This sounds a lot like you should simply not do the sorting in your shell, but rather with a tool that can actually deal with tabular data. Miller, as mentioned by steeldriver, is a good choice. But usually, people who have to deal with CSV data later process it in some other programming language: it's probably wisest to stick to that language (unless it's just shell programming, which is really not well equipped for this kind of task). Commented Feb 19 at 21:40
  • You might want to read Why is using a shell loop to process text considered bad practice?. Please don't try to do this using IFS in the shell; that is just a very bad tool for this kind of thing.
    – terdon
    Commented Feb 19 at 21:51
  • How performant does this need to be? Theoretically, you could read the file a character at a time, counting the number of " marks seen so far on the line. An odd number means you're inside a quoted field; an even number (or zero) means you're not.
    – spuck
    Commented Feb 20 at 23:13
  • Greetings, based on the last code block, do you really want column2_data to end up in the first column? Perhaps you're trying to sort rows by the first column instead? Commented Feb 22 at 21:50

5 Answers


Miller (mlr) is a tool for working with structured data such as CSV (and other formats). It understands the quoting rules of CSV and has no issues working with the data presented in the question, which happens to be properly formatted CSV (albeit without field headers).

Using Miller to iterate over each pipe-delimited field of the headerless CSV data, replacing each pipe with three at-characters:

$ mlr --csv -N --fs pipe put 'for (k,v in $*) { $[k] = gssub(v,"|","@@@") }' file
Per Sara Porras.@@@, LLC|column2_data|column3_data
column1_data|column2@@@data|column3@@@data

I opted for gssub() to do the substitution. It works like gsub(), but interprets its second argument (the pattern) as a literal text string rather than as a regular expression.
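
As a quick sanity check of that difference in isolation, something like this should work (a sketch, assuming Miller 6's -n flag, which suppresses input, and the DSL's print statement):

mlr -n put 'end { print gssub("a|b|c", "|", "@@@") }'    # should print a@@@b@@@c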

Note that Miller will not quote fields in the output that do not need quoting.
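
If you do want every output field quoted regardless, Miller's CSV writer has a --quote-all output option (at least in Miller 6; check mlr --help on your version), e.g.:

mlr --csv -N --fs pipe --quote-all put 'for (k,v in $*) { $[k] = gssub(v,"|","@@@") }' file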

You could do it in a three-stage pipeline too: replace the pipe delimiters with the more commonly used comma, use sed to replace every remaining pipe with @@@, and then convert back to the original pipe delimiters:

$ mlr --csv -N --ifs pipe cat file | sed 's/|/@@@/g' | mlr --csv -N --ofs pipe cat
Per Sara Porras.@@@, LLC|column2_data|column3_data
column1_data|column2@@@data|column3@@@data

... but if all you want to do is to sort the fields within each record, I would stay with a single Miller invocation:

$ mlr --csv -N --fs pipe put 'sorted = sort(get_values($*), "c"); for (k,v in sorted) { $[k] = v }' file
column2_data|column3_data|"Per Sara Porras.|, LLC"
column1_data|"column2|data"|"column3|data"

This sorts the values of each record in turn (case-insensitively; see the sort() function docs), reassigning the sorted values to the fields of the record before continuing with the next record.

A tiny bit more compact:

$ mlr --csv -N --fs pipe put 'for (k,v in sort(get_values($*), "c")) { $[k] = v }' file
column2_data|column3_data|"Per Sara Porras.|, LLC"
column1_data|"column2|data"|"column3|data"

I’m taking this question to mean:

  1. How can I replace all occurrences of a delimiter like | when it appears inside quotes...
  2. with a placeholder that is arbitrary, temporary, and harmless (such as @@@)...
  3. so that I can then process the table and not worry about quoted delimiters...
  4. and then change the placeholder back to the delimiter?

As long as you don’t actually need @@@ as a placeholder, this is a job for csvquote:

csvquote -d'|' input.txt | (your code here) | csvquote -d'|' -u
  • The first invocation of csvquote encodes the input with the non-printing unit separator (US) character as a placeholder.
  • The second invocation (with the -u flag) puts the | characters back where they were.
  • The -d'|' argument sets the delimiter to | instead of ,.
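
For example, if the intermediate step were simply a row sort on the first field (one possible reading of the question), the round trip might look like this (a sketch; the actual middle command depends on the sorting you need):

csvquote -d'|' input.txt | sort -t'|' -k1,1 | csvquote -d'|' -u
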
Encoding with a placeholder

Replace quoted instances of | with the US character and save the result in input_encoded.txt:

csvquote -d'|' input.txt > input_encoded.txt

The US character is 0x1F in hex, or 037 in octal. We can see what csvquote is doing using od -c:

$ od -c input.txt     # original
0000000   "   P   e   r       S   a   r   a       P   o   r   r   a   s
0000020   .   |   ,       L   L   C   "   |   c   o   l   u   m   n   2
0000040   _   d   a   t   a   |   c   o   l   u   m   n   3   _   d   a
0000060   t   a  \n   c   o   l   u   m   n   1   _   d   a   t   a   |
0000100   "   c   o   l   u   m   n   2   |   d   a   t   a   "   |   "
0000120   c   o   l   u   m   n   3   |   d   a   t   a   "  \n
0000136

$ csvquote -d'|' input.txt | od -c  # encoded
0000000   "   P   e   r       S   a   r   a       P   o   r   r   a   s
0000020   . 037   ,       L   L   C   "   |   c   o   l   u   m   n   2
0000040   _   d   a   t   a   |   c   o   l   u   m   n   3   _   d   a
0000060   t   a  \n   c   o   l   u   m   n   1   _   d   a   t   a   |
0000100   "   c   o   l   u   m   n   2 037   d   a   t   a   "   |   "
0000120   c   o   l   u   m   n   3 037   d   a   t   a   "  \n
0000136

Incidentally, quoted record separators (which are newlines, by default) get replaced with the non-printing Record Separator character, which is 0x1E in hex and 036 in octal.
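
You can see the same thing with a quoted field that contains an embedded newline (a quick sketch; the 036 byte should show up where the quoted newline was):

printf '"a\nb"|c\n' | csvquote -d'|' | od -c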

Decoding

Change the quoted US characters back to |:

csvquote -d'|' -u input_encoded.txt

As we can see, the file is restored:

$ csvquote -d'|' -u input_encoded.txt | od -c
0000000   "   P   e   r       S   a   r   a       P   o   r   r   a   s
0000020   .   |   ,       L   L   C   "   |   c   o   l   u   m   n   2
0000040   _   d   a   t   a   |   c   o   l   u   m   n   3   _   d   a
0000060   t   a  \n   c   o   l   u   m   n   1   _   d   a   t   a   |
0000100   "   c   o   l   u   m   n   2   |   d   a   t   a   "   |   "
0000120   c   o   l   u   m   n   3   |   d   a   t   a   "  \n
0000136
Configuring the placeholders

These placeholders (US and RS) are hardcoded, but if you really need a different placeholder, you can edit csvquote.c in the source code, change these lines, and compile it yourself:

#define NON_PRINTING_FIELD_SEPARATOR 0x1F
#define NON_PRINTING_RECORD_SEPARATOR 0x1E
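
Rebuilding is then the usual routine (assuming the Makefile shipped in the csvquote source tree; adjust to however you installed it):

make
sudo make install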

Using any awk to replace a character or string within quoted fields:

$ awk 'BEGIN{FS=OFS="\""} {for (i=2; i<=NF; i+=2) gsub(/\|/,"@@@",$i)} 1' file
"Per Sara Porras.@@@, LLC"|column2_data|column3_data
column1_data|"column2@@@data"|"column3@@@data"
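
For the reverse step the question asks for, no quote-awareness is needed once the sorting is done: at that point @@@ only occurs inside quoted fields (assuming, as the question implies, that @@@ never appears in the original data), so a plain global substitution is enough. A sketch, with sorted_file standing in for whatever your sort step produced:

awk '{ gsub(/@@@/,"|") } 1' sorted_file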

Here's how to do the whole sorting part too using the Decorate-Sort-Undecorate idiom with any POSIX-compliant version of awk and sort:

$ cat tst.sh
#!/usr/bin/env bash

sep='|'
repl='@@@'

awk -v sep="$sep" -v repl="$repl" '
    BEGIN { FS=OFS="\"" }
    {
        # replace the separator inside quoted (even-numbered) fields
        for ( i=2; i<=NF; i+=2 ) {
            gsub("["sep"]", repl, $i)
        }
        # decorate: print one line per field as rowNr<sep>fieldNr<sep>value
        n = split($0, f, sep)
        for ( i=1; i<=n; i++ ) {
            gsub(/^"|"$/, "", f[i])
            print NR sep i sep f[i]
        }
    }
' "${@:--}" |
sort -t"$sep" -k1,1n -k3,3f -k2,2n |
awk -v sep="$sep" -v repl="$repl" '
    BEGIN { FS=sep }
    $1 != prev {
        if ( NR > 1 ) { print rec }
        rec = ""
        prev = $1
    }
    # undecorate: strip the leading rowNr<sep>fieldNr<sep> prefix
    { sub("^([^"sep"]*["sep"]){2}", "") }
    # restore the separator inside the value and re-quote the field if it now contains one
    gsub(repl, sep) { $0 = "\"" $0 "\"" }
    { rec = (rec == "" ? "" : rec sep) $0 }
    END { print rec }
'

$ ./tst.sh file
column2_data|column3_data|"Per Sara Porras.|, LLC"
column1_data|"column2|data"|"column3|data"

The above assumes your field separator (|) is a single character that can be enclosed in a bracket expression ([|]) to make it literal, that your replacement string (@@@) doesn't contain regexp metacharacters, and that neither of them contains the backreference metacharacter &.

If your input might contain @@@, you could use GNU tools that allow NUL (\0) record terminators for awk and sort, so you can have multi-line records and replace each | with a newline instead of @@@. Alternatively, do it all in GNU awk, since it has its own CSV-handling and sorting functionality, e.g.:

$ cat tst.sh
#!/usr/bin/env bash

awk -v FPAT='[^|]*|("([^"]|"")*")' -v OFS='|' '
    BEGIN {
        IGNORECASE = 1                          # sort case-insensitively
        PROCINFO["sorted_in"] = "@val_str_asc"  # traverse arrays in ascending order by value
    }
    {
        # strip the surrounding quotes from each field
        delete f
        for ( i=1; i<=NF; i++ ) {
            f[i] = gensub(/^"|"$/, "", "g", $i)
        }
        # rebuild the record with the fields in sorted order,
        # re-quoting any field that contains the separator
        rec = ""
        for ( i in f ) {
            q = ( f[i] ~ /\|/ ? "\"" : "" )
            rec = (rec == "" ? "" : rec OFS) q f[i] q
        }
        print rec
    }
' "${@:--}"

$ ./tst.sh file
column2_data|column3_data|"Per Sara Porras.|, LLC"
column1_data|"column2|data"|"column3|data"

For more information on operating on CSV files with awk, see whats-the-most-robust-way-to-efficiently-parse-csv-using-awk.


Using Raku (formerly known as Perl_6)

~$ raku -MText::CSV -e '
         my @a = csv(in => $*IN, sep => "|", escape_char => "", allow_loose_quotes => 1);   
         my $index = @a>>.[0].pairs.sort(*.value).map: *.key;  
         @a = @a.[$index.cache]; csv(in => @a, out => $*OUT, sep => "|");'  < file

Perhaps the best approach here is to use a real CSV parser. Raku is a programming language in the Perl family. Raku's Text::CSV module is developed by the same author/maintainer as Perl 5's Text::CSV_XS module (H. Merijn Brand).

The file is read in using the module's high-level csv() subroutine. Data is taken from $*IN (STDIN). The separator is set appropriately for the file. The two other input parameters, escape_char => "" and allow_loose_quotes => 1, accept quoting as-is from the input file (this is a trick to force double-quoting of any unescaped double quotes; see the first link at the bottom). The data is stored in @a, an array.

Once the data is stored in an array, you don't need to substitute @@@ in place of |. The code above gives the following sorted output (sorting here is assumed to be row-wise, based on column 1). Note that for row-wise sorting you create an $index based on the @a>>.[0] values in the first column, then apply the index row-wise with @a.[$index.cache].

Sample Input:

Z|"Per Sara Porras.|, LLC"|column2_data|column3_data
A|column1_data|"column2|data"|"column3|data"

Sample Output 1 (from code at top):

A|column1_data|"column2|data"|"column3|data"
Z|"Per Sara Porras.|, LLC"|column2_data|column3_data

Maybe you want row-wise sorting while still substituting @@@ in place of |. However, this has the side effect of removing quotes from any fields without whitespace in the output, as follows:

Sample Output 2:

~$ raku -MText::CSV -e '
         my @a = csv(in => $*IN, sep => "|", escape_char => "", allow_loose_quotes => 1);
         @a = @a>>.map(*.subst(:global, / \| /,  "@@@"));
         my $index = @a>>.[0].pairs.sort(*.value).map: *.key;
         @a = @a.[$index.cache]; csv(in => @a, out => $*OUT, sep => "|");'  <  file
A|column1_data|column2@@@data|column3@@@data
Z|"Per Sara Porras.@@@, LLC"|column2_data|column3_data

You could get around this by always quoting the output, i.e. by adding always-quote => True to the final csv() call. Let's combine this approach with a method for taking the separator from the environment: just look it up in Raku's %*ENV hash:

Sample Output 3:

~$ env ifs="|" raku -MText::CSV -e '
               my @a = csv(in => $*IN, sep => %*ENV<ifs>, escape_char => "", allow_loose_quotes => 1); 
               @a = @a>>.map(*.subst(:global, / \| /,  "@@@")); 
               my $index = @a>>.[0].pairs.sort(*.value).map: *.key; 
               @a = @a.[$index.cache]; csv(in => @a, out => $*OUT, sep => "|", always-quote => True);'  < file
"A"|"column1_data"|"column2@@@data"|"column3@@@data"
"Z"|"Per Sara Porras.@@@, LLC"|"column2_data"|"column3_data"

Note: You can get back the quoting you started with by replacing the embedded | (bar) character with a character sequence containing whitespace, such as @@ @@ (instead of @@@). The CSV parser only quotes whitespace-containing columns by default:

Sample Output 4:

~$ env ifs="|" raku -MText::CSV -e '
               my @a = csv(in => $*IN, sep => %*ENV<ifs>, escape_char => "", allow_loose_quotes => 1);
               @a = @a>>.map(*.subst(:global, / \| /,  "@@ @@"));
               my $index = @a>>.[0].pairs.sort(*.value).map: *.key;
               @a = @a.[$index.cache]; csv(in => @a, out => $*OUT, sep => "|");' < file
A|column1_data|"column2@@ @@data"|"column3@@ @@data"
Z|"Per Sara Porras.@@ @@, LLC"|column2_data|column3_data

Finally, if you really want column-wise sorting, the only way that makes sense is to set an index row, such as a header row. For ideas on column-wise sorting, see the second link below.

https://unix.stackexchange.com/a/775855/227738
https://unix.stackexchange.com/a/746864/227738
https://raku.org


You can do this using awk, but it isn't very pretty*. You could set a flag when you encounter a ", unset it at the closing ", and replace | only while the flag is set. Of course, this also assumes you can never have " within ", so nothing like "some \"field\" like this". That said, this does what you have asked for on your example data:

{
  line="";
  c=""
  # walk the line one character at a time
  for(i=1; i<=length($0); i++){
    c=substr($0,i,1);
    if(c == "\""){
      a = a ? 0 : 1          # toggle the "inside quotes" flag
    }
    else if(c == "|" && a == 1){
      c="@@@"                # replace | only while inside quotes
    }
    line=line ? line""c : c
  }
  print line
}

Save that as foo.awk and run it like this:

awk -f foo.awk < input.file

Or, here's the same thing as a one liner:

awk '{
  line="";
  c=""
  for(i=1; i<=length($0); i++){
    c=substr($0,i,1);
    if(c == "\""){
      a = a ? 0 : 1
    }
    else if(c == "|" && a == 1){
      c="@@@"
    }
    line=line ? line""c : c
  }
  print line
}'  file

The output when running on your example is:

$ awk '{ line=""; c=""; for(i=1; i<=length($0); i++){ c=substr($0,i,1); if(c == "\""){ a = a ? 0 : 1 } else if(c == "|" && a == 1){ c="@@@" } line=line ? line""c : c } print line}'  file
"Per Sara Porras.@@@, LLC"|column2_data|column3_data
column1_data|"column2@@@data"|"column3@@@data"

But really, this is a very bad idea. Just use a tool that understands CSV format instead.


* At any rate, the approach I came up with isn't very pretty. I suspect it's pretty much as good as it gets, bar minor stylistic improvements, but maybe someone else can come up with something better.
