2

I have a text log file

$ cat aaa
673                  20160405 root "/path_to/gis/20160401/20160301_placement_map_org.dbf" ""
673                  20160405 root "/path_to/gis/20160401/20160310_20160401ent_map_org.dbf" ""
790890               20170201 jle  "/path_to/gis/20160401/Pina (Asc) 20160401 Rapid Report.kmz" ""
5883710              20160406 dho  "/path_to/gis/20160401/20160401_Pina_Asc_Rapid_Report_Minesouth.pdf" ""
673                  20160405 dho  "/path_to/gis/20160401/20160310_20160401 placement map org.dbf" ""

Now I have this script to output just the full path of the files:

#!/bin/bash

function nodatechk() {
    arr=("$@")
    for ((i=3;i<${#arr[@]};i+=5));
    do
      echo "${i}" "${arr[i]}"
    done
}

r=( $(grep gis aaa) ) 

nodatechk "${r[@]}"

The output is broken because the 3rd line (and the 5th line) have a space in the element, even though the element is double-quoted.

How can I fix this? (BTW I know I can use awk or cut to print out columns, but in this case I just want to use grep.) Thanks.

7
  • Look at the order of expansions in man bash. $(grep gis aaa) undergoes word splitting, but literal double quotes aren't there, they're a result of the command substitution, so they don't influence the word splitting.
    – choroba
    Commented Dec 4, 2018 at 23:34
  • Note that awk '/gis/ && n++ % 5 == 3 {print n-1, $0}' < aaa is simpler and would be several orders of magnitude faster for large inputs. You generally don't want to use shell loops to process text. Commented Dec 5, 2018 at 8:34
  • You don't seem to be using grep to output any columns. It is unclear what you want to be doing with grep to output the column that you are interested in. awk would be a better choice for that. It's also unclear whether the columns are tab or space separated.
    – Kusalananda
    Commented Dec 6, 2018 at 17:13
  • I guess you want to print the fourth word (column), i.e., the pathname, from every line in the input file that contains gis (anywhere in the line), treating a string enclosed in quotes as a single word, even if it contains space(s). It would be nice if you (1) said so, (2) showed what output you want, and (3) included some lines in your file that don't contain gis; as it is, the whole idea of doing this with grep seems misguided. … (Cont’d) Commented Dec 7, 2018 at 0:00
  • (Cont’d) …  Also, it would be nice if you gave the bigger picture.  For example, are the columns separated by spaces or tabs (or a combination)?  If tabs (only), is it a single tab per column, or is it enough tabs to make the columns line up visually (regardless of how long the words are)?  And can the values contain tabs?  From your attempt, I guess you assume that every line will have exactly five columns — there will always be exactly one thing after the pathname.  But will it always be "", or might it be "foo" — or might it be "foo bar" (with embedded spaces)?  … (Cont’d) Commented Dec 7, 2018 at 0:00

3 Answers

4

The problem has its roots in this line:

 r=( $(grep gis aaa) )

You can see this immediately if you try:

 printf '<%s>\n' $(grep gis aaa)

The unquoted command substitution is split on the characters in "$IFS" (space, tab, and newline by default).

It also exposes the values from the file to globbing: words containing *, ? or […] may be replaced by matching filenames (which ones depends on the files in your current directory and on the state of several shell options).
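To make the splitting concrete, here is a minimal sketch. It assumes bash and uses a hypothetical variable `line` holding one record modeled on the question's log, instead of reading the file aaa. The double quotes are ordinary data in the command substitution's output, so they do not protect the embedded spaces:

```shell
#!/bin/bash
# Hypothetical sample record, modeled on the log in the question.
line='790890 20170201 jle "/path_to/gis/20160401/Pina (Asc) 20160401 Rapid Report.kmz" ""'

# Unquoted command substitution: the output is split on $IFS characters.
# The literal double quotes inside the data do not prevent the split.
words=( $(printf '%s\n' "$line") )

printf '%d words\n' "${#words[@]}"   # 9 words, not 5
printf '<%s>\n' "${words[@]}"
```

The pathname ends up spread over five separate elements, which is why stepping through the array in strides of 5 goes out of sync.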

One (not recommended) solution is to change IFS to the split character and disable globbing for the split:

 IFS=$'\n'; set -f; r=( $(grep gis aaa) )

But a simpler solution is to use what the shell already provides:

readarray -t r < <(grep gis aaa)

That will split on newlines (assuming there are no newlines in the pathnames).
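As a quick check of that behaviour, the sketch below assumes bash 4 or later (for readarray) and writes two hypothetical sample records to a temporary file rather than using aaa. The array name `lines` is only for illustration:

```shell
#!/bin/bash
# Hypothetical two-record sample written to a temporary file.
tmp=$(mktemp)
printf '%s\n' \
  '673 20160405 root "/path_to/gis/20160401/20160301_placement_map_org.dbf" ""' \
  '790890 20170201 jle "/path_to/gis/20160401/Pina (Asc) 20160401 Rapid Report.kmz" ""' \
  > "$tmp"

# One array element per matching line; -t strips the trailing newline.
# No word splitting or globbing is applied to the contents of each line.
readarray -t lines < <(grep gis "$tmp")
rm -f "$tmp"

printf '%d lines\n' "${#lines[@]}"
printf '<%s>\n' "${lines[@]}"
```

Each element holds a whole line, spaces included, so the second record stays in one piece.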

Then, to avoid splitting each line again into its parts (which could expose the line to whitespace splitting and globbing), let's remove the leading and trailing parts of each line.

If from each line we remove everything from the beginning up to the "/ (double quote and slash) and everything from the "  (double quote and space) to the end, we get a clean pathname:

 #!/bin/bash

 function nodatechk() {
    for l do
        l="/${l#*\"/}"                # Remove leading text up to `"/`
        l=${l%\" *}                   # Remove trailing text from `" `
        printf '%s\n' "$l"
    done
 }

 readarray -t r < <(grep gis aaa)

 nodatechk "${r[@]}"
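
To see what each of the two expansions does, here is a step-by-step trace on one hypothetical record. The variable names `step1` and `step2` are only for illustration, and the sketch assumes, as the function above does, that the pathname starts with / and is followed by `" ` (double quote and space):

```shell
#!/bin/bash
# Hypothetical sample record, modeled on the log in the question.
l='790890 20170201 jle "/path_to/gis/20160401/Pina (Asc) 20160401 Rapid Report.kmz" ""'

# ${l#*\"/} removes the shortest prefix ending in "/ ; the slash is put back.
step1="/${l#*\"/}"

# ${step1%\" *} removes the shortest suffix starting with a quote and a space.
step2=${step1%\" *}

printf '%s\n' "$step1"
printf '%s\n' "$step2"
```

The first line printed still carries the trailing `" ""`; the second is the bare pathname with its spaces intact.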
0

A grep-only solution is

grep gis aaa | grep -o '^[^"]*"[^"]*"' | grep -o '"[^"]*"$'

The first grep is the same as what you have in the question.  Obviously, it selects lines that contain gis (anywhere in the line).  The second grep,

grep -o '^[^"]*"[^"]*"'

matches everything up through (and including) the first quoted string on the line (i.e., columns 1 through 4), and, because of the -o option, outputs only those words.  The third grep,

grep -o '"[^"]*"$'

matches the last quoted string on the line (which, at this point, is column 4 from the original line) and outputs only that string.
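
As a sketch of the pipeline in action, the snippet below feeds it one hypothetical record from a variable instead of the file aaa. It assumes a grep that supports -o (GNU grep was assumed here; -o is not required by POSIX):

```shell
#!/bin/bash
# Hypothetical sample record, modeled on the log in the question.
sample='790890 20170201 jle "/path_to/gis/20160401/Pina (Asc) 20160401 Rapid Report.kmz" ""'

# Stage 1 selects lines containing gis, stage 2 keeps everything up to the
# end of the first quoted string, stage 3 keeps the quoted string at the end.
out=$(printf '%s\n' "$sample" \
        | grep gis \
        | grep -o '^[^"]*"[^"]*"' \
        | grep -o '"[^"]*"$')

printf '%s\n' "$out"
```

Note that the result still includes the surrounding double quotes, since they are part of what the last pattern matches.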


P.S. If your file has one tab between each pair of columns, and the values don't contain tabs, the simple way to get the fourth column is

awk -F'\t' '/gis/ { print $4 }' aaa
-1

I read this post and solved the problem by using 'eval'. I changed this line:

r=( $(grep gis aaa) )

to

eval r="( $(grep gis aaa) )"
