CSV Splitter in Bash

Question

I am splitting a CSV file so that the first 3 columns are common to all the output files.

input file:

h1 h2 h3 o1 o2 ....
a  b  c  d  e  ....
a1 b1 c1 d1 e1 ....

output files:

o1.csv:

h1 h2 h3 o1
a  b  c  d  
a1 b1 c1 d1

o2.csv:

h1 h2 h3 o2
a  b  c  e
a1 b1 c1 e1

So if there are n columns in the input file, the code creates n-3 output files. However, my code is inefficient and quite slow: it takes 20 seconds for 50000 rows.

old_IFS=$IFS
START_TIME=`date`
DELIMITER=,           

# reading and writing headers    
headers_line=$(head -n 1 "$csv_file")
IFS=$DELIMITER read -r -a headers <<< $headers_line
common_headers=${headers[0]}$DELIMITER${headers[1]}$DELIMITER${headers[2]}

for header in "${headers[@]:3}"
do
   # writing headers to every file
   echo $common_headers$DELIMITER$header > "$header$START_TIME".csv
done

# reading csv file line by line
i=1
while IFS=$DELIMITER read -r -a row_data
do
    test $i -eq 1 && ((i++)) && continue      # ignoring headers

    j=0

    common_data=${row_data[0]}$DELIMITER${row_data[1]}$DELIMITER${row_data[2]}
    for val in "${row_data[@]:3}"
    do
        #  appending row to every new csv file
        echo $common_data$DELIMITER$val >> "${headers[(($j+3))]}$START_TIME".csv
        ((j++)) 
    done

done < $csv_file
IFS=${old_IFS}

Any suggestions are appreciated.

janos · Accepted Answer · 2017-05-22 06:32:49Z

Bash is not efficient for processing large files line by line. For small data it's fine, but when a script starts to feel heavy, it's good to look for alternatives. Also note that processing a file line by line and breaking each line into columns is not easy to get right; I bet you spent quite some time on this. You wrote it well, but the result is not particularly easy to read, and I'm afraid this is as good as it gets with Bash.

So what's the alternative? Try cut in a loop. Yes, that implies reading the file n-3 times, but I bet it will be faster than the pure Bash solution. It will also be nicely readable, which is an extremely important benefit.
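As a rough illustration of the cut-in-a-loop idea, here is a minimal sketch. The function name split_csv, the output directory argument, and the sample file are hypothetical, and the timestamp suffix from the question's file names is left out for brevity. It assumes a single-character delimiter and no quoted fields containing the delimiter, the same assumption the original script makes.

```shell
#!/usr/bin/env bash
# Sketch only: split_csv and the sample data are hypothetical names
# introduced for illustration, not part of the original script.

split_csv() {
    local csv_file=$1 out_dir=$2 delimiter=${3:-,}
    local -a headers
    local i

    # Read only the header line to learn the column names.
    IFS=$delimiter read -r -a headers < "$csv_file"

    # Columns 1-3 are common; every later column gets its own file.
    # cut keeps the header row too, so no separate header step is needed.
    for ((i = 3; i < ${#headers[@]}; i++)); do
        cut -d "$delimiter" -f "1-3,$((i + 1))" "$csv_file" \
            > "$out_dir/${headers[i]}.csv"
    done
}

# Usage with a small sample file in a temporary directory.
work_dir=$(mktemp -d)
printf '%s\n' 'h1,h2,h3,o1,o2' 'a,b,c,d,e' 'a1,b1,c1,d1,e1' > "$work_dir/input.csv"
split_csv "$work_dir/input.csv" "$work_dir"
cat "$work_dir/o2.csv"
```

Each cut call rereads the whole input, so the file is read n-3 times, but the per-line work happens inside cut rather than in a Bash loop.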

A few notes about technique:

Use $(...) instead of `...`.
You took care to save IFS and restore it at the end, but that was unnecessary: when you write var=... somecmd, the value of var is set only in the environment of somecmd and stays unchanged in the current script. That being said, what you did is safe, so it's fine.
The incrementing i variable in the loop is a bit misleading, because i is a common name in counting loops, and at first I thought the count itself had some purpose. It doesn't: the variable is used only to distinguish the first line from the others. I would write it differently, to make the intention perfectly obvious.
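The notes above can be sketched roughly as follows. The sample data and the variable names ifs_before and row_count are hypothetical, added only for the demonstration. Reading the header once before the loop is one possible way to make the "skip the first line" intention explicit, not necessarily the only one.

```shell
#!/usr/bin/env bash
# Sketch only: the sample data and helper variable names are hypothetical.
csv_file=$(mktemp)
printf '%s\n' 'h1,h2,h3,o1' 'a,b,c,d' 'a1,b1,c1,d1' > "$csv_file"

# $(...) instead of backticks: easier to read and to nest.
start_time=$(date)

ifs_before=$IFS
row_count=0

{
    # Consume the header line once, before the loop, instead of
    # counting lines inside the loop to detect the first one.
    IFS=, read -r -a headers

    # IFS=, is set only for each read call, not for the script.
    while IFS=, read -r -a row_data; do
        row_count=$((row_count + 1))
    done
} < "$csv_file"

# IFS in the script itself is unchanged, so no save/restore is needed.
[[ $IFS == "$ifs_before" ]] && echo "IFS unchanged"
echo "headers: ${#headers[@]}, data rows: $row_count"
```

The braces group the header read and the loop so that both consume the same redirected input in the current shell, which keeps headers and row_count visible after the block.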
