Fixing a .csv file where some rows have missing columns

Question

I am currently working with data from many merged .csv files. Unfortunately, the merges are sometimes faulty.

This is best explained by this example:

"var1", "var2", "var3", "var4", "var5"
"2001", "yellow", "123", "abc", "bcdefg"
"2002", "yellow", "123", "abw", "asdfkl"
"2001", "green", "abe"
"2002", "green", "abp"
"2001", "blue", "324", "abx", "badsf"
"2002", "blue", "231", "abl", "cpq"

So in lines 3 and 4, the values for var3 and var5 are missing. The errors of which variables (columns) are missing is always the same.

I want my csv to look like this:

"var1", "var2", "var3", "var4", "var5"
"2001", "yellow", "123", "abc", "bcdefg"
"2002", "yellow", "123", "abw", "asdfkl"
"2001", "green", , "abe" ,
"2002", "green", , "abp" ,
"2001", "blue", "324", "abx", "badsf"
"2002", "blue", "231", "abl", "cpq"

So now lines 3 and 4 actually have missing values for var3 and var5. The errors do not always occur for "green" (as in this example); they could also affect another group.

My idea would be that the lines are scanned for columns and if there is not the same amount of columns as in the header ("var1", "var2", etc.) then the new empty columns are added.

I will have to do this for many different files, but once I have an idea of how to do this, I think I can write a bash script loop.

[edit]: I want to clarify that the dataset is quite big, with at least 19 variables (columns). (Another file that I need to check has over 60 variables.)

Right now I am thinking of a solution with awk. Something like this:

awk -F, -v OFS=, 'NF < 19 { $7 = $7 OFS "#NA" OFS "#NA" } 1' file1 > file2

Here it should insert two columns after the 7th column if a row has fewer than the 19 columns it should have. Will try this later...

Can you explain why "2001", "green", "abe" should become "2001", "green", , "abe" , and not "2001", "green","abe" , , ? — RomanPerekhrest, Commented May 22, 2018 at 14:21
Because there is actually a "frameshift" that happens. The "abe" information is in var4 and not in var3. Think of it this way: the initial data for the group "green" simply has no information on var3 and var5 so there are no columns. In the merged file which I have, this leads to this error. — TobiasGold, Commented May 22, 2018 at 14:31
Can't you use a placeholder like "#" for the empty variables, so that your output file has all fields filled, which eases the process of field matching? — vfbsilva, Commented May 22, 2018 at 14:36
Hmm actually that should be OK. So something like "2001", "green", "#", "abe","#" would also be fine! — TobiasGold, Commented May 22, 2018 at 14:50
@TobiasGold I'm glad to help. If you can't log back in to the unregistered account, you could contact support or flag the post for moderator attention and ask them to merge the accounts. — Norrius, Commented May 24, 2018 at 9:41

Norrius · Accepted Answer · 2018-05-23 11:56:49Z

The simplest thing that comes to mind is to split the lines on commas and insert extra commas where there are only two of them. The obvious limitation is that if you have commas in the actual values, this will break.

$ cat test.csv | sed -r 's/^([^,]*),([^,]*),([^,]*)$/\1,\2, ,\3, /g'
"var1", "var2", "var3", "var4", "var5"
"2001", "yellow", "123", "abc", "bcdefg"
"2002", "yellow", "123", "abw", "asdfkl"
"2001", "green", , "abe", 
"2002", "green", , "abp", 
"2001", "blue", "324", "abx", "badsf"
"2002", "blue", "231", "abl", "cpq"
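To illustrate the limitation with commas inside values, here is a minimal Python sketch. The row is made up for the illustration (it is not from the question's data): its second value contains a comma. Such a row has three commas, so the sed pattern above would not match it and the row would be left unchanged.

```python
import csv
import io

# Hypothetical row (not from the question): the second value contains a comma.
line = '"2001", "green, dark", "abe"\n'

# Naive approach: the comma inside the quotes is treated as a separator.
naive_fields = line.strip().split(',')

# csv.reader respects the quotes; skipinitialspace drops the blank
# that follows each separating comma.
parsed_fields = next(csv.reader(io.StringIO(line), skipinitialspace=True))

print(len(naive_fields))  # 4 pieces, although the row has 3 values
print(parsed_fields)      # ['2001', 'green, dark', 'abe']
```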

For something more general, I would probably write a Python script (it has CSV capabilities built-in). For example, this reads CSVs from stdin and outputs to stdout:

#!/usr/bin/env python
import sys
import csv

missing = [3, 5]  # 1-indexed positions of missing values
missing.sort()  # enforce the increasing order
reader = csv.reader(sys.stdin, delimiter=',', skipinitialspace=True)
writer = csv.writer(sys.stdout)
header = next(reader)  # get first row (header)
writer.writerow(header)  # write it back
for row in reader:
    if len(row) < len(header):
        # row shorter than header -> insert empty strings
        # inserting changes indices so `missing` must be sorted
        for idx in missing:
            row.insert(idx - 1, '')
    writer.writerow(row)

The benefit of using a real CSV parser is that it correctly handles commas or quotes in values and other edge cases. The output format will also be a correct CSV, but a bit different from what you had:

$ cat test.csv | python test.py 
var1,var2,var3,var4,var5
2001,yellow,123,abc,bcdefg
2002,yellow,123,abw,asdfkl
2001,green,,abe,
2002,green,,abp,
2001,blue,324,abx,badsf
2002,blue,231,abl,cpq

As you can see, there are no superfluous quotes or spaces after commas. If you really need them, I can look into configuring the CSV dialect for the writer.
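If the quotes are needed, one option is to pass `quoting=csv.QUOTE_ALL` to the writer. Below is a rough sketch of that, wrapped in a function so it can be reused for files with different missing columns. The name `pad_rows` and its parameters are my own choices for illustration, and the check on the row length is an extra assumption (short rows always lack exactly the listed columns). Note that `QUOTE_ALL` also writes the inserted empty values as `""`, and there will still be no spaces after the commas.

```python
import csv
import io


def pad_rows(src, dst, missing, quoting=csv.QUOTE_ALL):
    """Copy CSV from src to dst, inserting empty fields into short rows.

    missing: 1-indexed positions of the columns absent from short rows.
    (pad_rows is a hypothetical helper, not part of the csv module.)
    """
    missing = sorted(missing)  # inserting shifts indices, so go left to right
    reader = csv.reader(src, skipinitialspace=True)
    writer = csv.writer(dst, quoting=quoting, lineterminator='\n')
    header = next(reader)
    writer.writerow(header)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) == len(header) - len(missing):
            for idx in missing:
                row.insert(idx - 1, '')
        elif len(row) != len(header):
            # Assumption: any other length means the file is broken differently.
            raise ValueError('line %d: unexpected number of fields: %d'
                             % (reader.line_num, len(row)))
        writer.writerow(row)


# Small demonstration with part of the sample data from the question.
sample = ('"var1", "var2", "var3", "var4", "var5"\n'
          '"2001", "yellow", "123", "abc", "bcdefg"\n'
          '"2001", "green", "abe"\n')
out = io.StringIO()
pad_rows(io.StringIO(sample), out, missing=[3, 5])
print(out.getvalue())
```

To use it as a filter like the script above, call `pad_rows(sys.stdin, sys.stdout, [3, 5])` after `import sys`.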

@SivaPrasath The question specifies that it's always the same columns that are missing. — Norrius, Commented May 23, 2018 at 9:45
I'm not sure, because "My idea would be that the lines are scanned for columns and if there is not the same amount of columns as in the header ("var1", "var2", etc.) then the new empty columns are added." — Siva, Commented May 23, 2018 at 9:56
@SivaPrasath "The errors of which variables (columns) are missing is always the same". — Norrius, Commented May 23, 2018 at 10:00
