I am currently working with data from many merged .csv files. Unfortunately, those merges are sometimes faulty.
This is best explained by an example:
"var1", "var2", "var3", "var4", "var5"
"2001", "yellow", "123", "abc", "bcdefg"
"2002", "yellow", "123", "abw", "asdfkl"
"2001", "green", "abe"
"2002", "green", "abp"
"2001", "blue", "324", "abx", "badsf"
"2002", "blue", "231", "abl", "cpq"
So in data rows 3 and 4 (the "green" rows) the values for var3 and var5 are missing. Which variables (columns) are missing is always the same.
I want my csv to look like this:
"var1", "var2", "var3", "var4", "var5"
"2001", "yellow", "123", "abc", "bcdefg"
"2002", "yellow", "123", "abw", "asdfkl"
"2001", "green", , "abe" ,
"2002", "green", , "abp" ,
"2001", "blue", "324", "abx", "badsf"
"2002", "blue", "231", "abl", "cpq"
So now data rows 3 and 4 actually have empty values for var3 and var5. The errors do not always happen for (in this example) "green"; they could also occur for another group.
My idea would be to scan each line and count its columns; if a line does not have the same number of columns as the header ("var1", "var2", etc.), the missing empty columns are inserted.
I will have to do this for many different files, but once I have an idea of how to do it for one file, I think I can wrap it in a bash loop, as sketched below.
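For the loop part, something like this might work, assuming the per-file fix ends up in an awk script (fix_columns.awk and the _fixed suffix are just placeholder names):

for f in *.csv; do
    awk -f fix_columns.awk "$f" > "${f%.csv}_fixed.csv"   # keep originals, write fixed copies
done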
[edit]: I want to clarify that the dataset is quite big, with at least 19 variables (columns). (Another file I need to check has over 60 variables.)
Right now I am thinking of a solution with awk. Something like this:
awk 'BEGIN { FS = OFS = ", " } NF < 19 { $7 = $7 OFS "#NA" OFS "#NA" } { print }' file1 > file2
Here it should insert two #NA columns after the 7th column if a line does not have the 19 columns it should have. (The #NA has to be inside a quoted string, since a bare # starts a comment in awk, and the final { print } is needed to actually output the lines.) Will try this later...
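Since the expected column count differs between files (19 here, over 60 in the other file), a variant that takes the width from the header row instead of hard-coding 19 might save per-file edits. A minimal sketch, assuming the separator is literally ", " and the insertion point stays after the 7th column:

awk 'BEGIN { FS = OFS = ", " }
     NR == 1 { cols = NF }                      # the header row defines the expected width
     NF < cols { $7 = $7 OFS "#NA" OFS "#NA" }  # pad short rows after the 7th column
     { print }' file1 > file2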
"2001", "green", "abe"
should become as"2001", "green", , "abe" ,
but not"2001", "green","abe" , ,
?"abe"
information is invar4
and not invar3
. Think of it this way: the initial data for the group "green" simply has no information onvar3
andvar5
so there are no columns. In the merged file which I have, this leads to this error."2001", "green", "#", "abe","#"
would also be fine!
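For the 5-column example above, that placeholder version could look like this (again just a sketch, assuming the literal ", " separator and that short rows are always missing var3 and var5):

awk 'BEGIN { FS = OFS = ", " }
     NF == 5 { print; next }                  # header and complete rows pass through
     { print $1, $2, "\"#\"", $3, "\"#\"" }   # short rows: var3 and var5 were absent
    ' file1 > file2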