combining multiple files by the first column

Question

I have more than fifty files in a directory, each with a distinct name. For example:

File1:

Type,A,
RR,1,
CD,2,

File2:

Type,B,
CD,2,
FG,3,

File3:

Type,C,
RR,5,
FG,8,
QR,9,

Desired output

Type,A,B,C,
CD,2,2,,
FG,,3,8,
QR,,,9,
RR,1,,5,

I tried join and paste, but had no luck. Any suggestions?

@Kusalananda I have tried that, but my scenario is a bit different from that question. First, the number of files will increase and the file names are distinct. Second, join may return only the common values.
– Siva, Commented Sep 20, 2018 at 12:23
The accepted answer in the duplicate gives a generic function for joining any number of files. The join command is able to return all lines from both files with -a 1 -a 2.
– Kusalananda ♦, Commented Sep 20, 2018 at 12:34
@Kusalananda Thanks, it works, but with one small issue: with -a 1 -a 2 I get QR,9, instead of the desired output QR,,,9,
– Siva, Commented Sep 20, 2018 at 13:12
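
The padding issue from the last comment can be reproduced with two files. The following is only a sketch under simplifying assumptions: the sample files a and b are hypothetical, have no header line and no trailing comma, and are already sorted on the first field. Without an -o list, join prints unpaired lines as they are, so the columns do not line up. With an explicit -o list, every output line has the same number of fields.

```shell
# Two small, pre-sorted sample files (hypothetical, no header, no trailing comma).
tmp=$(mktemp -d)
printf 'CD,2\nRR,1\n' > "$tmp/a"
printf 'CD,2\nFG,3\n' > "$tmp/b"

# -a 1 -a 2 keeps unpaired lines from both files, but does not pad them.
unpadded=$(LC_ALL=C join -t, -a 1 -a 2 "$tmp/a" "$tmp/b")

# -o 0,1.2,2.2 asks for: join field, field 2 of file 1, field 2 of file 2.
# Fields that do not exist for an unpaired line are printed as empty.
padded=$(LC_ALL=C join -t, -a 1 -a 2 -o 0,1.2,2.2 "$tmp/a" "$tmp/b")

printf '%s\n---\n%s\n' "$unpadded" "$padded"
rm -rf "$tmp"
```

The first result contains FG,3 and RR,1, while the second contains FG,,3 and RR,1, with the empty fields in place. Because the -o list has to name every column explicitly, this gets unwieldy as the number of files grows.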

glenn jackman · Accepted Answer · 2018-09-20 16:58:21Z

Here's some fairly tricky awk. GNU awk (gawk) is required because the script uses arrays of arrays:

gawk -F, '
    NR  == 1 {n=1; header[n] = $1}
    FNR == 1 {n++; header[n] = $2; next}

    !($1 in data) {data[$1][1] = $1}
    {data[$1][n] = $2}

    # from https://www.gnu.org/software/gawk/manual/html_node/Join-Function.html
    function join(array, start, end, sep,    result, i)
    {
        if (sep == "")
            sep = " "
        else if (sep == SUBSEP) # magic value
            sep = ""
        result = array[start]
        for (i = start + 1; i <= end; i++)
            result = result sep array[i]
        return result
    }

    END {
        print join(header, 1, n, FS)
        PROCINFO["sorted_in"] = "@ind_str_asc"   # for sorted output
        for (type in data)
            print join(data[type], 1, n, FS)
    }
' file{1,2,3}

Output:

Type,A,B,C
CD,2,2,
FG,,3,8
QR,,,9
RR,1,,5

I'm assuming that each file has 2 columns, so it's not completely generic.

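A version that does not rely on GNU awk (tested with mawk). Note that the data rows come out in no particular order here, because the order in which for (type in key) visits the array is unspecified: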

mawk -F, '
    NR  == 1 {n=1; header[n] = $1}
    FNR == 1 {n++; header[n] = $2; next}
    {key[$1]; data[$1,n] = $2}
    END {
        for (i=1; i<=n; i++)
            printf "%s%s", header[i], (i==n ? ORS : FS)
        for (type in key) {
            printf "%s%s", type, FS
            for (i=2; i<=n; i++)
                printf "%s%s", data[type,i], (i==n ? ORS : FS)
        }
    }
' file{1,2,3}
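
If sorted rows are wanted without gawk, one option is to sort outside awk. The following is a minimal sketch, not part of the original answer: it recreates the question's sample files in a scratch directory, runs the same portable awk program, keeps the header line first, and passes the remaining rows through sort. The variable names tmp and merged are only for illustration.

```shell
# Recreate the three sample files from the question in a scratch directory.
tmp=$(mktemp -d)
printf 'Type,A,\nRR,1,\nCD,2,\n'        > "$tmp/File1"
printf 'Type,B,\nCD,2,\nFG,3,\n'        > "$tmp/File2"
printf 'Type,C,\nRR,5,\nFG,8,\nQR,9,\n' > "$tmp/File3"

merged=$(
  awk -F, '
      NR  == 1 { n = 1; header[n] = $1 }        # first line overall: key column name
      FNR == 1 { n++; header[n] = $2; next }    # first line of each file: new column
      { key[$1]; data[$1, n] = $2 }             # remember the key and its value
      END {
          for (i = 1; i <= n; i++)
              printf "%s%s", header[i], (i == n ? ORS : FS)
          for (type in key) {                   # unspecified order, sorted later
              printf "%s%s", type, FS
              for (i = 2; i <= n; i++)
                  printf "%s%s", data[type, i], (i == n ? ORS : FS)
          }
      }
  ' "$tmp/File1" "$tmp/File2" "$tmp/File3" |
  {
      # Pass the header through untouched, then sort the rest by the first field.
      IFS= read -r header
      printf '%s\n' "$header"
      LC_ALL=C sort -t, -k1,1
  }
)
printf '%s\n' "$merged"
rm -rf "$tmp"
```

With the sample files this should print the same five lines as the gawk version, header first and rows in CD, FG, QR, RR order.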

According to the NEWS file in git, arrays of arrays were added in gawk 4.0, and according to the NEWS check-in log in CVS, 3.1.7 was released in 2009. Can you upgrade?
– glenn jackman, Commented Sep 20, 2018 at 14:36

Michael Vehrs · Accepted Answer · 2018-09-20 14:48:24Z

This is not particularly hard to do even without real multi-dimensional arrays:

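BEGIN { FS = "," }    # the input is comma-separated

# Header line: remember this file's column name.
# Comparing $1 exactly avoids matching data lines that merely contain "Type".
$1 == "Type" { type = $2; types[$2] = 1; next }

# Data line: store the value under (column name, row key).
{ data[type,$1] = $2; keys[$1] = 1 }

END {
    m = asorti(types)    # gawk only: types[1..m] now holds the sorted column names
    value = "Type"
    for (i = 1; i <= m; i++) {
        value = value "," types[i]
    }
    print value
    n = asorti(keys)     # gawk only: keys[1..n] now holds the sorted row keys
    for (i = 1; i <= n; i++) {
        value = keys[i]
        for (k = 1; k <= m; k++) {
            value = value "," data[types[k],keys[i]]
        }
        print value
    }
}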

However, you still need GNU awk for the sorting functions.
