Using awk array values as column indexes

Question

Data:

EMAIL,NAME,KEY,LOCATION
[email protected],Joe,ABC,Denver
[email protected],Jane,EFD,Denver
...

Overall goal: a script that takes the names of the fields I care about and produces one file per unique combination of values in those columns. E.g.:

myScript.sh NAME LOCATION

Produces:

Joe_Denver.csv - contains all lines with "Joe" and "Denver" in the NAME and LOCATION columns
Jane_Denver.csv - contains all lines with "Jane" and "Denver" in the NAME and LOCATION columns

What I have so far:

A Bash script that takes an arbitrary number of field names and stores them in an array
It finds the column index numbers of those fields and stores them in a second array

I'm trying to:

use AWK to take in the array of indexes, print all the unique combinations of the fields I specified, and store those in an array
iterate through that array of field combinations, writing a file for each combination that contains all lines in the data that have those values in those columns

My AWK command for the 1st step would look something like:

awk -F, -v colIdxs="${bashIdxs[*]}" '!seen[$colIdxs[*]]++ {print $colIdxs[*]}'

That is, I'm hoping to use the indexes stored in bashIdxs as column indexes inside an awk script (where bashIdxs can be of arbitrary size).

How would this be done? In addition, if there's a better way to accomplish what I'm trying to do (I'm sure there is), I'd love to know out of curiosity as well.
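For the literal form of the question (field numbers already computed in the shell), one possible approach is to pass them as a single space-separated string and split() it inside awk, since awk cannot receive a shell array directly. This is only a sketch: the index values 2 and 4, the variable name combos, and the example.com addresses are assumptions made for illustration.

```shell
# Space-separated field numbers; in the script described above this string
# would come from "${bashIdxs[*]}". The values 2 and 4 (NAME, LOCATION) are assumed.
colIdxs="2 4"

combos=$(printf '%s\n' \
    'EMAIL,NAME,KEY,LOCATION' \
    'joe@example.com,Joe,ABC,Denver' \
    'jane@example.com,Jane,EFD,Denver' \
    'joe2@example.com,Joe,XYZ,Denver' |
awk -F, -v OFS=, -v colIdxs="$colIdxs" '
BEGIN { n = split(colIdxs, idx, " ") }    # idx[1]=2, idx[2]=4
NR > 1 {                                  # skip the header line
    key = ""
    for (j = 1; j <= n; j++)
        key = (j == 1 ? "" : key OFS) $(idx[j])   # $(idx[j]) is field number idx[j]
    if (!seen[key]++)                     # print each combination only once
        print key
}')

printf '%s\n' "$combos"
```

With this sample input the unique combinations are Joe,Denver and Jane,Denver, in order of first appearance.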

Ed Morton · Accepted Answer · 2018-03-07 21:05:51Z

2

Untested but will be close if not exactly right:

colNames="$*"
awk -v colNames="$colNames" '
BEGIN {
    split(colNames,tmp)
    for (i in tmp) {
        names[tmp[i]]
    }
    FS=OFS=","
}
NR==1 {
    for (i=1; i<=NF; i++) {
        if ($i in names) {
            f[++nf] = i   # store the field number so $(f[i]) selects that column
        }
    }
    hdr = $0
    next
}
{
    out = ""
    for (i=1; i<=nf; i++) {
        out = (out=="" ? "" : out "_") $(f[i])
    }
    out = out ".csv"
    if ( !seen[out]++ ) {
        print hdr > out
    }
    print > out
}
' file

You'll need to change print > out to print >> out; close(out) if you aren't using GNU awk and get a "too many open files" error.
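If the "too many open files" case applies, the append-and-close variant might look like the sketch below. It follows the program above, storing the field number in f[] so that $(f[i]) selects a column, and closes each output file after every write. The file name people.csv, the temporary directory, and the example.com addresses are made up for illustration.

```shell
workdir=$(mktemp -d)        # scratch directory so the sketch leaves no stray files
cd "$workdir" || exit 1

# Sample data modeled on the question (addresses are invented).
cat > people.csv <<'EOF'
EMAIL,NAME,KEY,LOCATION
joe@example.com,Joe,ABC,Denver
jane@example.com,Jane,EFD,Denver
joe2@example.com,Joe,XYZ,Denver
EOF

awk -v colNames="NAME LOCATION" '
BEGIN {
    split(colNames, tmp, " ")
    for (i in tmp) names[tmp[i]]       # set of requested header names
    FS = OFS = ","
}
NR == 1 {
    for (i = 1; i <= NF; i++)
        if ($i in names) f[++nf] = i   # remember the field number
    hdr = $0
    next
}
{
    out = ""
    for (i = 1; i <= nf; i++)
        out = (out == "" ? "" : out "_") $(f[i])
    out = out ".csv"
    if (!seen[out]++) {
        print hdr > out                # first sight: truncate, write header
        close(out)
    }
    print >> out                       # append the current record
    close(out)                         # keep at most one output file open
}
' people.csv

head -- *_*.csv
```

Closing after every write keeps the number of open files at one, at the cost of reopening the file for each input line, so it is likely to be slower on large inputs.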

edited Mar 7, 2018 at 21:05

answered Mar 7, 2018 at 21:00

Ed Morton


8 Comments

jkang Over a year ago

Thanks! would you mind annotating what some of these lines mean? I'm new to AWK and find I can't really follow along. Particularly: names[tmp[i]] - what does this do? and later on: if ($i in names) { - $i is a column number; does that mean names[tmp[i]] stores all column numbers? f[++nf] = $i; - creates array of stored column numbers? if ( !seen[out]++ ) { print hdr > out }

Ed Morton Over a year ago

Sorry I don't have time to do that but if you try to follow it and read the man pages and THEN have any specific questions about any parts of it I'll be happy to answer them.

jkang Over a year ago

Sorry about that, hit the "enter" button by mistake, now it won't let me edit after 5 min...

Ed Morton Over a year ago

names[tmp[i]] populates names[] indexed by the values stored in tmp[], so names[] ends up holding the set of requested column names such as "NAME". f[] maps the desired output column numbers to the input column numbers, so what it needs to store is the field number i rather than the header text $i. !seen[foo]++ is a common awk idiom to do something the first time foo occurs. In this case that something is to print the header line to the new output file.

jkang Over a year ago

Also: does print > out automatically print $0?
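The idioms discussed in these comments can be tried in isolation. The sketch below uses made-up one-letter input and the hypothetical array name want: referencing want[x] creates the element (so the array acts as a set), "in" tests membership without creating anything, !seen[x]++ is true only the first time x occurs, and a bare print outputs the whole current record ($0).

```shell
result=$(printf '%s\n' b a b c a |
awk '
BEGIN {
    split("a c", tmp, " ")
    for (i in tmp) want[tmp[i]]       # merely referencing an element creates it
}
($1 in want) && !seen[$1]++ {         # wanted value, seen for the first time
    print                             # no arguments: prints $0
}')

printf '%s\n' "$result"
```

Here only the first a and the first c pass the filter; b is never in want, and the second a is rejected by !seen[$1]++.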


karakfa · Accepted Answer · 2018-03-07 20:57:32Z

1

awk to the rescue!

$ awk -F, -v cols='NAME,LOCATION' '
        NR==1 {for(i=1;i<=NF;i++) if(FS cols FS ~ FS $i FS) sel[i]; h=$0; next}
              {key=""; 
               for(i=1;i<=NF;i++) if(i in sel) key=(key==""?$i:key"_"$i); file=key".csv"; 
               if(!(key in header)) {print h > file; header[key]} 
               print > file}' file

gives

$ head *_*.csv
==> Jane_Denver.csv <==
EMAIL,NAME,KEY,LOCATION
[email protected],Jane,EFD,Denver

==> Joe_Denver.csv <==
EMAIL,NAME,KEY,LOCATION
[email protected],Joe,ABC,Denver

NB. if there are too many files open for your OS (based on input data and number of unique keys), you may need to close files...

answered Mar 7, 2018 at 20:57

karakfa


5 Comments

jkang Over a year ago

Thanks! if(FS cols FS ~ FS $i FS) - what does this do? I realize it's a regular expression compare but where did "cols" come from? If I'm right, this compares it to ",colNumber," right?

karakfa Over a year ago

cols='NAME,LOCATION' is the list of columns given as input. The test checks whether each column header in the file matches one of the selected columns.

jkang Over a year ago

Stupid of me, I missed that part where the variable was set. Thanks!

jkang Over a year ago

Another question about if(FS cols FS ~ FS $i FS), suppose I had fields "NAME" and "LAST_NAME" but I only want to match "NAME", would the regex match only NAME? And what is the logic behind that?

karakfa Over a year ago

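Yes, it will match only what is provided. Both sides are wrapped in FS, so the header is looked up as ",NAME," inside ",NAME,LOCATION,", and ",LAST_NAME," does not occur there. If you want to match only NAME, just put cols='NAME'; if there is more than one column, provide a comma-separated list as in the example.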
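The effect of wrapping both sides in FS can be checked with a small experiment. This is a sketch with a made-up header line and the hypothetical variable names strict and loose; it compares the delimited test from the answer with an undelimited cols ~ $i. Note that in both cases the header text is used as a regular expression, so headers containing regex metacharacters would need extra care.

```shell
result=$(printf '%s\n' 'EMAIL,NAME,LAST_NAME,LOCATION' |
awk -F, -v cols='LAST_NAME,LOCATION' '
{
    strict = loose = ""
    for (i = 1; i <= NF; i++) {
        # delimited: ",LAST_NAME,LOCATION," must contain ",<header>,"
        if (FS cols FS ~ FS $i FS) strict = (strict == "" ? "" : strict " ") $i
        # undelimited: any header that is a substring of cols matches
        if (cols ~ $i)             loose  = (loose  == "" ? "" : loose  " ") $i
    }
    print "delimited: " strict
    print "undelimited: " loose
}')

printf '%s\n' "$result"
```

The undelimited form also selects NAME, because NAME is a substring of LAST_NAME; the delimited form selects only LAST_NAME and LOCATION.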
