Data:

EMAIL,NAME,KEY,LOCATION
[email protected],Joe,ABC,Denver
[email protected],Jane,EFD,Denver
...

Overall goal: a script that takes the names of the fields I care about and produces one file per unique combination of values in those columns. E.g.:

myScript.sh NAME LOCATION

Produces:

Joe_Denver.csv - contains all lines with "Joe" and "Denver" in the NAME and LOCATION columns
Jane_Denver.csv - contains all lines with "Jane" and "Denver" in the NAME and LOCATION columns

What I have so far:

  • Bash script that takes in some arbitrary number of fields and stores it in an array
  • Finds the column index numbers of the fields and stores that in an array

I'm trying to:

  • use AWK to take in the array of indexes, print every unique combination of values in the fields I specified, and store those combinations in an array
  • iterate through that array of combinations, writing one file per combination that contains all lines in the data that have those values in those columns

My AWK command for the 1st step would look something like:

awk -F, -v colIdxs="${bashIdxs[*]}" '!seen[$colIdxs[*]]++ {print $colIdxs[*]}'

That is, I'm hoping to use the indexes stored in bashIdxs as column indexes inside an awk script (where bashIdxs can be of arbitrary size).

How would this be done? In addition, if there's a better way to accomplish what I'm trying to do (I'm sure there is), I'd love to know out of curiosity as well.

2 Answers

Untested but will be close if not exactly right:

colNames="$*"
awk -v colNames="$colNames" '
BEGIN {
    split(colNames,tmp)
    for (i in tmp) {
        names[tmp[i]]
    }
    FS=OFS=","
}
NR==1 {
    for (i=1; i<=NF; i++) {
        if ($i in names) {
            f[++nf] = i    # store the field number, since $(f[i]) is used below
        }
    }
    hdr = $0
    next
}
{
    out = ""
    for (i=1; i<=nf; i++) {
        out = (out=="" ? "" : out "_") $(f[i])
    }
    out = out ".csv"
    if ( !seen[out]++ ) {
        print hdr > out
    }
    print > out
}
' file

You'll need to change print > out to print >> out; close(out) if you aren't using GNU awk and get a "too many open files" error.
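
For reference, here is a minimal sketch of what that close() variant might look like when written out in full. It is an illustration under assumptions, not a tested drop-in: the file name data.csv and the example.com addresses are made-up sample data, and the column names are hard-coded instead of being taken from "$*". The header is written with > so that an output file left over from an earlier run is truncated, and every later write appends with >>.

```shell
#!/bin/sh
# Made-up sample data (the addresses are placeholders); written to the
# current directory, as are the per-combination output files.
cat > data.csv <<'EOF'
EMAIL,NAME,KEY,LOCATION
joe@example.com,Joe,ABC,Denver
jane@example.com,Jane,EFD,Denver
joe2@example.com,Joe,XYZ,Denver
EOF

awk -v colNames="NAME LOCATION" '
BEGIN {
    split(colNames, tmp, " ")
    for (i in tmp) names[tmp[i]]      # set of wanted column names
    FS = OFS = ","
}
NR == 1 {
    for (i = 1; i <= NF; i++)
        if ($i in names) f[++nf] = i  # remember the field NUMBER
    hdr = $0
    next
}
{
    out = ""
    for (i = 1; i <= nf; i++)
        out = (out == "" ? "" : out "_") $(f[i])
    out = out ".csv"
    if (!seen[out]++) {
        print hdr > out               # first use: create or truncate
        close(out)
    }
    print >> out                      # later writes: append
    close(out)                        # keep at most one file open
}
' data.csv

head ./*_*.csv
```

With the sample data above this leaves Joe_Denver.csv holding the header plus the two Joe lines and Jane_Denver.csv holding the header plus the one Jane line. Closing after every write costs an open and a close per input line, so it is only worth doing when the open-file limit is actually hit.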


8 Comments

Thanks! Would you mind annotating what some of these lines mean? I'm new to AWK and can't really follow along. In particular: what does names[tmp[i]] do? Later on, in if ($i in names) {, $i is a column number; does that mean names[tmp[i]] stores all column numbers? Does f[++nf] = $i; create an array of stored column numbers? And what does if ( !seen[out]++ ) { print hdr > out } do?
Sorry I don't have time to do that but if you try to follow it and read the man pages and THEN have any specific questions about any parts of it I'll be happy to answer them.
Sorry about that, hit the "enter" button by mistake, now it won't let me edit after 5 min...
names[tmp[i]] populates names[] indexed by the values stored in tmp[], so names[] ends up holding the set of all name strings like "NAME". Yes, f[] is an array that maps the desired output column numbers to the input column numbers, so the value stored in it has to be the field number i rather than the field's contents $i. !seen[foo]++ is a common awk idiom for doing something only the first time foo occurs. In this case that something is printing the header line to the new output file.
Also: does print > out automatically print $0?
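
The idioms discussed in these comments can be tried in isolation. The following is a small sketch with made-up input values (the variable names first_seen, membership and bare_print are only for illustration); it also covers the last question, since print with no arguments prints $0 whether or not a redirection follows it.

```shell
#!/bin/sh
# 1. !seen[x]++ is true only the first time x occurs, so repeats are dropped.
first_seen=$(printf '%s\n' Denver Boulder Denver Boulder | awk '!seen[$0]++')
echo "$first_seen"

# 2. Merely referencing names[key] creates that element, which turns the
#    array into a set that can be tested with "in".
membership=$(awk -v colNames="NAME LOCATION" 'BEGIN {
    split(colNames, tmp, " ")
    for (i in tmp) names[tmp[i]]
    hasName = ("NAME" in names)
    hasKey  = ("KEY" in names)
    print hasName, hasKey
}')
echo "$membership"

# 3. print with no arguments prints the whole record ($0); a redirection
#    such as "> out" changes only where that record is written.
bare_print=$(printf 'x,y,z\n' | awk '{ print }')
echo "$bare_print"
```

This prints Denver and Boulder once each, then 1 0 (NAME is in the set, KEY is not), then x,y,z.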

awk to the rescue!

$ awk -F, -v cols='NAME,LOCATION' '
        NR==1 {for(i=1;i<=NF;i++) if(FS cols FS ~ FS $i FS) sel[i]; h=$0; next}
              {key=""; 
               for(i=1;i<=NF;i++) if(i in sel) key=(key==""?$i:key"_"$i); file=key".csv"; 
               if(!(key in header)) {print h > file; header[key]} 
               print > file}' file

gives

$ head *_*.csv
==> Jane_Denver.csv <==
EMAIL,NAME,KEY,LOCATION
[email protected],Jane,EFD,Denver

==> Joe_Denver.csv <==
EMAIL,NAME,KEY,LOCATION
[email protected],Joe,ABC,Denver

NB: if the number of unique keys in the input exceeds the number of files your OS lets a process keep open, you may need to close() each file after writing to it and append with >> instead of >.

5 Comments

Thanks! if(FS cols FS ~ FS $i FS) - what does this do? I realize it's a regular expression compare but where did "cols" come from? If I'm right, this compares it to ",colNumber," right?
cols='NAME,LOCATION' is the list of columns given as input. The test checks, for each column name in the file's header, whether it is one of the selected columns.
Stupid of me, I missed that part where the variable was set. Thanks!
Another question about if(FS cols FS ~ FS $i FS), suppose I had fields "NAME" and "LAST_NAME" but I only want to match "NAME", would the regex match only NAME? And what is the logic behind that?
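Yes, it matches only the names you provide. Both sides are wrapped in the field separator, so ,LAST_NAME, is not found inside ,NAME,LOCATION, while ,NAME, is. If you only want NAME, just put cols='NAME'; if there is more than one, provide a comma-separated list as in the example.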
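
The NAME versus LAST_NAME case can be checked directly. The sketch below uses a made-up header line and prints which columns the test selects. It also shows one caveat that is my own addition, not something raised in the thread: because $i is used as a dynamic regular expression, a header name containing a metacharacter such as . can match more than intended, and a literal comparison with index() is one way around that.

```shell
#!/bin/sh
# Which columns of a hypothetical header does the test select?
selected=$(printf 'EMAIL,LAST_NAME,NAME,LOCATION\n' |
awk -F, -v cols='NAME,LOCATION' '
{
    for (i = 1; i <= NF; i++)
        if (FS cols FS ~ FS $i FS)    # ",NAME,LOCATION," ~ ",<name>,"
            printf "%d:%s\n", i, $i
}')
echo "$selected"

# Caveat: the right-hand side is treated as a regex, so the "." in the
# header name A.B also matches the X in AXB. index() compares literally.
literal=$(printf 'A.B,AXB\n' |
awk -F, -v cols='AXB' '
{
    for (i = 1; i <= NF; i++) {
        if (FS cols FS ~ FS $i FS)       print "regex matched column " i
        if (index(FS cols FS, FS $i FS)) print "index matched column " i
    }
}')
echo "$literal"
```

The first command prints 3:NAME and 4:LOCATION, so LAST_NAME is skipped. The second reports that the regex test matched columns 1 and 2 while index() matched only column 2.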
