I'm trying to figure out how to write a .awk script that takes a .csv file as input and outputs it without commas and with the columns aligned. So far I've tried this:

{ printf "%-10s %s\n", $1, $2, $3 ,$4 }

But this only outputs the first two fields, aligned. It does a good job of removing the comma delimiters, but there are commas inside double quotes in the fourth column, and I wonder whether they will cause an issue. Any guidance is much appreciated; I'm very new to awk.

Sample input is like:

Name,Last Name,Gender,Pet
Kit,Rattenberie,Male,"Crake, african black"
Cliff,Lakes,Male,"Red phalarope"
Tirrell,Stables,Male,"Rhea, greater"
Cherry,William,Female,"Crow, house"

Desired output will be something like:

Name    Last Name    Gender   Pet
Kit     Rattenberie  Male    "Crake, african black"
Cliff   Lakes        Male    "Red phalarope"
Tirrell Stables      Male    "Rhea, greater"
Cherry  William      Female  "Crow, house"

The .csv file has 10 rows. Thanks in advance.

  • Please add sample input as well. Commented Oct 11, 2022 at 20:55
  • Sure thing, just added. I'm not sure of the max width of each column, but it's not more than 25 characters per cell. The output delimiter would just be whatever number of spaces aligns the data. Commented Oct 11, 2022 at 21:01
  • Yeah, I guess my example wasn't ideal. Since the words are of varying lengths, the number of spaces would have to vary as well, I suppose. Commented Oct 11, 2022 at 21:10
  • @markp-fuso I edited the examples to be a bit more accurate Commented Oct 11, 2022 at 21:14
  • Parsing a general-purpose CSV with awk isn't trivial, even with GNU awk. What are the border cases? Are there fields with a newline or a double quote as part of the data? Also, don't you need to double-quote "Last Name" in the output? Commented Oct 11, 2022 at 23:31

3 Answers

Using Miller (mlr), we can transform the input data from CSV to "pretty print" format with a command line option:

mlr --c2p cat ./input
Name    Last Name   Gender Pet
Kit     Rattenberie Male   Crake, african black
Cliff   Lakes       Male   Red phalarope
Tirrell Stables     Male   Rhea, greater
Cherry  William     Female Crow, house

It drops the quotes, though. The --barred option is nice too:

mlr --c2p --barred cat ./input
+---------+-------------+--------+----------------------+
| Name    | Last Name   | Gender | Pet                  |
+---------+-------------+--------+----------------------+
| Kit     | Rattenberie | Male   | Crake, african black |
| Cliff   | Lakes       | Male   | Red phalarope        |
| Tirrell | Stables     | Male   | Rhea, greater        |
| Cherry  | William     | Female | Crow, house          |
+---------+-------------+--------+----------------------+

A more programmatic awk technique: keep track of the max width of each column while reading the input file, then use those widths to print the data at the end. This essentially re-implements column -t. It needs GNU awk for FPAT and for arrays of arrays:

awk -v FPAT='"[^"]*"|[^,]+' '
    {
        for (i=1; i<=NF; i++) {
            data[NR][i] = $i
            if (length($i) > maxw[i]) maxw[i] = length($i)
        }
    }
    END {
        for (i=1; i<=NR; i++) {
            for (j=1; j<=length(data[i]); j++) {
                printf "%-*s  ", maxw[j], data[i][j]
            }
            printf "\n"
        }
    }
' ./input
Name     Last Name    Gender  Pet
Kit      Rattenberie  Male    "Crake, african black"
Cliff    Lakes        Male    "Red phalarope"
Tirrell  Stables      Male    "Rhea, greater"
Cherry   William      Female  "Crow, house"
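
For readers new to FPAT, here is a minimal sketch of what it changes, using one row from the question. It assumes GNU awk is installed under the name gawk (an assumption about the environment, checked before use); the variable names line, by_fs and by_fpat are only for illustration.

```shell
line='Kit,Rattenberie,Male,"Crake, african black"'

# Splitting on commas with -F, cuts the quoted pet name in two: 5 fields.
by_fs=$(printf '%s\n' "$line" | awk -F, '{ print NF }')

# FPAT describes what a field looks like (a quoted string, or a run of
# non-commas) instead of what separates fields. It is a GNU awk feature,
# so this part only runs when gawk is available.
if command -v gawk >/dev/null 2>&1; then
    by_fpat=$(printf '%s\n' "$line" |
        gawk -v FPAT='"[^"]*"|[^,]+' '{ print NF ": " $NF }')
fi

echo "FS=,  gives $by_fs fields"
echo "FPAT  gives ${by_fpat:-nothing here, gawk is not installed}"
```

With gawk present, the second line should report 4 fields, the last being the whole quoted string including its comma.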

1 Comment

Aah this looks good but I'm supposed to be using a .awk script

With GNU awk, you can use this:

awk -v FPAT='"[^"]*"|[^,]+' '{
   for (i=1; i<=NF; ++i) $i = sprintf("%-12s", $i)} 1' file

Name     Last Name    Gender  Pet
Kit      Rattenberie  Male    "Crake, african black"
Cliff    Lakes        Male    "Red phalarope"
Tirrell  Stables      Male    "Rhea, greater"
Cherry   William      Female  "Crow, house"

Or, if the column widths are unpredictable, use this awk + column solution (it assumes no field contains a semicolon):

awk -v FPAT='"[^"]*"|[^,]+' -v OFS=';' '{$1=$1} 1' file |
column -s';' -t

Name     Last Name    Gender  Pet
Kit      Rattenberie  Male    "Crake, african black"
Cliff    Lakes        Male    "Red phalarope"
Tirrell  Stables      Male    "Rhea, greater"
Cherry   William      Female  "Crow, house"
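
The `{$1=$1}` in the command above may look like a no-op. As a small sketch of why it is there: awk only applies OFS when it rebuilds the record, and assigning to any field triggers that rebuild. The sample input a,b,c and the variable names are made up for illustration.

```shell
# No field is modified, so awk prints $0 exactly as read and OFS is ignored.
untouched=$(printf 'a,b,c\n' | awk -F, -v OFS=';' '1')

# Assigning a field (even to itself) rebuilds $0 from the fields joined by OFS.
rebuilt=$(printf 'a,b,c\n' | awk -F, -v OFS=';' '{ $1 = $1 } 1')

echo "$untouched"   # a,b,c
echo "$rebuilt"     # a;b;c
```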

If you want to create an awk script then use:

cat col.awk

BEGIN {
   FPAT="\"[^\"]*\"|[^,]+"
   OFS=";"
}
{$1 = $1}
1

Use it as:

awk -f col.awk file.csv | column -s';' -t

6 Comments

Hmm I'm running this from a .awk file so I'm getting some syntax errors, example run is like awk -F"," -f script.awk file.csv
What's output of awk -v FPAT='"[^"]*"|[^,]+' -v OFS=';' '{$1=$1} 1' file.csv | column -s';' -t ?
So I suppose I only put the FPAT='"[^"]*"|[^,]+' -v OFS=';' '{$1=$1} 1' in the BEGIN{} block in my script file and then apply column -t when invoking it? This brings errors though
check my updated answer to create a col.awk script and use it
If you want to declare FPAT in the BEGIN block, you can't use single quotes and have to escape doubles: BEGIN {FPAT = "\"[^\"]*\"|[^,]+"}

One awk idea using a *.awk script (per OP's comment), with awk determining the max width of each column:

$ cat script.awk
BEGIN { FPAT="\"[^\"]*\"|[^,]+" }                            # instead of parsing on field delimiter (via FS) ... parse on field format via (FPAT)
      { for (i=1;i<=NF;i++)
            w[i]= length($i) > w[i] ? length($i) : w[i]      # keep track of max width of each column
        lines[FNR]=$0                                        # save entire line
      }
END   { for (i=1;i<=FNR;i++) {                               # loop through each saved line
            n=patsplit(lines[i],a)                           # reparse based on FPAT, storing fields in array a[]
            for (j=1;j<n;j++)                                # loop through array entries ...
                printf "%-*s%s", w[j], a[j], OFS             # printing to stdout
            print a[n]                                       # print last field plus "\n"
        }
      }

Or using a multi-dimensional array to store the input thus eliminating the 2nd parsing (via patsplit()) of the input data:

$ cat script.awk
BEGIN { FPAT="\"[^\"]*\"|[^,]+" }
      { for (i=1;i<=NF;i++) {
            w[i]= length($i) > w[i] ? length($i) : w[i]
            fields[FNR][i]=$i
        }
      }
END   { for (i=1;i<=FNR;i++) {
            for (j=1;j<NF;j++)
                printf "%-*s%s", w[j], fields[i][j], OFS
            print fields[i][NF]
        }
      }

NOTES:

  • assumes entire file can fit into memory (via the awk/lines[] or awk/fields[][] array)
  • requires GNU awk for FPAT, patsplit() and multi-dimensional (array of arrays) support

Both of these generate:

$ awk -f script.awk file
Name    Last Name   Gender Pet
Kit     Rattenberie Male   "Crake, african black"
Cliff   Lakes       Male   "Red phalarope"
Tirrell Stables     Male   "Rhea, greater"
Cherry  William     Female "Crow, house"
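
If either note is a problem, one possible workaround is to read the file twice instead of storing it, and to split fields by hand instead of relying on FPAT. The following is only a sketch under those assumptions: split_csv() is a hypothetical helper written for this example (not an awk built-in), it does not handle escaped quotes or newlines inside fields, and the input must be a regular file so it can be named twice.

```shell
# Sample data from the question, written to a temporary file.
csv=$(mktemp)
cat > "$csv" <<'EOF'
Name,Last Name,Gender,Pet
Kit,Rattenberie,Male,"Crake, african black"
Cliff,Lakes,Male,"Red phalarope"
Tirrell,Stables,Male,"Rhea, greater"
Cherry,William,Female,"Crow, house"
EOF

# The file is named twice: pass 1 (NR == FNR) records the column widths,
# pass 2 prints every row padded to those widths. Nothing is kept in memory
# except one width per column.
aligned=$(awk '
# Hypothetical helper: split line into out[1..n], keeping a quoted field
# (quotes and embedded commas included) as a single element.
function split_csv(line, out,    n) {
    n = 0
    while (line != "") {
        if (!match(line, /^"[^"]*"/))   # try a quoted field first
            match(line, /^[^,]*/)       # otherwise take a run of non-commas
        out[++n] = substr(line, 1, RLENGTH)
        line = substr(line, RLENGTH + 1)
        sub(/^,/, "", line)             # drop the separator, if any
    }
    return n
}
{ n = split_csv($0, f) }
NR == FNR {                             # pass 1: track the max width per column
    for (i = 1; i <= n; i++)
        if (length(f[i]) > w[i]) w[i] = length(f[i])
    next
}
{                                       # pass 2: pad all but the last field
    for (i = 1; i < n; i++)
        printf("%-" w[i] "s ", f[i])    # width built into the format string
    print f[n]
}
' "$csv" "$csv")

printf '%s\n' "$aligned"
rm -f "$csv"
```

The width is concatenated into the format string rather than passed through %-*s, on the assumption that this is accepted by more awk implementations.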

1 Comment

Nice and tidy. The final printf could just be print a[n] -- we don't really care how long it is
