25

I'm working on a shell script that will be used by others, and may ingest suspect strings. It's based around awk, so as a basic resiliency measure, I want to have awk output null-terminated strings - the commands that will receive data from awk can thus avoid a certain amount of breakage from strings that contain spaces or not-often-found-in-English characters.

Unfortunately, from the basic awk documentation, I'm not getting how to tell awk to print a string terminated by an ASCII null instead of by a newline. How can I tell awk that I want null-terminated strings?


Versions of awk that might be used:

[user@server1]$ awk --version
awk version 20070501

[user@server2]$ awk -W version
mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan

[user@server3]$ awk -W version
GNU Awk 3.1.7

So pretty much the whole family of awk versions. If we have to consolidate on a version, it'll probably be GNU Awk, but answers for all versions are welcome since I might have to make it work across all of these awks. Oh, legacy scripts.

4
  • Best guide I've found so far: sandrotosi.blogspot.com/2011/09/… - but that's not quite a full answer, and also a random blogspot blog has less SEO juice than SO, so a good SO answer will be useful to more people. Commented Feb 3, 2012 at 18:06
  • Kevin: Want to make that into an answer? Commented Feb 3, 2012 at 18:16
  • Sorry, that uses \0 as the input separator. I'm having trouble getting awk to use it as the output separator. Commented Feb 3, 2012 at 18:37
  • Right, because FS and ORS are different. Commented Feb 3, 2012 at 18:43

4 Answers

28

There are three alternatives:

  1. Setting ORS to ASCII zero. Other answers use awk -vORS=$'\0', but $'\0' is a construct specific to some shells (bash, zsh), so that command will not work in most older shells.
     The alternative is to write it as awk 'BEGIN { ORS = "\0" } ; { print $0 }', but that will not work with most awk versions.

  2. Printing (printf) with the character \0: awk '{printf( "%s\0", $0)}'

  3. Printing ASCII 0 directly: awk '{ printf( "%s%c", $0, 0 )}'

Testing all alternatives with this code:

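#!/bin/bash

# Run the awk program given in $1 under each awk implementation
# and dump the raw output bytes with od.
test1(){
    a='awk,mawk,original-awk,busybox awk'
    # Split the comma-separated list into an array of commands.
    IFS=',' read -ra line <<<"$a"
    for i in "${line[@]}"; do
        # Print the implementation name and the program under test.
        printf "%14.12s %40s" "$i" "$1"
        # $i is left unquoted on purpose so "busybox awk" splits into two words.
        echo -ne "a\nb\nc\n" |
        $i "$1"|
        od -cAn;
    done
}

# Baseline (plain newline output), disabled:
#test1 '{print}'
test1 'BEGIN { ORS = "\0" } ; { print $0 }'
test1 '{ printf "%s\0", $0}'
test1 '{ printf( "%s%c", $0, 0 )}'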

We get these results:

            awk      BEGIN { ORS = "\0" } ; { print $0 }   a  \0   b  \0   c  \0
           mawk      BEGIN { ORS = "\0" } ; { print $0 }   a   b   c
   original-awk      BEGIN { ORS = "\0" } ; { print $0 }   a   b   c
    busybox awk      BEGIN { ORS = "\0" } ; { print $0 }   a   b   c
            awk                     { printf "%s\0", $0}   a  \0   b  \0   c  \0
           mawk                     { printf "%s\0", $0}   a   b   c
   original-awk                     { printf "%s\0", $0}   a   b   c
    busybox awk                     { printf "%s\0", $0}   a   b   c
            awk               { printf( "%s%c", $0, 0 )}   a  \0   b  \0   c  \0
           mawk               { printf( "%s%c", $0, 0 )}   a  \0   b  \0   c  \0
   original-awk               { printf( "%s%c", $0, 0 )}   a  \0   b  \0   c  \0
    busybox awk               { printf( "%s%c", $0, 0 )}   a   b   c

As can be seen above, the first two solutions work only in GNU Awk.

The most portable is the third solution: '{ printf( "%s%c", $0, 0 )}'.

No solution works correctly in "busybox awk".

The versions used for these tests were:

          awk> GNU Awk 4.0.1
         mawk> mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan
 original-awk> awk version 20110810
      busybox> BusyBox v1.20.2 (Debian 1:1.20.0-7) multi-call binary.
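As a minimal sketch of how the %c form might be used in a pipeline, the snippet below wraps it in a helper and hands the records to xargs -0. The function name nul_lines and the sample input are hypothetical, and it assumes an awk that passes the NUL through (which, per the table above, BusyBox 1.20.2 awk did not) and an xargs that supports -0.

```shell
#!/bin/sh
# Hypothetical helper: emit each input line followed by a NUL byte
# instead of a newline, using the %c form from alternative 3.
nul_lines() {
    awk '{ printf("%s%c", $0, 0) }'
}

# Records containing spaces or quotes should reach xargs as single arguments.
printf "one two\nit's here\nthree\n" |
    nul_lines |
    xargs -0 -n 1 printf '[%s]\n'
```

With an awk that really emits the NUL bytes, each record should come out on its own bracketed line, including the one with the embedded single quote.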

2 Comments

Many blessings on you for specifying the versions that you used! The problem that inspired this question has long since become Not Mine, but it does my heart good to see people leaving helpful, diligent answers. Well done.
Thank you, the %c option was just what I was looking for. It's perfect that it doesn't depend on the current shell's escaping magic.
22

Alright, I've got it.

awk '{printf "%s\0", $0}'

Or, using ORS,

awk -vORS=$'\0' //
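Whether a given awk actually emits the NUL byte seems to vary by implementation, so it may be worth probing before relying on it. The following is only a sketch: the function name awk_emits_nul is hypothetical, it takes a single command name (default awk), and it counts NUL bytes in the output of the printf "%c" form rather than the two commands above.

```shell
#!/bin/sh
# Hypothetical probe: succeed only if the given awk command (default: awk)
# writes exactly one NUL byte for one input line.
awk_emits_nul() {
    n=$(printf 'x\n' |
        "${1:-awk}" '{ printf("%s%c", $0, 0) }' |
        tr -cd '\000' |   # keep only NUL bytes
        wc -c)            # and count them
    [ "$n" -eq 1 ]
}

if awk_emits_nul; then
    echo "awk can emit NUL"
else
    echo "awk cannot emit NUL; convert with tr instead"
fi
```

The same byte count can be applied to any other awk invocation by swapping the program text inside the function.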

8 Comments

When I pipe the results of those incantations into xargs -0, it doesn't split on the \0 that awk is inserting (tested by splitting on something else). :(
@SeanM The first seems not to work, but the second is working for me, are you quite sure the problem is in awk? (try saving the output from just that to a file)
You can check awk's actual output by piping to od -cAn. I found that gawk would output the NUL bytes, but BusyBox awk and nawk on FreeBSD wouldn't. The sandrotosi.blogspot.com technique of printf "%c","" didn't work on those implementations either.
I had to use double-quotes for the -vORS argument awk -vORS=$"\0". This was with gawk 4.0.1.
-v isn't supported by BSD awk, e.g. the one in OSX. Inserting \0 into a string doesn't work in it either; it's treated as the end of the string instead.
8

You can also pipe your awk's output through tr:

awk '{...code...}' infile | tr '\n' '\0' > outfile

Just tested, it works at least on Linux and FreeBSD.

If you cannot use newlines as separators (for example, if output records can contain newlines inside), just use some other character that's guaranteed not to appear inside a record, e.g. the one with code 1:

awk 'BEGIN { ORS="\001" } {...code...}' | tr '\001' '\0'
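A small sketch of the tr approach with a concrete awk program, in case it helps to see it end to end. The helper name upper_nul and the toupper() program are placeholders for whatever the real script does, and, as noted above, this form assumes records never contain newlines themselves.

```shell
#!/bin/sh
# Hypothetical helper: awk prints ordinary newline-terminated records,
# and tr rewrites each newline as a NUL byte afterwards.
upper_nul() {
    awk '{ print toupper($0) }' | tr '\n' '\0'
}

# Dump the raw bytes to confirm the separators really are NUL.
printf 'first item\nsecond\n' | upper_nul | od -An -c
```

Because awk never has to handle the NUL byte itself, this does not depend on how a particular awk treats \0 in strings.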

3 Comments

From what I've seen, this is the most portable and reliable answer. tr '\n' '\0' even works in busybox (unlike any use of null characters in busybox's awk). Rather than using \001 (Start of Heading), I recommend \036 (U+001e, Information Separator Two, a.k.a. Record Separator, RS) since the information separators are made for this purpose. (#2/RS maps to lines (awk's default ORS) while #1, Unit Separator, would be akin to awk's FS.) More at en.wikipedia.org/wiki/Delimiter#ASCII_delimited_text
Since UNIX paths can contain any bytes except \0, you are not doing it right if you use anything else, even if you replace it with \0 afterwards: any inline bytes with the same code would be replaced, too.
What Ivan said: \0 will also allow you to post-process your lines w/ e.g. xargs, which will fail if there's an embedded single quote in the line and it's not null-terminated. Adam's suggestion is a poor one.
-1

I've solved printing ASCII 0 from awk by calling the UNIX command printf "\000" through system():

echo | awk -v s='printf "\000"' '{system(s);}'
