When I compile this code using different compilers and inspect the output in a hex editor I am expecting to find the string "Nancy" somewhere.

#include <stdio.h>

int main()
{
    char temp[6] = "Nancy";
    printf("%s", temp);

    return 0;
}
  1. The output file for gcc -o main main.c looks like this:

    [image: hex editor view of the gcc output]

  2. In the output for g++ -o main main.c, I can't seem to find "Nancy" anywhere.

  3. Compiling the same code in visual studio (MSVC 1929) I see the full string in a hex editor:

Why do I get some random bytes in the middle of the string in (1)?

Comments

  • It will be illuminating to look at the assembly code. (Commented Apr 19, 2022 at 22:52)
  • Note that you will always get "Nancy" verbatim in the object file if the char temp[6] is outside a function, so that it gets allocated statically instead of on the stack. The same applies if you make it static char temp[6] inside a function, though that could be subject to compiler optimizations. (Commented Apr 20, 2022 at 9:59)
  • "I am expecting to find the string "Nancy" somewhere." That seemingly makes sense. Except it doesn't, because programming languages are defined in an "as-if" fashion. The program should work in a certain way, but if you don't push it to act in a particular way, the compiler is free to do other things to optimize it. Here, you never directly accessed the contents of the string. printf is an intrinsic and the compiler optimizes it in a special fashion; it's not literally a call of a function called printf, even though such a function exists from the programmer's perspective. (Commented Apr 20, 2022 at 12:38)
  • It's also possible that the output file has been compressed in some way. (Commented Apr 20, 2022 at 19:42)
  • Another episode of "Programmers discover programming language constructs are abstractions." :D (Commented Apr 21, 2022 at 19:11)

3 Answers

There is no single rule about how a compiler stores data in the output files it produces.

Data can be stored in a “constant” section.

Data can be built into the “immediate” operands of instructions, in which data is encoded in various fields of the bits that encode an instruction.

Data can be computed from other data by instructions generated by the compiler.

I suspect that in the case where you see “Nanc” in one place and “y” in another, the compiler is using a load instruction (which may be written as “mov”) that carries the bytes forming “Nanc” as an immediate operand, and another load instruction that carries the bytes forming “y” with a trailing null character, along with other instructions to store the loaded data on the stack and pass its address to printf.

You have not provided enough information to diagnose the g++ case: You did not name the compiler or its version number or provide any part of the generated output.

8 Comments

Yes, this is confirmed via Godbolt: godbolt.org/z/Ph9cnrKEh. I am not sure how safe what MSVC is doing is, though, since I assume "Nancy" there is in some read-only constant section, making it potentially harmful to modify, but I might be mistaken.
@Lala5th It's presumably being copied from there into the local array, it's not the array itself.
@Lala5th The compiler could also detect that temp is never modified, so it can be treated as a constant.
@Lala5th That's the difference between char temp[] = "Nancy" (writable local array initialized with a copy of that read-only string literal) and char *temp = "Nancy" (pointer set to point at the read-only string literal, no copy made). temp[0] = 'D' is legal in the former case but not in the latter.
@Eric: It would have been better for the question to name versions for GCC and G++, but they did name the compiler: it's g++. On Godbolt, all versions of g++ except for the oldest (4.1) materialize the local non-const array with immediates, when compiling for x86-64 at any optimization level. (And the question did show a complete command, so we know optimization level was the default -O0.) godbolt.org/z/s5TPfxn8n shows C vs. C++ mode (-xc vs. -xc++; same code-gen, unsurprisingly.) Seems always a dword and word store, with the dword holding the first 4 bytes.

I reproduced it using gcc 9.3.0 (Linux Mint 20.2) on an x86-64 system (Intel

Result of hexdump -C:

[image: hexdump -C output]

Note the byte sequence is the same.

So I ran gcc -S -c:

    .file   "teststr.c"
    .text
    .section    .rodata
.LC0:
    .string "%s"
    .text
    .globl  main
    .type   main, @function
main:
.LFB0:
    .cfi_startproc
    endbr64
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    subq    $16, %rsp
    movq    %fs:40, %rax
    movq    %rax, -8(%rbp)
    xorl    %eax, %eax
    movl    $1668178254, -14(%rbp) # NOTE THIS PART HERE
    movw    $121, -10(%rbp)        # AND HERE
    leaq    -14(%rbp), %rax
    movq    %rax, %rsi
    leaq    .LC0(%rip), %rdi
    movl    $0, %eax
    call    printf@PLT
    movl    $0, %eax
    movq    -8(%rbp), %rdx
    xorq    %fs:40, %rdx
    je  .L3
    call    __stack_chk_fail@PLT
.L3:
    leave
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE0:
    .size   main, .-main
    .ident  "GCC: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0"
    .section    .note.GNU-stack,"",@progbits
    .section    .note.gnu.property,"a"
    .align 8
    .long    1f - 0f
    .long    4f - 1f
    .long    5
0:
    .string  "GNU"
1:
    .align 8
    .long    0xc0000002
    .long    3f - 2f
2:
    .long    0x3
3:
    .align 8
4:

The highlighted value 1668178254 is hex 636E614E or "cnaN" (which, due to the endian reversal as x86 is a little-endian system, becomes "Nanc") in ASCII encoding, and 121 is hex 79, or "y".

So, because the string is short, the compiler uses two move instructions instead of a loop that copies from a byte string stored in a data section of the file, and the intervening "garbage" is (I believe) the following movw instruction. This is likely a way to optimize the initialization compared with looping byte by byte through memory, even though no optimization flag was "officially" given to the compiler. That's the thing: the compiler can do what it wants in this regard. Microsoft's compiler, then, seems to be more "pedantic" in how it compiles, because it apparently forgoes that optimization in favor of keeping the string contiguous.

4 Comments

"is (I believe) the following movw instruction" — no need for "belief", it's definite: 66 C7 45 F6 79 00 is exactly mov word [rbp-10], 0x0079.
@Ruslan yeah, thanks for confirming; I'm just not excellent at parsing x86 opcodes. I did later confirm it with objdump.
Yeah, this is what tools like objdump are for. godbolt.org/z/nEfv98MbE even has a "binary mode" where it compiles+assembles and shows you disassembly along with the machine code. (see also my comments on Eric's answer for why GCC does mov eax, 'y' to avoid LCP stalls with optimization enabled). I wouldn't waste my time looking in a raw hexdump of the whole binary and trying to remember numeric opcodes. Normally all you need to know for optimization is opcode and prefix lengths, although I do remember some common ones like B8..F
Usually the kinds of optimizations you get with optimization off just depend more on the internal structure of the compiler than on the compiler writers being "pedantic" or not. Both are perfectly valid ways to put chars on the stack, but I could guess that perhaps gcc creates a picture of what it wants the stack to look like, then makes it look that way, while MSVC creates a picture of the instructions that put the data on the stack, then optimizes them later (if enabled)

Generally a compiled program is split into different types of "sections". The assembler file will use directives to switch between them.

  • Code (".text")
  • Static read-only data (".section .rodata")
  • Initialised global or static variables (".data")
  • Uninitialised (or zero-initialized) global or static variables (".bss")

String literals in C can be used in two different ways.

  • As a pointer to constant data.
  • As an initialiser for an array.

If a string literal is used as a pointer then it is likely the compiler will place the string data in the read only data section.

If a string literal is used to initialise a global/static array then it is likely the compiler will place the array in the initialised data section (or the read-only data section if the array is declared as const).

However in your case the array you are initialising is an automatic local variable. So it can't be pre-initialised before program start. The compiler must include code to initialise it each time your function runs.

The compiler might choose to do that by storing the string in a read-only data location and then using a copy routine (either inlined or a call) to copy it to the local array. It may choose to simply generate instructions to set the elements of the array one by one. It may choose to generate instructions that set several array elements at the same time.

In your example it looks like MSVC has chosen to use a copy routine, so the string appears sequentially in the file. gcc, on the other hand, has chosen to use a 4-byte move instruction followed by a 2-byte move instruction, both with literals as inputs. So the literal is split up into two parts.

P.S. I've noticed some people posting https://godbolt.org/ links on other answers to this question. The Compiler Explorer is a useful tool, but be aware that it hides the section-switching directives from the assembler output by default.
