
I am writing a socket client-server application where the server needs to send a large buffer to a client, and each buffer should be processed separately, so I want to put the buffer length at the start of the buffer so that the client can read the length of the data and process it accordingly.

To transmit the length value I need to split an integer into individual bytes and store them in the buffer that is sent over the socket. I am able to break the integer into four parts, but when joining them I am not able to recover the correct value. To demonstrate the problem I have written a sample program that splits an int into four char variables and then joins them back into another integer. After joining I should get the same value back.

Here is my small program.

#include <stdio.h>

int main ()
{
    int inVal = 0, outVal =0;
    char buf[5] = {0};

    inVal = 67502978;

    printf ("inVal: %d\n", inVal);

    buf[0] = inVal & 0xff;
    buf[1] = (inVal >> 8) & 0xff;
    buf[2] = (inVal >> 16) & 0xff;
    buf[3] = (inVal >> 24) & 0xff;

    outVal = buf[3];
    outVal = outVal << 8;
    outVal |= buf[2];
    outVal = outVal << 8;
    outVal |= buf[1];
    outVal = outVal << 8;
    outVal |= buf[0];

    printf ("outVal: %d\n",outVal);
    return 0;
}

Output

inVal: 67502978
outVal: -126

What am I doing wrong?


7 Answers


One problem is that you are using bit-wise operators on signed numbers. This is always a bad idea and almost always incorrect. Please note that char has implementation-defined signedness, unlike int which is always signed.

Therefore you should replace int with uint32_t and char with uint8_t. With such unsigned types you eliminate the possibility of using bit shifts on negative numbers, which would be a bug. Similarly, if you shift data into the sign bit of a signed number, you will get bugs.

And needless to say, the code will not work if integers are not 4 bytes large.
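As a minimal sketch of that suggestion (assuming <stdint.h> is available and keeping the little-endian byte order from the question), the pack/unpack could look like this:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t inVal = 67502978u, outVal;
    uint8_t buf[4] = {0};

    /* Pack: least significant byte first, as in the question. */
    buf[0] = inVal & 0xffu;
    buf[1] = (inVal >> 8) & 0xffu;
    buf[2] = (inVal >> 16) & 0xffu;
    buf[3] = (inVal >> 24) & 0xffu;

    /* Unpack: widen each byte to uint32_t before shifting, so no
       signed arithmetic (and no possible sign extension) is involved. */
    outVal = (uint32_t)buf[0]
           | ((uint32_t)buf[1] << 8)
           | ((uint32_t)buf[2] << 16)
           | ((uint32_t)buf[3] << 24);

    printf("inVal:  %u\noutVal: %u\n", inVal, outVal);
    return 0;
}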


4 Comments

@LPs My thinking too. The general algorithmic idea is, if I'm not mistaken, endianness-independent because the bit shift operator is endianness-ignorant (or -invariant, or whatever: left shift is always multiplying, right-shift dividing). This is the way to go in order to implement ntohl and friends without looking at the architecture's endianness.
I don't understand "Should you use bit shifts on a number that is negative, you might get bugs." How can he have negative numbers if he changes the types to uint32_t and uint8_t?
@gurka The point is to use unsigned types of specified width to prevent such bugs. Since you misunderstood, the answer is apparently not optimally worded.
Unfortunately, C is a very inconsistent and irrational language. What's not mentioned here is that even if you use uint8_t, you could end up with negative numbers, because of the integer promotion rules in C. For example ~uint8 << n always invokes undefined behavior, unlike ~uint32 << n which is always perfectly fine (on a 32-bit system). This is because the former is equivalent to ~(int)uint8 << n. The key to avoiding these subtle but disastrous bugs is to learn about the various forms of implicit type promotion in C.
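To make the promotion pitfall from the last comment concrete, here is a small sketch (the constant 0x82 is just an example; it mirrors the low byte of 67502978):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint8_t b = 0x82;   /* bit 7 set, like the low byte of 67502978 */

    /* ~b first promotes b to (signed) int, so ~b is a negative int here;
       left-shifting that negative value would be undefined behavior.
       Casting back to an unsigned type before the shift keeps it defined. */
    uint32_t ok = (uint32_t)(uint8_t)~b << 8;

    printf("%u\n", ok);   /* 32000: (uint8_t)~0x82 is 0x7d, shifted left by 8 */
    return 0;
}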

Your method has potential implementation defined behavior as well as undefined behavior:

  • storing values into the array of type char beyond the range of type char has implementation defined behavior: buf[0] = inVal & 0xff; and the next 3 statements (inVal & 0xff might be larger than CHAR_MAX if char type is signed by default).

  • left shifting negative values invokes undefined behavior: if any of the 3 first bytes in the array becomes negative as the implementation defined result of storing a value larger than CHAR_MAX into it, the resulting outVal becomes negative, left shifting it is undefined.

In your specific example, your architecture uses 2's complement representation for negative values and the type char is signed. The value stored into buf[0] is 67502978 & 0xff = 130, which becomes -126. The last statement outVal |= buf[0]; sets bits 7 through 31 of outVal and the result is -126.

You can avoid these issues by using an array of unsigned char and values of type unsigned int:

#include <stdio.h>

int main(void) {
    unsigned int inVal = 0, outVal = 0;
    unsigned char buf[4] = { 0 };

    inVal = 67502978;

    printf("inVal: %u\n", inVal);

    buf[0] = inVal & 0xff;
    buf[1] = (inVal >> 8) & 0xff;
    buf[2] = (inVal >> 16) & 0xff;
    buf[3] = (inVal >> 24) & 0xff;

    outVal = buf[3];
    outVal <<= 8;
    outVal |= buf[2];
    outVal <<= 8;
    outVal |= buf[1];
    outVal <<= 8;
    outVal |= buf[0];

    printf("outVal: %u\n", outVal);
    return 0;
}

Note that the above code still assumes 32-bit ints.
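If the 32-bit assumption is a concern, the same logic can be written against exact-width types; a sketch (the helper names pack_u32 and unpack_u32 are made up for illustration, and the byte order is the little-endian layout used above):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical helpers: little-endian byte order on the wire, as above. */
static void pack_u32(uint8_t buf[4], uint32_t v)
{
    buf[0] = (uint8_t)(v & 0xffu);
    buf[1] = (uint8_t)((v >> 8) & 0xffu);
    buf[2] = (uint8_t)((v >> 16) & 0xffu);
    buf[3] = (uint8_t)((v >> 24) & 0xffu);
}

static uint32_t unpack_u32(const uint8_t buf[4])
{
    return (uint32_t)buf[0]
         | ((uint32_t)buf[1] << 8)
         | ((uint32_t)buf[2] << 16)
         | ((uint32_t)buf[3] << 24);
}

int main(void)
{
    uint8_t buf[4];
    pack_u32(buf, 67502978u);
    printf("%u\n", unpack_u32(buf));   /* 67502978 */
    return 0;
}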

5 Comments

Actually, it doesn't assume 8-bit bytes. It assumes that unsigned char can hold at least 8 bits (which the standard guarantees). (For the calculation of outval it also assumes that buf only contains valid values in the range 0..255.)
The program does not have undefined behavior. It also does not have implementation defined behavior since all values stored in the chars are masked by 0xff and thus representable in a char which is guaranteed to have at least 8 bits.
@PeterA.Schneider: it does have implementation defined behavior in C: 6.3.1.3 Signed and unsigned integers: When a value with integer type is converted to another integer type other than _Bool, if the value can be represented by the new type, it is unchanged. 2 Otherwise, if the new type is unsigned, ... 3 Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised. The values masked by 0xff may exceed the range of type char, and 130 indeed does.
@PeterA.Schneider: The program does not invoke undefined behavior for the specific value inVal = 67502978, but the method used does for many other values, such as inVal = 32768.
For the avoidance of doubt, use std types with fixed widths uint8_t and uint32_t for example. As an embedded coder I only ever use these types (never just int for example) as I've learned that assumption will bite you eventually.

While bit shifts of signed values can be a problem, this is not the case here (all left hand values are positive, and all results are within the range of a 32 bit unsigned int).

The problematic expression with somewhat unintuitive semantics is the last bitwise OR:

outVal |= buf[0];

buf[0] is a (on your and my architecture) signed char with the value -126, simply because the most significant bit in the least significant byte of 67502978 is set. In C all operands in an arithmetic expression are subject to the arithmetic conversions. Specifically, they undergo integer promotion which states: "If an int can represent all values of the original type [...], the value is converted to an int". Accordingly, the signed character buf[0] is converted to a (signed) int, preserving its value of -126. A negative signed int has the sign bit set. ORing that with another signed int sets the result's sign bit as well, making that value negative. That is exactly what we are seeing.

Making the bytes unsigned chars fixes the issue because the value of the temporary integer to which the unsigned char is converted is then a simple 8 bit value of 130.
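A tiny sketch of that promotion in isolation (it assumes a signed, 8-bit char and two's complement, as on the asker's platform):

#include <stdio.h>

int main(void)
{
    char c = (char)0x82;      /* -126 with a signed, 8-bit char        */
    int promoted = c;         /* integer promotion sign-extends: -126  */
    int masked = c & 0xff;    /* masking the promoted value gives 130  */

    printf("promoted: %d\n", promoted);   /* -126 */
    printf("masked:   %d\n", masked);     /* 130  */
    return 0;
}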



The C++ standard (N3936) says this about the shift operators:

The value of E1 << E2 is E1 left-shifted E2 bit positions; vacated bits are zero-filled. If E1 has an unsigned type, the value of the result is E1 × 2^E2, reduced modulo one more than the maximum value representable in the result type. Otherwise, if E1 has a signed type and non-negative value, and E1 × 2^E2 is representable in the corresponding unsigned type of the result type, then that value, converted to the result type, is the resulting value; otherwise, the behavior is undefined.

So, to avoid undefined behaviour, it is recommended to use unsigned data types and to make sure the data type is wide enough for the values involved (for example, a 64-bit type).
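For example (a sketch assuming a 32-bit int), shifting a bit into the sign position is only well defined on an unsigned operand:

#include <stdio.h>

int main(void)
{
    unsigned int u = 1u << 31;    /* well defined: 2147483648                 */
    /* int s = 1 << 31; */        /* undefined in C (and before C++14):
                                     1 * 2^31 is not representable in a
                                     32-bit signed int                        */

    printf("%u\n", u);
    return 0;
}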

1 Comment

Hm. I see the following values for outval before each shift (32-bit little-endian ints): 4; 1030; 263683. None of them are negative, and the results are all within range, so they are all well defined. The value after the last shift is 67502848. It's the last OR that creates the negative value.

Use unsigned char buf[5] = {0}; and unsigned int for inVal and outVal, and it should work.

When using signed integral types, there arise two sorts of problems:

First, if buf[3] is negative, then outVal = buf[3] makes outVal negative; the subsequent bit shifts on outVal are then undefined behaviour. From cppreference.com on the bit shift operators:

For signed and positive a, the value of a << b is a × 2^b if it is representable in the return type, otherwise the behavior is undefined (until C++14); the value of a << b is a × 2^b if it is representable in the unsigned version of the return type (which is then converted to signed: this makes it legal to create INT_MIN as 1 << 31), otherwise the behavior is undefined (since C++14).

For negative a, the behavior of a << b is undefined.

Note that with OP's inVal = 67502978 this does not occur, since buf[3] = 4; but for other inVal values it may occur and then cause problems due to undefined behaviour.

The second problem is the operation outVal |= buf[0] with buf[0] = -126: the value (char)-126, which is 10000010 in binary, is converted to (int)-126, which is 11111111111111111111111110000010 in binary, before operator |= is applied, and this fills outVal up with 1-bits. The reason for the conversion is the set of conversion rules for arithmetic operations (cppreference.com):

If both operands are signed or both are unsigned, the operand with lesser conversion rank is converted to the operand with the greater integer conversion rank

So the problem in OP's case is actually not caused by any undefined behaviour, but by the character buf[0] holding a negative value, which is converted to int before the |= operation.

Note, however, that if either buf[2] or buf[1] had been negative, this would have made outVal negative and would have led to undefined behaviour on the subsequent shift operations, too.
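If changing the buffer's element type is not convenient (for example because the bytes arrive in a plain char array from recv()), masking each promoted byte also avoids both problems; a sketch under the asker's signed-char, two's complement assumptions:

#include <stdio.h>

int main(void)
{
    /* 67502978 split little-endian, as in the question: 0x82, 0x03, 0x06, 0x04 */
    char buf[4] = { (char)0x82, 0x03, 0x06, 0x04 };
    unsigned int outVal;

    /* & 0xff strips the sign extension that integer promotion introduces,
       and the unsigned accumulator keeps every left shift well defined.  */
    outVal = (unsigned int)(buf[3] & 0xff);
    outVal = (outVal << 8) | (unsigned int)(buf[2] & 0xff);
    outVal = (outVal << 8) | (unsigned int)(buf[1] & 0xff);
    outVal = (outVal << 8) | (unsigned int)(buf[0] & 0xff);

    printf("%u\n", outVal);   /* 67502978 */
    return 0;
}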

11 Comments

Where do you see an overflowing integer operation? All shifts are done on (afaics non-overflowing) ints.
This is not an integer overflow, it is a direct manipulation of the sign bits. It is undefined behavior per C11 6.5.7/4: "The result of E1 << E2 is E1 left-shifted E2 bit positions;" /--/ "If E1 has a signed type and nonnegative value, and E1 × 2^E2 is representable in the result type, then that is the resulting value; otherwise, the behavior is undefined."
@Lundin: You are right, since there is a more specific specification of bit shift's behaviour. I actually thought it would also violate the more general overflow rule, but maybe I am wrong.
@Lundin All shifts are well defined, cf. my comment to Saurav's post. Stephan: The shifts are not the issue here.
@PeterA.Schneider - ignore my comment. It was wrong.

This may be a terrible idea but I'll post it here for interest - you can use a union:

union my_data
{
    uint32_t one_int;

    struct
    {
        uint8_t  byte3;
        uint8_t  byte2;
        uint8_t  byte1;
        uint8_t  byte0;
    }bytes;
};


// Your original code modified to use union my_data
#include <stdint.h>
#include <stdio.h>

int main(void) {
    union my_data data;
    uint32_t inVal = 0;
    uint8_t buf[4] = {0};

    inVal = 67502978;

    printf("inVal: %u\n", inVal);

    data.one_int = inVal;

    // Populate the bytes into buf
    buf[3] = data.bytes.byte3;
    buf[2] = data.bytes.byte2;
    buf[1] = data.bytes.byte1;
    buf[0] = data.bytes.byte0;

    return 0;
}

I don't know if this would also work, can't see why not:

union my_data
{
    uint32_t one_int;
    uint8_t  bytes[4];
};
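A quick usage sketch of that second union (with the caveat raised in the comments below: which byte comes out first depends on the host's endianness, so this is a host-side trick rather than a portable wire format):

#include <stdint.h>
#include <stdio.h>

union my_data
{
    uint32_t one_int;
    uint8_t  bytes[4];
};

int main(void)
{
    union my_data data;
    data.one_int = 67502978;   /* 0x04060382 */

    /* On a little-endian host this prints 82 03 06 04;
       on a big-endian host it prints 04 06 03 82.       */
    for (int i = 0; i < 4; i++)
        printf("%02x ", data.bytes[i]);
    printf("\n");
    return 0;
}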

5 Comments

+1 This is the best way of doing this. It is both non-portable and undefined behaviour, but so is anything that assumes you can translate between an int and four chars. I would combine with Jim D.'s answer to ensure endian-correctness, however.
Also, I'm pretty sure you've made an error in your code beginning with outVal = data.bytes.byte3, etc.
Thanks Jack, well spotted. I've updated code now to reflect the task at hand (getting bytes into buf from inVal).
That's precisely not what unions are for. Anyway, you forgot #pragma pack 1. Alternatively, uint8_t [4] is much safer than 4 independent fields because it avoids the packing issue.
I did say it's probably a terrible idea, so I was right.

Because of endian differences between architectures, it is best practice to convert numeric values to network order, which is big-endian. On receipt, they can then be converted to the native host order. We can do this in a portable way by using htonl() (host to network "long" = uint32_t), and convert to host order on receipt with ntohl(). Example:

#include <stdio.h>
#include <arpa/inet.h>

int main(int argc, char **argv) {
  uint32_t inval = 67502978, outval, backinval;

  outval = htonl(inval);                 /* byte-swap to network (big-endian) order */
  printf("outval: %d\n", outval);        /* %d reinterprets the swapped bits as a
                                            signed int, hence the negative output  */
  backinval = ntohl(outval);             /* swap back to host byte order */
  printf("backinval: %d\n", backinval);
  return 0;
}

This gives the following result on my 64 bit x86 which is little endian:

$ gcc -Wall example.c
$ ./a.out
outval: -2113731068
backinval: 67502978
$
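Tying this back to the question's framing goal, here is a sketch (the helper frame_message and its layout are just an illustration, not part of the answer above) of how the big-endian length prefix could be placed in front of the payload before sending; the receiver would read the first four bytes, ntohl() them, and then read exactly that many payload bytes:

#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes a 4-byte big-endian length followed by the
   payload into out (assumed large enough) and returns the total size.   */
static size_t frame_message(unsigned char *out, const void *payload, uint32_t len)
{
    uint32_t be_len = htonl(len);          /* length prefix in network byte order */
    memcpy(out, &be_len, sizeof be_len);
    memcpy(out + sizeof be_len, payload, len);
    return sizeof be_len + len;
}

int main(void)
{
    const char payload[] = "hello";
    unsigned char out[64];
    size_t total = frame_message(out, payload, (uint32_t)(sizeof payload - 1));

    printf("framed %zu bytes (4-byte header + %zu payload bytes)\n",
           total, sizeof payload - 1);
    return 0;
}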

10 Comments

While your remark raises an important point for the binary exchange of values, it does not explain the OP's problem. Using a different byte order would not solve the problem either.
While @chqrlie has a point, +1 for the mandatory pointer to htonl and friends. I suspect that many spontaneous resets of the various embedded systems in my household could be avoided if people stopped attempts at re-implementing htonl etc. Disseminating best practice is a service to mankind. Especially because the attempts are usually not a sign of competence and thus bound to fail.
I really really don't like the interface of htonl et al. htonl should return an array of octets - not an int (I can imagine a platform where htonl of a valid integer ends up returning a trap value).
A little endian sign and magnitude or ones-complement machine where negative zero is a trap. A 0x00000080 would end up as a big-endian 0x80000000 which is the trap value.
@MartinBonner Although it is called htonl() it doesn't return an int, it returns a uint32_t. I would expect that 0x80000000 is a valid unsigned 32 bit integer on all architectures.
