Timeline for How do I get a consistent byte representation of strings in C# without manually specifying an encoding?
Current License: CC BY-SA 4.0
97 events
| when | what | action | by | license / score | comment |
|---|---|---|---|---|---|
| Mar 14, 2024 at 15:48 | history | unprotected | casperOne | ||
| Sep 26, 2022 at 23:26 | answer | added | Michel Diemer | timeline score: 3 | |
| Jun 1, 2022 at 18:16 | comment | added | Andrew Morton | Where did the string come from? It might be possible to read bytes from the original source instead of going via a string. | |
| Apr 12, 2022 at 18:16 | comment | added | Karl Stephen | Also, why should encoding even be taken into consideration? Because the bytes you get through your program are bytes produced by a default encoding, likely UTF-16 little-endian on a .NET Windows platform. The day the system environment changes, your data will likely become useless garbage! If you just want to write binary files for your own use, through your program, on a computer that will stop getting updates at some point, that's okay. But don't hand them to others on a different architecture and/or endianness without specifying the encoding you used to produce the bytes. | |
| Oct 3, 2020 at 10:27 | review | Close votes | | | completed Oct 7, 2020 at 0:01 |
| Sep 7, 2020 at 1:26 | review | Close votes | | | completed Sep 11, 2020 at 0:03 |
| Aug 3, 2020 at 20:44 | answer | added | Chris Hutchinson | timeline score: 2 | |
| Feb 26, 2020 at 22:22 | history | edited | John Smith | CC BY-SA 4.0 | added 5 characters in body |
| Sep 11, 2019 at 4:21 | answer | added | jpmc26 | timeline score: 3 | |
| S Oct 1, 2018 at 12:36 | history | suggested | Dragonthoughts | | This relates strongly to character encoding |
| Oct 1, 2018 at 11:23 | review | Suggested edits | | | completed S Oct 1, 2018 at 12:36 |
| Jul 2, 2018 at 20:51 | answer | added | Jason Goemaat | timeline score: 8 | |
| Jun 27, 2018 at 11:21 | comment | added | Thanasis Ioannidis | You should always worry about what encoding your string is represented in within the byte array. The assumption that the string is represented in memory as a byte array is arbitrary. It happens to be like that in the present implementation of .NET. No one can guarantee it won't change to a linked-list implementation in the future (or any other exotic data structure). Even if you use the same system and the same program to read back the encrypted data, there is always a chance a future patch of .NET will break everything, because you didn't explicitly specify what encoding you were working in. | |
| Jun 27, 2018 at 11:16 | comment | added | Thanasis Ioannidis | Not worrying about encoding is one thing. Not wanting to specify an encoding is another thing entirely. If what frustrates you is which encoding you should use, just pick one and use it every time for conversions between string and byte array. For instance, always use Unicode, or UTF-8. Your choice. Once you have chosen an encoding, you need not worry any more and your problem is solved. But if your frustration comes from the need to specify an encoding at all, then you had better get used to it, because whether you like it or not, an encoding is taking place. | |
| Jan 10, 2018 at 20:21 | answer | added | John Rasch | timeline score: 17 | |
| S Dec 18, 2017 at 19:05 | history | edited | Servy | CC BY-SA 3.0 | deleted 38 characters in body |
| Dec 18, 2017 at 17:41 | review | Suggested edits | | | completed S Dec 18, 2017 at 19:05 |
| Dec 5, 2017 at 16:23 | comment | added | mg30rg | Encoding is necessary because the size, in bytes, of the represented characters depends on it. Not only is sizeof(char) different between, e.g., ASCII (1 byte) and wide strings (2 bytes), but the size can even vary within one encoding: in UTF-8 a character is represented as 1 to 4 bytes. | |
| Nov 8, 2017 at 18:21 | answer | added | NH. | timeline score: 2 | |
| Oct 2, 2017 at 16:32 | review | Close votes | | | completed Oct 6, 2017 at 0:05 |
| Jul 24, 2017 at 9:36 | comment | added | Jeppe Stig Nielsen | Your first comment (quote): *Every string is stored as an array of bytes right? Why can't I simply have those bytes?* No, every string is (more or less) stored as an array of 16-bit code units which correspond to UTF-16. There will be surrogate pairs in there if your string contains Unicode characters outside plane 0. You can get that representation easily: `var array1 = yourString.ToCharArray();`. If for some reason you want the code units as UInt16 values, do `var array2 = Array.ConvertAll<char, ushort>(array1, x => x);`. That is a `ushort[]` there. | |
| Apr 28, 2017 at 13:59 | comment | added | Kris Vandermotten | Are you assuming that `System.Text.Encoding.Unicode.GetBytes()` is doing some kind of expensive conversion that you want to avoid? If so, your assumption is wrong. | |
| Apr 20, 2017 at 8:36 | comment | added | Ark-kun | @AgnelKurian "He wants me to take care of writing and reading those numbers. I am not interpreting them." - If you weren't interpreting them, you'd have bytes and not "numbers". Then your question disappears. If you have "numbers", that means you've already interpreted/decoded them and thrown away the original byte data. And now you want to try to reconstruct the data (encode), which might not even be possible. What if the numbers were actually base-10, and by cramming them into base-2 floats you've destroyed them forever? Don't want to encode? Don't decode then. Want bytes? Then use bytes. | |
| Jan 9, 2017 at 1:15 | history | edited | Peter Mortensen | CC BY-SA 3.0 | Copy edited. |
| Aug 30, 2016 at 10:21 | review | Suggested edits | | | completed Aug 30, 2016 at 11:34 |
| Mar 5, 2016 at 15:00 | history | edited | justhalf | CC BY-SA 3.0 | Reword (with slight change in meaning) to make it more accurate in describing OP's use case, which is very specific (not string-to-byte conversion in the general case). Include comments from OP in the question to make that use case clearer. |
| Feb 11, 2016 at 19:32 | answer | added | Mojtaba Rezaeian | timeline score: 0 | |
| Jan 21, 2016 at 17:19 | answer | added | IgnusFast | timeline score: -5 | |
| Aug 18, 2015 at 17:04 | answer | added | Gerard ONeill | timeline score: 8 | |
| Jun 30, 2015 at 14:39 | answer | added | alireza amini | timeline score: 1 | |
| Apr 24, 2015 at 9:47 | history | edited | Peter Mortensen | CC BY-SA 3.0 | Copy edited. Removed historical information (e.g. ref. <http://meta.stackexchange.com/a/230693> and <http://meta.stackoverflow.com/questions/266164>). |
| Jan 21, 2015 at 14:05 | answer | added | Piero Alberto | timeline score: -1 | |
| Dec 17, 2014 at 21:23 | comment | added | Greg D | @AgnelKurian: Are you trolling me? That question doesn't make sense. I could infer that you meant something like "...store information about the encoding that was used 1000 times for 1000 different strings." Nobody ever said anything about doing that, though, and it was explicitly denied earlier when I stated "The encoding of that string is an implicit part of the serialized contract...", so you couldn't have meant that. | |
| Dec 17, 2014 at 2:42 | comment | added | Agnel Kurian | @GregD so you want to store the same encoding 1000 times for 1000 different strings? | |
| Dec 15, 2014 at 18:28 | comment | added | Greg D | @Agnel Kurian: If you're writing arbitrary binary data, write binary data. That has nothing to do with the original question (which is fundamentally about serializing a string). | |
| Dec 13, 2014 at 3:36 | comment | added | Agnel Kurian | @Greg D, Let's say my client has some floating point numbers in some exotic format used to store astronomical distances. He uses just that one format. He wants me to take care of writing and reading those numbers. I am not interpreting them. My client interprets the numbers and all he needs to give me are the bytes I need to write. When reading, all he needs from me are the bytes I have written. Storing a format flag each time in addition to the bytes is a waste of space when he is using just one format for all numbers. | |
| Dec 12, 2014 at 22:44 | comment | added | Greg D | Four years later, I stand by my original comment on this question. It's fundamentally flawed because the fact that we're talking about a string implies interpretation. The encoding of that string is an implicit part of the serialized contract, otherwise it's just a bunch of meaningless bits. If you want meaningless bits, why generate them from a string at all? Just write a bunch of 0's and be done with it. | |
| Nov 25, 2014 at 10:29 | answer | added | Jodrell | timeline score: 4 | |
| Nov 3, 2014 at 21:50 | comment | added | usr | @Mehrdad the existing answers were already invalid (not what was asked). Yours is pretty much the only answer that actually answers just what was asked. (I recommend, though, that you edit your answer to include a few warnings that this approach is really almost never the best one.) | |
| Nov 3, 2014 at 21:37 | comment | added | user541686 | @usr: you just invalidated almost all the answers with your edit, and also made it harder for people to find this question with their natural search query (but you probably did that intentionally). | |
| Nov 3, 2014 at 20:18 | history | edited | usr | CC BY-SA 3.0 | Edited the title to make it more obvious what approach is being asked about here (the wrong one!) |
| Sep 9, 2014 at 11:30 | answer | added | Jarvis Stark | timeline score: 17 | |
| Aug 28, 2014 at 16:14 | answer | added | George | timeline score: 0 | |
| Aug 28, 2014 at 15:43 | comment | added | George | A char is not a byte and a byte is not a char. A char is both a key into a font table and a lexical tradition. A string is a sequence of chars. (Words, paragraphs, sentences, and titles also have their own lexical traditions that justify their own type definitions -- but I digress.) Like integers, floating point numbers, and everything else, chars are encoded into bytes. There was a time when the encoding was a simple one-to-one mapping: ASCII. However, to accommodate all of human symbology, the 256 permutations of a byte were insufficient, and encodings were devised to selectively use more bytes. | |
| Jun 11, 2014 at 11:29 | answer | added | Vijay Singh Rana | timeline score: 2 | |
| Apr 9, 2014 at 12:39 | answer | added | WonderWorker | timeline score: -1 | |
| S Mar 18, 2014 at 9:43 | history | suggested | Newbee | CC BY-SA 3.0 | removing tag from title |
| Mar 18, 2014 at 9:42 | review | Suggested edits | | | completed S Mar 18, 2014 at 9:43 |
| Dec 2, 2013 at 4:43 | answer | added | Tom Blodget | timeline score: 105 | |
| Oct 22, 2013 at 12:55 | answer | added | mashet | timeline score: 10 | |
| Sep 27, 2013 at 23:26 | answer | added | Thomas Eding | timeline score: -12 | |
| Sep 2, 2013 at 11:21 | answer | added | Shyam sundar shah | timeline score: 6 | |
| Aug 5, 2013 at 22:04 | comment | added | Travis Watson | @AgnelKurian, a `char` is a struct that just happens to currently store values as a 16-bit number (UTF-16). What you're really asking for (the character bytes) isn't theoretically possible, because it doesn't theoretically exist. A `char` or `string` has no encoding by definition. What if the memory representation changed to UTF-32? Your "get the bytes, shove them back" would fail because of encoding, precisely because you avoided encoding. So "Why this dependency on encoding?!" Depend on encoding so your code is dependable. | |
| Jul 6, 2013 at 12:06 | review | Close votes | | | completed Jul 6, 2013 at 17:14 |
| Jul 6, 2013 at 11:47 | comment | added | adamjcooper | possible duplicate of How do you convert a string to a byte array in .Net | |
| Jun 27, 2013 at 19:25 | history | protected | Paŭlo Ebermann | ||
| Jun 12, 2013 at 3:34 | review | Suggested edits | | | completed Jun 12, 2013 at 3:37 |
| Jun 5, 2013 at 10:52 | answer | added | Shyam sundar shah | timeline score: 23 | |
| Jan 23, 2013 at 6:21 | answer | added | sagardhavale | timeline score: -4 | |
| Jan 15, 2013 at 11:43 | answer | added | Tommaso Belluzzo | timeline score: 3 | |
| Oct 12, 2012 at 6:43 | history | rollback | Agnel Kurian | | Rollback to Revision 4 |
| Oct 11, 2012 at 17:47 | history | edited | artbristol | CC BY-SA 3.0 | Question is highly misleading in its current form. Added detail from OP's comments to clarify. |
| Oct 11, 2012 at 9:45 | answer | added | Avlin | timeline score: 1 | |
| Apr 30, 2012 at 12:50 | answer | added | Michael Buen | timeline score: 46 | |
| Apr 30, 2012 at 8:45 | vote | accept | Agnel Kurian | ||
| Apr 30, 2012 at 7:44 | answer | added | user541686 | timeline score: 1948 | |
| Apr 30, 2012 at 7:26 | answer | added | Erik A. Brandstadmoen | timeline score: 304 | |
| Jan 2, 2012 at 11:07 | answer | added | user1120193 | timeline score: 1 | |
| Jul 25, 2011 at 22:52 | answer | added | Nathan | timeline score: 42 | |
| Mar 10, 2011 at 8:57 | answer | added | Gman | timeline score: 26 | |
| Mar 22, 2010 at 8:40 | answer | added | Alessandro Annini | timeline score: 9 | |
| Dec 1, 2009 at 19:47 | comment | added | Greg | To play devil's advocate: if you wanted to get the bytes of an in-memory string (as .NET uses them) and manipulate them somehow (e.g. CRC32), and NEVER EVER wanted to decode them back into the original string, it isn't straightforward why you'd care about encodings or how you'd choose which one to use. | |
| Jul 22, 2009 at 11:30 | comment | added | Alexey Romanov | In case of .NET, the easy route is using UTF-16 on both sides, since that's what .NET uses internally. | |
| Jul 16, 2009 at 11:45 | answer | added | Konamiman | timeline score: 25 | |
| Apr 13, 2009 at 14:14 | comment | added | Lucas Jones | You can take the easy route and just use UTF-8 on both sides. | |
| Apr 13, 2009 at 14:13 | comment | added | Lucas Jones | The encoding is what maps the characters to the bytes. For example, in ASCII, the letter 'A' maps to the number 65. In a different encoding, it might not be the same. The high-level approach to strings taken in the .NET framework makes this largely irrelevant, though (except in this case). | |
| Mar 4, 2009 at 5:51 | comment | added | Agnel Kurian | "A string is an array of chars, where a char is not a byte in the .Net world" Alright, but regardless of the encoding, each character maps to one or more bytes. Can I have those bytes please without having to specify an encoding? | |
| Feb 19, 2009 at 21:03 | answer | added | harmonik | timeline score: 1 | |
| Jan 30, 2009 at 11:02 | vote | accept | Agnel Kurian | | unaccepted Apr 30, 2012 at 8:45 |
| Jan 23, 2009 at 16:38 | comment | added | Greg D | I think Anthony is trying to address the fundamental disconnect in <300 chars. You're assuming some consistent internal representation of a string, when in fact that representation could be anything. To create, and eventually decode, the bytestream, you must choose an encoding to use. | |
| Jan 23, 2009 at 16:36 | answer | added | Michael Buen | timeline score: 120 | |
| Jan 23, 2009 at 15:54 | answer | added | Joel Coehoorn | timeline score: 53 | |
| Jan 23, 2009 at 14:34 | answer | added | Ed Marty | timeline score: 14 | |
| Jan 23, 2009 at 14:19 | history | edited | Dale Ragan | | Added c# tag. |
| Jan 23, 2009 at 14:15 | answer | added | Hans Passant | timeline score: 11 | |
| Jan 23, 2009 at 14:15 | comment | added | Igal Tabachnik | Have a look at Jon Skeet's answer in a post with the exact question. It will explain why you depend on encoding. | |
| Jan 23, 2009 at 14:05 | comment | added | Agnel Kurian | Every string is stored as an array of bytes right? Why can't I simply have those bytes? | |
| Jan 23, 2009 at 14:03 | answer | added | Zhaph - Ben Duguid | timeline score: 100 | |
| Jan 23, 2009 at 14:00 | comment | added | Greg D | If you're encrypting it, then you'll still have to know what the encoding is after you decrypt it so that you know how to reinterpret those bytes back into a string. | |
| Jan 23, 2009 at 13:57 | comment | added | Agnel Kurian | I'm going to encrypt it. I can encrypt it without converting but I'd still like to know why encoding comes to play here. Just give me the bytes is what I say. | |
| Jan 23, 2009 at 13:56 | comment | added | Greg D | Your confusion over the role of encoding makes me wonder if this is the right question. Why are you trying to convert a string to a byte array? What are you going to do with the byte array? | |
| Jan 23, 2009 at 13:51 | history | edited | kemiller2002 | CC BY-SA 2.5 | edited title |
| Jan 23, 2009 at 13:49 | history | edited | Agnel Kurian | CC BY-SA 2.5 | why encoding |
| Jan 23, 2009 at 13:43 | answer | added | cyberbobcat | timeline score: -3 | |
| Jan 23, 2009 at 13:43 | answer | added | bmotmans | timeline score: 1143 | |
| Jan 23, 2009 at 13:43 | answer | added | gkrogers | timeline score: 20 | |
| Jan 23, 2009 at 13:39 | history | asked | Agnel Kurian | CC BY-SA 2.5 |
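
Several comments in this timeline (Jeppe Stig Nielsen's, Travis Watson's, Kris Vandermotten's) make the same underlying point: a .NET string exposes UTF-16 code units, not bytes, and a byte representation only exists relative to a named encoding. A minimal C# sketch of that distinction (the string literal and class name are illustrative, not from the discussion):

```csharp
using System;
using System.Text;

class EncodingDemo
{
    static void Main()
    {
        // "\uD83D\uDE00" (😀) lies outside plane 0, so it occupies a surrogate pair in UTF-16.
        string s = "Héllo \uD83D\uDE00";

        // What .NET hands out without naming an encoding: UTF-16 code units, not bytes.
        char[] codeUnits = s.ToCharArray();
        Console.WriteLine(codeUnits.Length);    // 8 code units

        // A byte representation only exists once an encoding is chosen.
        byte[] utf8 = Encoding.UTF8.GetBytes(s);
        byte[] utf16 = Encoding.Unicode.GetBytes(s); // little-endian UTF-16

        Console.WriteLine(utf8.Length);         // 11 bytes
        Console.WriteLine(utf16.Length);        // 16 bytes

        // Round-tripping works as long as the same encoding is used on both sides.
        Console.WriteLine(Encoding.UTF8.GetString(utf8) == s); // True
    }
}
```

The same string yields different byte counts under different encodings, which is why the accepted answers insist that "the bytes of a string" is only well defined after an encoding is specified.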