Jon Skeet describes the actual implementation very well, but let's consider why it would be nonsensical for the CLR to store strings directly in an array.
The first reason is that storing strings directly in the array would harm performance. If strings were stored directly in an array, then to get to the element 1000 of the array the CLR would have to walk through the bytes of all the strings in the array until it reached element 1000, checking all the while for string boundaries. Since strings and any other reference types are stored in arrays as references, finding the right element of the array requires one multiplication, one addition, and following one pointer (the notion of a pointer here is at the implementation level, not the programmer-visible level). This produces much better performance.
The second reason that strings cannot reasonably be stored directly in an array is that C# arrays of reference type are covariant. Let's say that strings were stored directly in the array generated with
string[] strings = new string[] {"one", "two", "three"};
Then, you cast this to an object array, which is legal
object[] objs = (object[])strings;
How is the compiler supposed to generate code that takes this possibility into account? A method that takes an object array as a parameter can have a string array passed to it, so the CLR needs to know whether to index into the array as an object array, or a string array, or some other type of array. Somehow, at runtime every array would have to be marked with the type declaration of the array, and every array access would have to check the type declaration and then traverse the array differently depending on the type of the array. It's far simpler to stick with references, which allow a single implementation of array accesses and improve performance to boot.