2

I have a string of attribute names and definitions. I am trying to split the string on the attribute names into a Dictionary<string, string>, where the key is the attribute name and the value is the definition. I won't know the attribute names ahead of time, so I have been trying to split on the ":" character, but I am having trouble because the attribute name is not included in the split.

For example, I need to split this string on "Organization:", "OrganizationType:", and "Nationality:" into a Dictionary. Any ideas on the best way to do this with C#.Net?

Organization: Name of a governmental, military or other organization. OrganizationType: Organization classification to one of the following types: sports, governmental military, governmental civilian or political party. (required) Nationality: Organization nationality if mentioned in the document. (required)


Here is some sample code to help:

private static void Main()
{
    const string str = "Organization: Name of a governmental, military or other organization. OrganizationType: Organization classification to one of the following types sports, governmental military, governmental civilian or political party. (required) Nationality: Organization nationality if mentioned in the document. (required)";

    var array = str.Split(':');
    var dictionary = array.ToDictionary(x => x[0], x => x[1]);

    foreach (var item in dictionary)
    {
        Console.WriteLine("{0}: {1}", item.Key, item.Value);
    }

    // Expecting to see the following output:

    // Organization: Name of a governmental, military or other organization.
    // OrganizationType: Organization classification to one of the following types sports, governmental military, governmental civilian or political party.
    // Nationality: Organization nationality if mentioned in the document. (required)
}

Here is a visual explanation of what I am trying to do:

http://farm5.static.flickr.com/4081/4829708565_ac75b119a0_b.jpg

4
  • Welcome to SO. Good to see you accepted an answer. Note that it is customary at SO to upvote answers that proved worthwhile to you (i.e., the actual answer and/or others that helped too). A short faq is here: stackoverflow.com/faq Commented Jul 26, 2010 at 10:31
  • After more research with the 3rd party, I found an XML schema I can use (actually RDFS). I'll probably end up using this instead: opencalais.com/files/RDFS%20schema_09Jun16.txt Previously I had been doing a screen scrape of this page: opencalais.com/documentation/calais-web-service-api/… using a 3rd party scraping tool (dapper) to expose the page as XML: bit.ly/bThM19 Commented Jul 26, 2010 at 10:32
  • Hi Abel, I tried to vote up some answers, however I don't yet have 15 "reputation points", I guess I'm just not cool enough to upvote yet :-) I'll upvote when I reach the 15 point mark, thanks. Commented Jul 26, 2010 at 10:34
  • Paul, I helped you a bit, only 2 more points to go, which you'll get when you accept a next question you ask. PS: +1 from me for the visual explanation, which made this crystal clear. Commented Jul 26, 2010 at 10:38

3 Answers

3

I'd do it in two phases, firstly split into the property pairs using something like this:

Regex.Split(input, @"\s(?=[A-Z][A-Za-z]*:)")

This looks for any whitespace character followed by an alphabetic string and a colon, where the alphabetic string must start with a capital letter, and splits on that whitespace. Because the name sits inside a lookahead, (?=...), it is not consumed by the split. That will get you three strings of the form "PropertyName: PropertyValue". Splitting each one on its first colon is then pretty easy (I'd personally just use Substring and IndexOf rather than another regular expression), and you sound like you can do that bit fine on your own. Shout if you do want help with the second split.

The only thing to say is be careful in case you get false matches due to awkward input. In that case you'll just have to make the regex more complicated to compensate.
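
For completeness, here is a minimal sketch of both phases together. The class and method names (AttributeParser, Parse) are made up for illustration, and it assumes attribute names are purely alphabetic and start with a capital letter, as described above:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AttributeParser
{
    // Hypothetical helper: name and signature are illustrative only.
    public static Dictionary<string, string> Parse(string input)
    {
        var result = new Dictionary<string, string>();

        // Phase 1: split on whitespace that is followed by "Name:".
        // The lookahead keeps the name in the following chunk.
        string[] pairs = Regex.Split(input, @"\s(?=[A-Z][A-Za-z]*:)");

        foreach (string pair in pairs)
        {
            // Phase 2: split on the FIRST colon only, so colons inside
            // the definition text are left alone.
            int colon = pair.IndexOf(':');
            if (colon < 0)
            {
                continue; // no name in this chunk; skip it
            }

            string key = pair.Substring(0, colon).Trim();
            string value = pair.Substring(colon + 1).Trim();
            result[key] = value; // a repeated name overwrites the earlier value
        }

        return result;
    }
}
```

Calling AttributeParser.Parse on the string from the question should give three entries keyed "Organization", "OrganizationType" and "Nationality". The colon after "types" stays inside the OrganizationType definition because "types" does not start with a capital letter.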


2 Comments

Nice Chris! That is splitting the right way (at least the way that matches my expectations). That basically gives me an array of 3 items. I can then simply split each item on ':', index 0 is the attribute name and index 1 is the definition.
+1 Splitting into rows first was definitely the best solution.
1

You would need some delimiter to indicate when it is the end of each pair as opposed to having one large string with sections in between e.g.

Organization: Name of a governmental, military or other organization.|OrganizationType: Organization classification to one of the following types: sports, governmental military, governmental civilian or political party. (required) |Nationality: Organization nationality if mentioned in the document. (required)

Notice the | character, which marks the end of each pair. After that it is just a case of using a very specific delimiter between the name and the definition, something that is not likely to appear in the description text. Instead of a single colon you could use a double colon (::), since a single colon could crop up in the text on occasion, as others have pointed out. With both delimiters in place (| between pairs, :: between name and definition) you would just need to do:

// split the string into rows
string[] rows = myString.Split('|');
Dictionary<string, string> pairs = new Dictionary<string, string>();
foreach (var r in rows)
{
    // split each row into a pair and add to the dictionary
    string[] split = Regex.Split(r, "::");
    pairs.Add(split[0], split[1]);
}

You can use LINQ as others have suggested, the above is more for readability so you can see what is happening.
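
A LINQ version might look roughly like the sketch below. It assumes the same hypothetical input format as the loop above (| between pairs, :: between name and definition), and the class and method names are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DelimitedParser
{
    // Hypothetical helper: assumes "Name::Definition|Name::Definition" input.
    public static Dictionary<string, string> Parse(string input)
    {
        return input
            // Split into rows, dropping empty rows such as a trailing "|".
            .Split(new[] { '|' }, StringSplitOptions.RemoveEmptyEntries)
            // Split each row into at most two parts on the first "::".
            .Select(row => row.Split(new[] { "::" }, 2, StringSplitOptions.None))
            // Ignore rows that have no "::" at all.
            .Where(parts => parts.Length == 2)
            // ToDictionary throws if the same name appears twice.
            .ToDictionary(parts => parts[0].Trim(), parts => parts[1].Trim());
    }
}
```

Limiting the inner split to two parts means a stray "::" in the definition text ends up in the value rather than being silently dropped.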

Another alternative is to devise some custom regex to do what you need but again you would need to be making a lot of assumptions of how the description text would be formatted etc.

2 Comments

It would be great if I can add a delimiter, unfortunately I don't have control over the input string (it comes from a 3rd party). I'm trying to do some normalization of it to make a structured model. I'm afraid I might have to do a special split like you mentioned. The assumptions I can make are: 1) The attribute name will have no spaces in it. 2) The attribute name will be immediately followed by a ":".
@Paul: Ah ok, I noticed as well that each description only ever includes one full stop. You could possibly even use that as the delimiter for each row? Although it is a big assumption...
1

Considering that each word in front of the colon always has at least one capital (please confirm), you could solve this by using regular expressions (otherwise you'd end up splitting on all colons, which also appear inside the sentences):

var resultDict = Regex.Split(input, @"(?<= [A-Z][a-zA-Z]+):")
                 .ToDictionary(a => a[0], a => a[1]);

The (?<=...) is a positive look-behind expression that doesn't "eat up" the characters, thus only the colon is removed from the output. Tested with your input here.

The [A-Z][a-zA-Z]+ means: a word that starts with a capital.

Note that, as others have suggested, a "smarter" delimiter will provide easier parsing, as does escaping the delimiter (i.e. like "::" or "\:") when you are required to use colons. Not sure if those are options for you though, hence the solution with regular expressions above.

Edit

For one reason or another, I kept getting errors when using ToDictionary, so here's the unrolled version, which at least works. Apologies for the earlier non-working version. Note that the regular expression has changed: the first version did not split the key off from the preceding data.

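// Split on the colon after each key and on the space before each key.
// Regex.Split also returns the captured groups, so blank entries are filtered out.
var splitArray = Regex.Split(input, @"(?<=( |^)[A-Z][a-zA-Z]+):|( )(?=[A-Z][a-zA-Z]+:)")
                            .Where(a => a.Trim() != "").ToArray();

Dictionary<string, string> resultDict = new Dictionary<string, string>();
// Entries alternate key, value; stop early if a trailing key has no value.
for (int i = 0; i + 1 < splitArray.Length; i += 2)
{
    resultDict.Add(splitArray[i], splitArray[i + 1].Trim());
}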

Note: the regular expression becomes a tad complex in this scenario. As suggested in the thread below, you can split it in smaller steps. Also note that the current regex creates a few empty matches, which I remove with the Where-expression above. The for-loop should not be needed if you manage to get ToDictionary working.
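
If ToDictionary is the goal, one way to get there is to match whole name/definition pairs with Regex.Matches instead of splitting. This is a different technique from the Regex.Split calls above, sketched under the same assumption that names are alphabetic and start with a capital letter. The class and method names are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PairMatcher
{
    // Hypothetical helper: matches "Name: definition" pairs in one pass.
    public static Dictionary<string, string> Parse(string input)
    {
        // key   : a capitalised alphabetic word directly before a colon
        // value : as little text as possible, up to the next " Name:" or the end
        const string pattern =
            @"\b(?<key>[A-Z][A-Za-z]*):\s*(?<value>.*?)(?=\s+[A-Z][A-Za-z]*:|$)";

        return Regex.Matches(input, pattern, RegexOptions.Singleline)
            .Cast<Match>()
            // ToDictionary throws if the same name appears twice.
            .ToDictionary(
                m => m.Groups["key"].Value,
                m => m.Groups["value"].Value.Trim());
    }
}
```

Each Match carries its own key and value groups, so there are no empty entries to filter out and no need to pair up alternating array items.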

4 Comments

Very cool and good to know about positive look-behind expression. I tried it, however I am still faced with the issue of the attribute name being left in "next" see this image: farm5.static.flickr.com/4093/4830339922_1945af206d_b.jpg
@Darin and @Paul: aha, I notice that. Fixing...!
I personally think it's a bit complicated to try to split it all in one go. Much easier to split first into your name/value pairs and then split those pairs apart. It's going to be much more understandable at the end of the day than if you're trying to split on both.
@Chris, I wholeheartedly agree. Feel free to split as much as needed. The first solution above splits in key+data per item. After that it becomes trivial. The second solution does it all in one go.
