
I am trying to build a regex parser for a single XML block.

I know people will say that Regex is not a good plan for xml, but I am working with stream data and I just need to know if a complete xml block has been broadcast and is sitting in the buffer.

I am trying to handle anything between the opening and closing tags of the XML block, as well as any data in the parameters (attributes) of the block's opening tag.

My example code is below the broken-down regular expression. If anyone has any input on how to make this as comprehensive as possible, I would greatly appreciate it.

Here is my regular expression, formatted for readability.

I am balancing the TAG group as well as the inTAG group, and validating that neither is left open at the end of its expression segment.

/*
   ^(?<TAG>[<]
        (?![?])
        (?<TAGNAME>[^\s/>]*)
    )
    (?<ParamData>
        (
            (\"
                (?>
                    \\\"|
                    [^"]|
                    \"(?<quote>)|
                    \"(?<-quote>)
                )*
                (?(quote)(?!))
                \"
            )|
            [^/>]
        )*?
    )
    (?:
        (?<HASCONTENT>[>])|
        (?<-TAG>
            (?<TAGEND>/[>])
        )
    )
    (?(HASCONTENT)
        (
            (?<CONTENT>
                (
                    (?<inTAG>[<]\<TAGNAME>)(?<-inTAG>/[>])?|
                    (?<-inTAG>[<]/\<TAGNAME>[>])|
                    ([^<]+|[<](?![/]?\<TAGNAME>))
                )*?
                (?(inTAG)(?!))
            )
        )
        (?<TAGEND>(?<-TAG>)[<]/\<TAGNAME>[>])
    )
    (?(TAG)(?!))
*/
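The balancing constructs used in the expression above (`(?<name>...)` to push, `(?<-name>...)` to pop, and `(?(name)(?!))` to fail when something is still open) can be easier to follow in isolation. The following is only a minimal sketch on balanced parentheses rather than XML tags; `BalancingDemo`, `IsBalanced`, and the group name `open` are illustrative names, not part of the class below.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BalancingDemo
{
    // (?<open>\()   pushes a capture onto the "open" stack.
    // (?<-open>\))  pops one capture, and fails if the stack is empty.
    // (?(open)(?!)) forces a failure if anything is still on the stack.
    private static readonly Regex Balanced = new Regex(
        @"^(?:[^()]|(?<open>\()|(?<-open>\)))*(?(open)(?!))$");

    public static bool IsBalanced(string text)
    {
        return Balanced.IsMatch(text);
    }
}
```

The XML expression applies the same idea twice: once for the outer tag and once for nested tags with the same name.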

Within my class, a null return value from GetObject means there was no complete XML block in the buffer.

Here is the class I am using.

(I used a verbatim string literal (@"") to limit the escaping required; all " characters were replaced with "" so the string is formatted properly.)

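using System.Text;
using System.Text.RegularExpressions;

public class XmlDataParser
{
    // xmlObjectExpression defined below to limit code highlight errors
    private Regex _xmlRegex;
    private Regex xmlRegex
    {
        get
        {
            if (_xmlRegex == null)
            {
                _xmlRegex = new Regex(xmlObjectExpression);
            }
            return _xmlRegex;
        }
    }

    // Private lock object; locking on "this" lets outside code contend for the same lock.
    private readonly object syncRoot = new object();
    private string backingStore = "";

    public bool HasObject()
    {
        // Read the buffer under the same lock that AddData and GetObject use.
        lock (syncRoot)
        {
            return xmlRegex.IsMatch(backingStore);
        }
    }

    public string GetObject()
    {
        lock (syncRoot)
        {
            // Match once inside the lock so the check and the removal see the same buffer.
            Match obj = xmlRegex.Match(backingStore);
            if (!obj.Success)
            {
                return null; // no complete block at the start of the buffer
            }
            backingStore = backingStore.Substring(obj.Length);
            return obj.Value;
        }
    }

    public void AddData(byte[] bytes)
    {
        lock (syncRoot)
        {
            // Encoding.Default depends on the machine, and a multi-byte character
            // split across two chunks will not decode correctly this way.
            backingStore += Encoding.Default.GetString(bytes);
        }
    }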

    private static string xmlObjectExpression = @"^(?<TAG>[<](?![?])(?<TAGNAME>[^\s/>]*))(?<ParamData>((\""(?>\\\""|[^""]|\""(?<quote>)|\""(?<-quote>))*(?(quote)(?!))\"")|[^/>])*?)(?:(?<HASCONTENT>[>])|(?<-TAG>(?<TAGEND>/[>])))(?(HASCONTENT)((?<CONTENT>((?<inTAG>[<]\<TAGNAME>)(?<-inTAG>/[>])?|(?<-inTAG>[<]/\<TAGNAME>[>])|([^<]+|[<](?![/]?\<TAGNAME>)))*?(?(inTAG)(?!))))(?<TAGEND>(?<-TAG>)[<]/\<TAGNAME>[>]))(?(TAG)(?!))";



}
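One stream-related detail in AddData: each chunk is decoded on its own with GetString, so a multi-byte character that straddles two chunks would be corrupted. The following is a minimal sketch of one way around that, assuming the stream is UTF-8 (the question does not say which encoding is used); `ChunkDecoder` and `Decode` are hypothetical names.

```csharp
using System;
using System.Text;

public class ChunkDecoder
{
    // Assumption: the incoming bytes are UTF-8. The Decoder keeps the trailing
    // bytes of an incomplete character and completes them with the next chunk.
    private readonly Decoder _decoder = Encoding.UTF8.GetDecoder();

    public string Decode(byte[] bytes)
    {
        // GetCharCount takes the decoder's carried-over state into account.
        char[] chars = new char[_decoder.GetCharCount(bytes, 0, bytes.Length)];
        int written = _decoder.GetChars(bytes, 0, bytes.Length, chars, 0);
        return new string(chars, 0, written);
    }
}
```

AddData could then append the result of `Decode(bytes)` to the buffer instead of calling GetString directly.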
  • If all you want to know is whether it's a complete XML block, pass it to XmlDocument and load it. It will be way faster than your regex approach. Commented Sep 18, 2013 at 18:15
  • "I know people will say that Regex is not a good plan for xml" Regex is not a good plan for xml. Commented Sep 18, 2013 at 18:24
  • >"Regex XML parsing" OOPS... Error in my parser. Rebooting... Commented Sep 18, 2013 at 18:32

1 Answer


Just use XmlReader and feed it a TextReader. To read a stream that contains more than one top-level element, set the ConformanceLevel to Fragment.

    // "tr" is the TextReader over the incoming data; "t" and "output" come from
    // the surrounding code, which is not shown here.
    XmlReaderSettings settings = new XmlReaderSettings();
    settings.ConformanceLevel = ConformanceLevel.Fragment;
    using (XmlReader reader = XmlReader.Create(tr, settings))
    {
        while (reader.Read())
        {
            switch (reader.NodeType)
            {
                // this is from my code. You'll rewrite this part:
                case XmlNodeType.Element:
                    if (t != null)
                    {
                        t.SetName(reader.Name);
                    }
                    else if (reader.Name == "event")
                    {
                        t = new Event1();
                        t.Name = reader.Name;
                    }
                    else if (reader.Name == "data")
                    {
                        t = new Data1();
                        t.Name = reader.Name;
                    }
                    else
                    {
                        throw new Exception("Unexpected element: " + reader.Name);
                    }
                    break;
                case XmlNodeType.Text:
                    if (t != null)
                    {
                        t.SetValue(reader.Value);
                    }
                    break;
                case XmlNodeType.XmlDeclaration:
                case XmlNodeType.ProcessingInstruction:
                    break;
                case XmlNodeType.Comment:
                    break;
                case XmlNodeType.EndElement:
                    // The closing tag of the current object: finish it and write it out.
                    if (t != null)
                    {
                        if (t.Name == reader.Name)
                        {
                            t.Close();
                            t.Write(output);
                            t = null;
                        }
                    }
                    break;
                case XmlNodeType.Whitespace:
                    break;
            }
        }
    }
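For the narrower question of whether a complete top-level block is already sitting in a buffer, the same reader settings can be used for detection only. The following is a minimal sketch under that assumption; `BlockDetector` and `HasCompleteBlock` are hypothetical names. It reports presence only and does not return the block's text or its length in the buffer.

```csharp
using System;
using System.IO;
using System.Xml;

public static class BlockDetector
{
    // Returns true once the first top-level element in the buffer has been
    // read through to its end tag. Truncated input makes the reader throw
    // XmlException, which is treated here as "not complete yet".
    public static bool HasCompleteBlock(string buffer)
    {
        XmlReaderSettings settings = new XmlReaderSettings();
        settings.ConformanceLevel = ConformanceLevel.Fragment;
        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(buffer), settings))
            {
                while (reader.Read())
                {
                    if (reader.Depth != 0)
                    {
                        continue; // still inside the top-level element
                    }
                    if (reader.NodeType == XmlNodeType.Element && reader.IsEmptyElement)
                    {
                        return true; // self-closing block such as <event/>
                    }
                    if (reader.NodeType == XmlNodeType.EndElement)
                    {
                        return true; // end tag of the top-level element
                    }
                }
            }
        }
        catch (XmlException)
        {
            // Incomplete or malformed data: fall through and report false.
        }
        return false;
    }
}
```

Because the loop stops at the first top-level end tag, a partial second block later in the buffer is never parsed and does not affect the result. Note that this sketch cannot tell malformed data apart from data that is merely incomplete.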

3 Comments

How would I go about getting just the completed blocks out of the stream at this point? And will this work on a TcpStream?
I greatly appreciate the constructive information about XmlReader. Do you know if XmlReader will work on a TCP stream? I will begin playing with it shortly to determine whether it will accomplish the task I need.
I ended up going with my regex method anyhow, as I didn't want to parse partial XML in any event. But this does appear to be a good route to go, so I have accepted the answer. Thank you again.
