I'm trying to detect whether a string is XML/HTML, as opposed to some other format like CSV or JSON (which may contain HTML as data) or plain text that happens to contain stray < or > characters. I am NOT trying to validate complete XML or HTML documents: the strings I am testing may be snippets of XML/HTML, or snippets of something else. So my criteria are that the string must contain at least one properly formed XML tag, and that tag must start at the beginning of the string, ignoring any leading whitespace. (At this point you may have guessed that I am trying to auto-detect the MIME type of textual content before sending it back to the browser. BTW, I'm using PHP.)
I have a regex that will detect the XML/HTML tag:
~<[a-z]+.*?(>.*?</[a-z]+>|/>)~i
And I have a regex that will tell me if the tag starts the string, ignoring whitespace:
~^\s*<~
Problem is, I cannot figure out how to combine both of these into a single regex. The difficulty seems to stem from the "greedy" aspect of regex, particularly if the subject contains nested tags. Help?
/<([^>]+)>.+?<\/\1>/
~^(\s+)?<[a-z]+.*?(>.*?</[a-z]+>|/>)~i
<?xml version="1.0"?><xmltag attr="1" /> is valid XML.
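For what it's worth, here is a minimal PHP sketch of the anchored approach, wrapped in a function. The name looksLikeMarkup is hypothetical, and the pattern is my own variation on the combined regex rather than anything confirmed in the thread. It assumes three things you may or may not want: an optional leading <?xml ...?> declaration is skipped, tag names may contain digits, colons, and hyphens (so <h1> works), and the s modifier lets a tag span multiple lines. It does not check that the closing tag name matches the opening one.

```php
<?php
// Hypothetical helper (not from the original post): returns true when $text,
// ignoring leading whitespace, starts with something that looks like an
// XML/HTML tag. This is a heuristic, not a validator.
function looksLikeMarkup(string $text): bool
{
    $pattern = '~^\s*'                    // anchor at start, skip leading whitespace
        . '(?:<\?xml\b.*?\?>\s*)?'        // assumption: allow an optional XML declaration
        . '<[a-z][a-z0-9:-]*'             // opening tag name (digits allowed, e.g. h1)
        . '[^>]*?'                        // attributes, matched lazily
        . '(?:/>'                         // either a self-closing tag...
        . '|>.*?</[a-z][a-z0-9:-]*\s*>)'  // ...or any later closing tag (names not compared)
        . '~is';                          // i: case-insensitive, s: dot matches newlines

    // preg_match returns 1 on a match, 0 on no match, false on error.
    return preg_match($pattern, $text) === 1;
}

var_dump(looksLikeMarkup("  <p>Hello</p>"));
var_dump(looksLikeMarkup('<?xml version="1.0"?><xmltag attr="1" />'));
var_dump(looksLikeMarkup('{"html": "<b>x</b>"}'));
```

Because the pattern is anchored with ^, the lazy quantifiers cannot slide forward to find a tag later in the string, so JSON or CSV that merely contains HTML as data is rejected at the first non-whitespace character.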