4

I'm working with text contained in JS variables on a webpage. I extract the strings using regex, then turn them into Python objects using json.loads().

The issue I'm having is the unquoted keys. Right now, I'm doing a series of replacements (code below) to wrap each key in double quotes, but what I want is to dynamically identify any unquoted keys before passing the string into json.loads().

Example 1 with no space after : character

json_data1 = '[{storeName:"testName",address:"12345 Road",address2:"Suite 500",city:"testCity",storeImage:"http://www.testLink.com",state:"testState",phone:"999-999-9999",lat:99.9999,lng:-99.9999}]'

Example 2 with space after : character

json_data2 = '[{storeName: "testName",address: "12345 Road",address2: "Suite 500",city: "testCity",storeImage: "http://www.testLink.com",state: "testState",phone: "999-999-9999",lat: 99.9999,lng: -99.9999}]'

Example 3 with a space after both the , and : characters

json_data3 = '[{storeName: "testName", address: "12345 Road", address2: "Suite 500", city: "testCity", storeImage: "http://www.testLink.com", state: "testState", phone: "999-999-9999", lat: 99.9999, lng: -99.9999}]'

Example 4 with space after : character and newlines

json_data4 = '''[
{
    storeName: "testName", 
    address: "12345 Road", 
    address2: "Suite 500", 
    city: "testCity", 
    storeImage: "http://www.testLink.com", 
    state: "testState", 
    phone: "999-999-9999", 
    lat: 99.9999, lng: -99.9999
}]'''

I need to create a pattern that identifies keys and does not match string values that happen to contain a colon, such as the link in storeImage. In other words, I want to dynamically find the keys and double-quote them so that json.loads() can parse the string as valid JSON.

I'm currently replacing each key in the text this way

content = re.sub('storeName:', '"storeName":', content)
content = re.sub('address:', '"address":', content)
content = re.sub('address2:', '"address2":', content)
content = re.sub('city:', '"city":', content)
content = re.sub('storeImage:', '"storeImage":', content)
content = re.sub('state:', '"state":', content)
content = re.sub('phone:', '"phone":', content)
content = re.sub('lat:', '"lat":', content)
content = re.sub('lng:', '"lng":', content)

Returned as string representing valid JSON

json_data = [{"storeName": "testName", "address": "12345 Road", "address2": "Suite 500", "city": "testCity", "storeImage": "http://www.testLink.com", "state": "testState", "phone": "999-999-9999", "lat": 99.9999, "lng": -99.9999}]

I'm sure there is a better way of doing this but I haven't been able to find or come up with a regex pattern to handle these. Any help is greatly appreciated!

1
  • json.loads does not create JSON objects. It takes a valid JSON value (which may or may not include JSON objects), and returns a Python value. Further, why does your data contain such broken pseudo-JSON in the first place? Commented Jan 30, 2018 at 15:38
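The distinction made in the comment above can be seen in a minimal sketch. The sample string here is a shortened, already valid version of the question's data and is only an illustration:

```python
import json

# A valid JSON text: keys are double-quoted.
text = '[{"storeName": "testName", "lat": 99.9999}]'

# json.loads parses JSON text and returns ordinary Python values.
value = json.loads(text)
print(type(value).__name__)     # list
print(type(value[0]).__name__)  # dict

# json.dumps goes the other way: Python value to JSON text.
round_trip = json.dumps(value)
```

So the goal in the question is to repair the text until it is valid JSON; the result of json.loads() is then a Python list of dicts.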

3 Answers

3

That repetition is of course unnecessary. You could put everything into a single regex:

content = re.sub(r"\b(storeName|address2?|city|storeImage|state|phone|lat|lng):", r'"\1":', content)

\1 contains the match within the first (in this case, only) set of parentheses, so "\1": surrounds it with quotes and adds back the colon.

Note the use of a word boundary anchor to make sure we match only those exact words.
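As a quick check, the single substitution can be run against Example 3 from the question and the result passed to json.loads(). The names KEY_PATTERN and stores are only illustrative:

```python
import json
import re

# Example 3 from the question.
content = '[{storeName: "testName", address: "12345 Road", address2: "Suite 500", city: "testCity", storeImage: "http://www.testLink.com", state: "testState", phone: "999-999-9999", lat: 99.9999, lng: -99.9999}]'

# One alternation replaces the nine separate re.sub calls.
KEY_PATTERN = re.compile(r"\b(storeName|address2?|city|storeImage|state|phone|lat|lng):")

# \1 re-inserts the captured key name, now wrapped in double quotes.
quoted = KEY_PATTERN.sub(r'"\1":', content)
stores = json.loads(quoted)
print(stores[0]["storeName"])
```

This still requires knowing the key names in advance, and it would also rewrite a string value that happens to contain one of those names followed by a colon.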


2

Something like this should do the job: ([{,]\s*)([^"':]+)(\s*:)

Replace with: \1"\2"\3

Example: https://regex101.com/r/oV0udR/1
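In Python, the pattern and replacement above might be applied as follows. This is a sketch: quote_keys is a hypothetical helper name, and the two inputs are shortened versions of the question's Examples 1 and 4:

```python
import json
import re

# Pattern from this answer: a "{" or "," (plus optional whitespace),
# then the key, then the colon.
KEY_PATTERN = re.compile(r"""([{,]\s*)([^"':]+)(\s*:)""")

def quote_keys(text):
    """Wrap every unquoted key in double quotes."""
    return KEY_PATTERN.sub(r'\1"\2"\3', text)

# Shortened versions of Example 1 (compact) and Example 4 (multi-line).
compact = '[{storeName:"testName",storeImage:"http://www.testLink.com",lat:99.9999,lng:-99.9999}]'
multiline = '''[
{
    storeName: "testName", 
    storeImage: "http://www.testLink.com", 
    lat: 99.9999, lng: -99.9999
}]'''

for raw in (compact, multiline):
    print(json.loads(quote_keys(raw)))
```

Because the key must directly follow a { or a comma, the colon inside "http://..." is not touched. The pattern is not a general parser, though: a string value that contains a comma followed later by a colon, or an array value such as [1, 2] followed by another key, can still be mis-quoted.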

1 Comment

Thank you for the link example. This is by far the most helpful resource I've come across with respect to Regex.
0

Regex: (\w+)\s?:\s?("?[^",]+"?,?)

import re

text = 'storeName: "testName", '
text = re.sub(r'(\w+)\s?:\s?("?[^",]+"?,?)', r'"\g<1>":\g<2>', text)
print(text)

Output: "storeName":"testName",

2 Comments

Rewriting this: is using r'"\1":\2' the same as using '\"\g<1>\":\g<2>' for the replacement?
It fails when the data contains a date whose time part has a : in it, e.g. timestamp_time":\"2020-06-08 22:40:00.000000 UTC. This regex converts it to timestamp_time":\"2020-06-08 "22":40:00.000000 UTC
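One way to avoid touching colons inside values, as in the timestamp case reported above, is to let the regex consume each double-quoted string as a whole and only quote bare identifiers that sit outside strings. The following is a sketch, not a full JavaScript object-literal parser: TOKEN and quote_bare_keys are hypothetical names, and it assumes values use double quotes only (no single-quoted strings or comments).

```python
import json
import re

TOKEN = re.compile(r'''
    "(?:\\.|[^"\\])*"                      # a complete double-quoted string
  | (?<![\w$])([A-Za-z_$][\w$]*)(?=\s*:)   # a bare identifier followed by a colon
''', re.VERBOSE)

def quote_bare_keys(text):
    """Quote unquoted keys; leave double-quoted strings unchanged."""
    def _fix(match):
        key = match.group(1)
        # group(1) is None when the quoted-string branch matched.
        return f'"{key}"' if key else match.group(0)
    return TOKEN.sub(_fix, text)

sample = ('{timestamp_time: "2020-06-08 22:40:00.000000 UTC", '
          'storeImage: "http://www.testLink.com", '
          'note: "lat: 1, lng: 2", lat: 99.9999}')

print(json.loads(quote_bare_keys(sample)))
```

Since the string branch is tried first at every position, text such as "lat: 1, lng: 2" is skipped in one piece and never inspected for keys.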
